A Minimalist R Cheatsheet for NGS Biology

While teaching R to biologists, a common complaint I hear is that “there are too many functions”. Therefore, I take a minimalist approach and do not introduce new functions to students unless necessary. Reusing the existing functions has two benefits: (i) it keeps the brain free from too many function names, and (ii) it gives students more practice with the functions they already know.

Here is a minimalist cheatsheet for NGS data analysis. It will make the most sense if you join our remotely taught bioinformatics classes. I am actively working on this file, with the goal of removing material where possible. For statistical analysis and plots, it includes only the simplest functions; I plan to post two separate minimalist cheatsheets on those topics later.

1. Installing Packages

R packages can be installed in (at least) three ways.

| Package Source | Method |
| --- | --- |
| CRAN | install.packages("tidyverse") |
| Bioconductor | install.packages("BiocManager"); BiocManager::install("Biostrings") |
| GitHub | install.packages("devtools"); library(devtools); install_github("homologus/rnaseq.work") |

Important - Answer “n” to the Following Question

Bioconductor installations often ask this question. Answering “n” will help you maintain your sanity.

Answer “n” here

Load Installed Packages

Use the “library” command to load an installed package, e.g. library(tidyverse). You install a package once, but you must run “library” every time you start a new R session.


2. R as a Calculator

R as an Ordinary Calculator

| Name | Action | Example |
| --- | --- | --- |
| “+” | Add | 5+2 |
| “-” | Subtract | 5-2 |
| “*” | Multiply | 5*2 |
| “/” | Divide | 5/2 |
| “^” | Power | 5^2 |
| “%/%” | Quotient in integer division | 5%/%2 |
| “%%” | Remainder in integer division | 5%%2 |
| “pi” | Constant “pi” | 2*pi |
| “exp(1)” | Constant “e” | exp(pi) |
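
The two integer-division operators are easy to mix up, so here is a small sketch with made-up numbers (17 and 5 are my own choice). ‘%/%’ gives the whole-number quotient and ‘%%’ gives the remainder.

```r
# Integer division splits a number into a whole quotient and a remainder.
a = 17
b = 5

quotient  = a %/% b   # how many whole times b fits into a
remainder = a %% b    # what is left over

quotient    # 3
remainder   # 2

# The two results always recombine into the original number.
quotient * b + remainder == a   # TRUE
```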

R as a Scientific Calculator

Now that you have R installed, let us use the software for data analysis. This section covers simple mathematical operations so that R can replace your scientific calculator.

| Name | Action | Example |
| --- | --- | --- |
| abs | Absolute value | x=-7; abs(x) |
| sqrt | Square root of a number | sqrt(2) |
| exp | Exponential function | exp(2.7) |
| log | Natural logarithm | log(2.7) |
| log10 | Logarithm with base 10 | log10(2.7) |

3. Piping

The piping operator ‘%>%’ comes from “tidyverse”. Make sure you load the package first.

| Name | Action | Example |
| --- | --- | --- |
| %>% | Passes the value on its left to the function on its right | 2 %>% sqrt %>% log |

The following two commands are equivalent.

sqrt(2)
## [1] 1.414214
2 %>% sqrt
## [1] 1.414214

Multistep piping -

22 %>% exp %>% log
## [1] 22

You can read it as ‘take 22 and do exponential and do log’. These two steps, one after another, should give you back the original number.
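
To see that a pipe is only another way of writing nested function calls, here is a minimal sketch. It loads ‘magrittr’, the package that supplies ‘%>%’ (it is also loaded by tidyverse); the variable names are mine.

```r
library(magrittr)   # provides %>%; also loaded by library(tidyverse)

nested = log(exp(22))          # nested call, read inside-out
piped  = 22 %>% exp %>% log    # same steps, read left to right

all.equal(nested, piped)       # TRUE

# Extra arguments go inside the parentheses; the piped value
# is supplied as the first argument.
8 %>% log(base = 2)            # same as log(8, base = 2)
```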

4. Vectors

The R programming language is built on top of vectors. Here we show five ways to create them.

Different methods for creating vectors

| Method | Description |
| --- | --- |
| Function ‘c’ | Vector with given values |
| Function ‘:’ | Vector with a range of numbers |
| Function ‘seq’ | Vector with equal spacing |
| Function ‘rep’ | Vector with identical numbers |
| Function ‘sample’ | Random vector |

(i) The function ‘c’

Use the function ‘c’ to create an arbitrary vector. After a vector is created, it can be accessed entirely or by position. Remember that the index of the first position is 1, not zero as in many other programming languages (C, Java, Python).
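
A minimal sketch of creating a vector and reading it by position. The vector ‘x’ and its values are made up for illustration.

```r
# Hypothetical vector, used only for illustration
x = c(11, 22, 33, 44, 55)

x            # the whole vector
x[1]         # first element: 11 (indexing starts at 1)
x[2]         # second element: 22
x[c(1, 3)]   # elements 1 and 3: 11 33
x[2:4]       # elements 2 through 4: 22 33 44
length(x)    # number of elements: 5
```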

## [1] 22


c("John", "Juan", "Jason")
## [1] "John"  "Juan"  "Jason"

Logical Vector
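
A logical vector holds only TRUE and FALSE values. The sketch below uses made-up numbers to show the two usual ways to get one: typing it with ‘c’, or comparing a numeric vector with a cutoff.

```r
# A logical vector typed by hand
flags = c(TRUE, FALSE, TRUE)

# More often, a logical vector is the result of a comparison
x = c(3, 8, 1, 9)   # hypothetical values
big = x > 5          # FALSE TRUE FALSE TRUE

sum(big)             # number of TRUE values: 2
x[big]               # keep only the elements where 'big' is TRUE: 8 9
```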


(ii) The function ‘:’

Increasing integers -

2:10
## [1]  2  3  4  5  6  7  8  9 10

Decreasing integers -

7:3
## [1] 7 6 5 4 3

(iii) The function ‘seq’

seq(3, 9, by=2)
## [1] 3 5 7 9

(iv) The function ‘rep’

rep(5, 10)
##  [1] 5 5 5 5 5 5 5 5 5 5

(v) The function ‘sample’

sample(c("H", "T"), 10, replace=TRUE)
##  [1] "T" "T" "H" "H" "T" "T" "H" "T" "H" "H"

There are additional functions to create random vectors with binomial, Gaussian (bell curve) and other distributions.
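
As a rough sketch of those additional functions, base R has ‘rnorm’ (Gaussian), ‘rbinom’ (binomial) and ‘runif’ (uniform). The sizes and parameters below are arbitrary choices of mine, and ‘set.seed’ is there only to make the random numbers repeatable.

```r
set.seed(1)   # fix the random seed so the results are reproducible

# 10 fair coin tosses
tosses = sample(c("H", "T"), 10, replace = TRUE)

# 5 draws from a Gaussian with mean 0 and standard deviation 1
g = rnorm(5, mean = 0, sd = 1)

# 5 draws from a binomial: number of heads in 20 fair tosses
b = rbinom(5, size = 20, prob = 0.5)

# 5 draws from a uniform distribution between 0 and 1
u = runif(5)
```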

Concatenating Vectors

You can also combine vectors generated by the above methods. The function ‘c’ merges vectors of different types.


c(1:10, c(1, 2, 7, 11))
##  [1]  1  2  3  4  5  6  7  8  9 10  1  2  7 11
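
One thing to keep in mind when merging: a vector can hold only one type of data, so ‘c’ converts everything to a common type. A small sketch with made-up values:

```r
nums  = 1:3
words = c("a", "b")

# Numbers and text together: everything becomes text
mixed = c(nums, words)
mixed          # "1" "2" "3" "a" "b"
class(mixed)   # "character"

# Logical values become numbers when combined with numbers
c(TRUE, FALSE, 5)   # 1 0 5
```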

5. Functions Operating on Vectors

A number of functions do not exist on scientific calculators, because they apply only to vectors. The sum of the elements of a vector is a good example.

| Name | Action |
| --- | --- |
| head | First few elements of a vector |
| tail | Last few elements of a vector |
| sum | Sum of elements of a numeric vector |
| mean | Mean of elements of a numeric vector |
| median | Median |
| sd | Standard deviation |
| var | Variance |
| summary | Summary statistics |
| table | Counts how often each value occurs in a vector |

vec = c(1, 22,33,44,54,1,2,97, 22)
vec %>% head
## [1]  1 22 33 44 54  1
vec %>% head(1)
## [1] 1
vec %>% tail(3)
## [1]  2 97 22
vec %>% sum
## [1] 276
vec %>% mean
## [1] 30.66667
vec %>% median
## [1] 22
vec %>% sd
## [1] 31.34486
vec %>% var
## [1] 982.5
vec %>% summary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    2.00   22.00   30.67   44.00   97.00
vec %>% table
## .
##  1  2 22 33 44 54 97 
##  2  1  2  1  1  1  1

6. Data Frames (Spreadsheets)

The following functions are described in this section -

| Name | Action | Example |
| --- | --- | --- |
| data.frame | Combines vectors into a data frame | df=data.frame(name,weight,age) |
| colnames | Names of columns of data frame | colnames(df) |
| rownames | Names of rows of data frame | rownames(df) |
| dim | Dimension of data frame | dim(df) |
| head | First few lines of data frame | head(df) |
| tail | Last few lines of data frame | tail(df) |
| data | Check which in-built data sets are available | data() |
| data | Load an in-built data set | data("darwin") |

Spreadsheets are called ‘data frames’ in R. A data frame in R holds a table of data, where the data in different columns can be of different types. A data frame is not the same as a matrix in mathematics.

Usually you read your data from Excel or csv files into data frames. Here we manually create one to show the commands.

Each column of an R data frame is a vector. So, a data frame can be created by combining a group of equal-sized vectors with the command ‘data.frame’.

name=c("Alex", "Ada", "Chen", "Kim", "John")
weight=c(80.2, 70.1, 92.3, 77.7, 68.2)
age=c(20, 21, 20, 17, 19)
df=data.frame(name, weight, age)
df
##   name weight age
## 1 Alex   80.2  20
## 2  Ada   70.1  21
## 3 Chen   92.3  20
## 4  Kim   77.7  17
## 5 John   68.2  19

The size of the data frame can be obtained with the ‘dim()’ command. The function ‘colnames()’ gives the names of the columns, and the function ‘head()’ provides a snapshot of the first few rows.

df %>% dim
## [1] 5 3
df %>% colnames
## [1] "name"   "weight" "age"
df %>% head
##   name weight age
## 1 Alex   80.2  20
## 2  Ada   70.1  21
## 3 Chen   92.3  20
## 4  Kim   77.7  17
## 5 John   68.2  19

It is possible to access the individual columns of a data frame in a number of ways.

df$weight
## [1] 80.2 70.1 92.3 77.7 68.2
df["weight"]
##   weight
## 1   80.2
## 2   70.1
## 3   92.3
## 4   77.7
## 5   68.2
df[["weight"]]
## [1] 80.2 70.1 92.3 77.7 68.2

A row can be accessed using the following command.

df[3, ]
##   name weight age
## 3 Chen   92.3  20

It is also possible to access multiple rows or columns.

df[2:4, ]
##   name weight age
## 2  Ada   70.1  21
## 3 Chen   92.3  20
## 4  Kim   77.7  17

7. Data Visualization

Core R comes with a number of plotting functions, but here we cover only two - hist (to draw histograms) and plot (to draw scatterplots). For more extensive plotting tasks, we recommend that readers learn and use the powerful ggplot2 package.

| Name | Action |
| --- | --- |
| hist | Draw histogram |
| plot | Draw scatterplot |

Using hist()
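
The hist() function takes one numeric vector and draws how often its values fall into each bin. Here is a minimal sketch; the data are random numbers I generate only for illustration, and the title and axis label are my own choices.

```r
set.seed(1)
# Hypothetical data: 1000 draws from a bell curve
values = rnorm(1000, mean = 50, sd = 10)

# Draw the histogram; 'breaks' suggests the number of bins
h = hist(values, breaks = 20, main = "Histogram of values", xlab = "value")

# hist() also returns the bin counts, which add up to the number of values
sum(h$counts)   # 1000
```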


Using plot()

The plot() function can be used to draw scatterplots. It takes two equal-sized vectors as input and draws each pair of corresponding elements as a point (x,y).
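
A minimal sketch of a scatterplot, with two made-up vectors. The labels and title are optional and are my own choices.

```r
# Hypothetical data: two equal-sized vectors
x = 1:10
y = x^2

# Each pair (x[i], y[i]) becomes one point on the plot
plot(x, y, xlab = "x", ylab = "x squared", main = "Scatterplot")
```

If the two vectors have different lengths, plot() stops with an error instead of drawing.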


8. Dplyr Library

This package is used to find information in data frames. Functions discussed here -

| Name | Action |
| --- | --- |
| select | selects a subset of columns |
| mutate | creates new columns based on some rule |
| arrange | sorts rows by the values in a column |
| filter | picks rows based on some rule |

Create Mock Spreadsheet with Data

Our simple data set has experimental data for ten genes from two sets of experiments on heart, kidney and brain. The columns in the spreadsheet are gene-ID, annotation, heart1, kidney1, brain1, heart2, kidney2, brain2. The spreadsheet has ten rows, each with information on one gene. Let us create a spreadsheet for that experiment with totally nonsensical data.

gene = c('gene1', 'gene2', 'gene3', 'gene4', 'gene5', 'gene6', 'gene7', 'gene8', 'gene9', 'gene10')

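annot=c('transcription', 'metabolism', 'translation', 'transcription', 'cell-cycle', 'cell-cycle', 'translation', 'receptor', 'transcription', 'metabolism')

heart1=c(10, 3, 4, 5, 8, 9, 1, 2, 4, 5)
kidney1=c(3, 4, 5, 8, 9, 1, 2, 4, 5, 10)
brain1=c(4, 5, 8, 9, 1, 2, 4, 5, 3, 2)
heart2=c(2, 5, 1, 9, 1, 2, 4, 5, 1, 12)
kidney2=c(10, 3, 4, 5, 1, 2, 4, 5, 4, 9)
brain2=c(8, 2, 7, 2, 1, 2, 4, 5, 3, 2)

expt=data.frame(gene, annot, heart1,kidney1,brain1,heart2, kidney2, brain2)
expt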
##      gene         annot heart1 kidney1 brain1 heart2 kidney2 brain2
## 1   gene1 transcription     10       3      4      2      10      8
## 2   gene2    metabolism      3       4      5      5       3      2
## 3   gene3   translation      4       5      8      1       4      7
## 4   gene4 transcription      5       8      9      9       5      2
## 5   gene5    cell-cycle      8       9      1      1       1      1
## 6   gene6    cell-cycle      9       1      2      2       2      2
## 7   gene7   translation      1       2      4      4       4      4
## 8   gene8      receptor      2       4      5      5       5      5
## 9   gene9 transcription      4       5      3      1       4      3
## 10 gene10    metabolism      5      10      2     12       9      2

Manipulating Columns (‘select’ and ‘mutate’)

You will learn two functions here - select() and mutate().

  1. select() creates a second spreadsheet, where columns of the original are removed or rearranged.

  2. mutate() creates a new spreadsheet, where an extra column is added by combining existing columns.

Task 1. Create a New Spreadsheet by Rearranging Columns

expt %>% select(gene, brain1, brain2, heart1, heart2, kidney1, kidney2)
##      gene brain1 brain2 heart1 heart2 kidney1 kidney2
## 1   gene1      4      8     10      2       3      10
## 2   gene2      5      2      3      5       4       3
## 3   gene3      8      7      4      1       5       4
## 4   gene4      9      2      5      9       8       5
## 5   gene5      1      1      8      1       9       1
## 6   gene6      2      2      9      2       1       2
## 7   gene7      4      4      1      4       2       4
## 8   gene8      5      5      2      5       4       5
## 9   gene9      3      3      4      1       5       4
## 10 gene10      2      2      5     12      10       9

Note that the above command displays the new spreadsheet on the screen, but does not store it. To save it, assign it to a new variable.

expt2=expt %>% select(gene, brain1, brain2, heart1, heart2, kidney1, kidney2)

Task 2. Create a New Spreadsheet by Choosing a Subset of Columns

expt %>% select(gene, brain1, heart1, kidney1)
##      gene brain1 heart1 kidney1
## 1   gene1      4     10       3
## 2   gene2      5      3       4
## 3   gene3      8      4       5
## 4   gene4      9      5       8
## 5   gene5      1      8       9
## 6   gene6      2      9       1
## 7   gene7      4      1       2
## 8   gene8      5      2       4
## 9   gene9      3      4       5
## 10 gene10      2      5      10
expt %>% select(gene, brain2, heart2, kidney2)
##      gene brain2 heart2 kidney2
## 1   gene1      8      2      10
## 2   gene2      2      5       3
## 3   gene3      7      1       4
## 4   gene4      2      9       5
## 5   gene5      1      1       1
## 6   gene6      2      2       2
## 7   gene7      4      4       4
## 8   gene8      5      5       5
## 9   gene9      3      1       4
## 10 gene10      2     12       9

Task 3. Choose Columns with ‘starts_with’ and ‘ends_with’

expt %>% select(gene, starts_with("heart"))
##      gene heart1 heart2
## 1   gene1     10      2
## 2   gene2      3      5
## 3   gene3      4      1
## 4   gene4      5      9
## 5   gene5      8      1
## 6   gene6      9      2
## 7   gene7      1      4
## 8   gene8      2      5
## 9   gene9      4      1
## 10 gene10      5     12
expt %>% select(gene, ends_with("2"))
##      gene heart2 kidney2 brain2
## 1   gene1      2      10      8
## 2   gene2      5       3      2
## 3   gene3      1       4      7
## 4   gene4      9       5      2
## 5   gene5      1       1      1
## 6   gene6      2       2      2
## 7   gene7      4       4      4
## 8   gene8      5       5      5
## 9   gene9      1       4      3
## 10 gene10     12       9      2

Task 4. Replace a Column by its Half

expt %>% mutate(brain1=brain1/2)
##      gene         annot heart1 kidney1 brain1 heart2 kidney2 brain2
## 1   gene1 transcription     10       3    2.0      2      10      8
## 2   gene2    metabolism      3       4    2.5      5       3      2
## 3   gene3   translation      4       5    4.0      1       4      7
## 4   gene4 transcription      5       8    4.5      9       5      2
## 5   gene5    cell-cycle      8       9    0.5      1       1      1
## 6   gene6    cell-cycle      9       1    1.0      2       2      2
## 7   gene7   translation      1       2    2.0      4       4      4
## 8   gene8      receptor      2       4    2.5      5       5      5
## 9   gene9 transcription      4       5    1.5      1       4      3
## 10 gene10    metabolism      5      10    1.0     12       9      2

Task 5. Create New Columns by Combining Multiple Existing Columns

expt %>% mutate(sum1=brain1+kidney1+heart1, sum2=brain2+kidney2+heart2)
##      gene         annot heart1 kidney1 brain1 heart2 kidney2 brain2 sum1
## 1   gene1 transcription     10       3      4      2      10      8   17
## 2   gene2    metabolism      3       4      5      5       3      2   12
## 3   gene3   translation      4       5      8      1       4      7   17
## 4   gene4 transcription      5       8      9      9       5      2   22
## 5   gene5    cell-cycle      8       9      1      1       1      1   18
## 6   gene6    cell-cycle      9       1      2      2       2      2   12
## 7   gene7   translation      1       2      4      4       4      4    7
## 8   gene8      receptor      2       4      5      5       5      5   11
## 9   gene9 transcription      4       5      3      1       4      3   12
## 10 gene10    metabolism      5      10      2     12       9      2   17
##    sum2
## 1    20
## 2    10
## 3    12
## 4    16
## 5     3
## 6     6
## 7    12
## 8    15
## 9     8
## 10   23

Sort Rows by a Column (‘arrange’)

Increasing

expt %>% arrange(brain1)
##      gene         annot heart1 kidney1 brain1 heart2 kidney2 brain2
## 1   gene5    cell-cycle      8       9      1      1       1      1
## 2   gene6    cell-cycle      9       1      2      2       2      2
## 3  gene10    metabolism      5      10      2     12       9      2
## 4   gene9 transcription      4       5      3      1       4      3
## 5   gene1 transcription     10       3      4      2      10      8
## 6   gene7   translation      1       2      4      4       4      4
## 7   gene2    metabolism      3       4      5      5       3      2
## 8   gene8      receptor      2       4      5      5       5      5
## 9   gene3   translation      4       5      8      1       4      7
## 10  gene4 transcription      5       8      9      9       5      2

Decreasing

expt %>% arrange(-brain1)
##      gene         annot heart1 kidney1 brain1 heart2 kidney2 brain2
## 1   gene4 transcription      5       8      9      9       5      2
## 2   gene3   translation      4       5      8      1       4      7
## 3   gene2    metabolism      3       4      5      5       3      2
## 4   gene8      receptor      2       4      5      5       5      5
## 5   gene1 transcription     10       3      4      2      10      8
## 6   gene7   translation      1       2      4      4       4      4
## 7   gene9 transcription      4       5      3      1       4      3
## 8   gene6    cell-cycle      9       1      2      2       2      2
## 9  gene10    metabolism      5      10      2     12       9      2
## 10  gene5    cell-cycle      8       9      1      1       1      1
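
The minus sign works only for columns of numbers. As a hedged sketch on a small made-up table (not the ‘expt’ data above), dplyr's ‘desc’ does the same job for any column type, and ‘arrange’ accepts several columns to break ties.

```r
library(dplyr)

# Small hypothetical table
d = data.frame(gene  = c("g1", "g2", "g3", "g4"),
               annot = c("b", "a", "b", "a"),
               level = c(4, 9, 7, 2))

# desc() sorts in decreasing order and also works on text columns
by_level = d %>% arrange(desc(level))
by_level$gene   # g2 g3 g1 g4

# Several columns: sort by annot first, then by decreasing level within each annot
by_both = d %>% arrange(annot, desc(level))
by_both$gene    # g2 g4 g3 g1
```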

Choose Subset of Rows (‘filter’)

Task 1. Extract Rows above a Cutoff

expt %>% filter(brain1>5)
##    gene         annot heart1 kidney1 brain1 heart2 kidney2 brain2
## 1 gene3   translation      4       5      8      1       4      7
## 2 gene4 transcription      5       8      9      9       5      2

Task 2. Extract Rows with Certain Annotation

expt %>% filter(annot=="cell-cycle")
##    gene      annot heart1 kidney1 brain1 heart2 kidney2 brain2
## 1 gene5 cell-cycle      8       9      1      1       1      1
## 2 gene6 cell-cycle      9       1      2      2       2      2

Task 3. Extract Rows Based on a Cutoff Involving Multiple Columns

expt %>% filter(brain1+brain2>4)
##    gene         annot heart1 kidney1 brain1 heart2 kidney2 brain2
## 1 gene1 transcription     10       3      4      2      10      8
## 2 gene2    metabolism      3       4      5      5       3      2
## 3 gene3   translation      4       5      8      1       4      7
## 4 gene4 transcription      5       8      9      9       5      2
## 5 gene7   translation      1       2      4      4       4      4
## 6 gene8      receptor      2       4      5      5       5      5
## 7 gene9 transcription      4       5      3      1       4      3

Combine Multiple Tasks using ‘%>%’

Example 1

expt %>% filter(brain1>kidney1/2) %>% filter(brain2>kidney2/2) %>% select(gene,annot)
##    gene         annot
## 1 gene1 transcription
## 2 gene2    metabolism
## 3 gene3   translation
## 4 gene6    cell-cycle
## 5 gene7   translation
## 6 gene8      receptor
## 7 gene9 transcription

Example 2

expt %>% filter(brain1>kidney1/2) %>% filter(brain2>kidney2/2) %>% filter(annot=="transcription") %>% select(gene)
##    gene
## 1 gene1
## 2 gene9

Joining Data from Tables

Earlier we created the data frame ‘expt’ by combining gene, annot and experimental data from six samples (three tissues, two experiments each). The data set had 10 genes, their annotations and their expression levels.

gene = c('gene1', 'gene2', 'gene3', 'gene4', 'gene5', 'gene6', 'gene7', 'gene8', 'gene9', 'gene10')
annot=c('transcription', 'metabolism', 'translation', 'transcription', 'cell-cycle', 'cell-cycle', 'translation', 'receptor', 'transcription', 'metabolism')

expt=data.frame(gene, annot, heart1,kidney1,brain1,heart2, kidney2, brain2)

Let’s say someone sends you another Excel spreadsheet with data from liver. What will you do? We will first create that new data frame with more mock data.

gene = c('gene1', 'gene2', 'gene3', 'gene4', 'gene5', 'gene6', 'gene7', 'gene8', 'gene9', 'gene10')
liver = c(7, 10, 1, 4, 5, 7, 8, 9, 1, 11)
expt2 = data.frame(gene,liver)

We want a combined spreadsheet from expt and expt2. The dplyr package has a function named ‘inner_join’ to accomplish that task.

expt %>% inner_join(expt2)
## Joining, by = "gene"
##      gene         annot heart1 kidney1 brain1 heart2 kidney2 brain2 liver
## 1   gene1 transcription     10       3      4      2      10      8     7
## 2   gene2    metabolism      3       4      5      5       3      2    10
## 3   gene3   translation      4       5      8      1       4      7     1
## 4   gene4 transcription      5       8      9      9       5      2     4
## 5   gene5    cell-cycle      8       9      1      1       1      1     5
## 6   gene6    cell-cycle      9       1      2      2       2      2     7
## 7   gene7   translation      1       2      4      4       4      4     8
## 8   gene8      receptor      2       4      5      5       5      5     9
## 9   gene9 transcription      4       5      3      1       4      3     1
## 10 gene10    metabolism      5      10      2     12       9      2    11
combined = expt %>% inner_join(expt2)
## Joining, by = "gene"

The first command displays the combined data frame on the screen, whereas the second command saves it in the variable named ‘combined’. You can then use ‘select’, ‘filter’ etc on the combined data frame to find patterns.

Please note -

  1. ‘inner_join’ matches rows by the shared ‘gene’ column, so the two data frames may list their rows in different orders.

  2. If a gene is missing from either data frame, ‘inner_join’ drops that gene from the combined data frame.

  3. If you would like to keep information on the missing genes, dplyr has a number of other functions. For example, ‘full_join’ keeps all genes and fills the missing places with ‘NA’. See the dplyr documentation on joins for other options.
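
Here is a small sketch of how the joins differ when genes are missing. The two tables ‘a’ and ‘b’ are made up for illustration; ‘left_join’ is one of the other dplyr options.

```r
library(dplyr)

# Hypothetical tables: gene3 is missing from 'b', gene4 is missing from 'a'
a = data.frame(gene = c("gene1", "gene2", "gene3"), heart = c(10, 3, 4))
b = data.frame(gene = c("gene4", "gene2", "gene1"), liver = c(4, 10, 7))

inner = inner_join(a, b, by = "gene")  # only gene1 and gene2
full  = full_join(a, b, by = "gene")   # all four genes, NA where data is missing
left  = left_join(a, b, by = "gene")   # every gene in 'a', NA liver for gene3

nrow(inner)  # 2
nrow(full)   # 4
nrow(left)   # 3
```

Writing by = "gene" names the matching column explicitly, which also silences the ‘Joining, by’ message.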

9. Reading Data from External Files

Functions discussed here -

Name Action
read_excel Read excel file
read_csv Read csv file
read_tsv Read tsv file
write_csv Write in csv format
getwd Get current working directory
setwd Change working directory

Now that we have mastered simple calculations in R, let us take a step forward and start analyzing real data. The first step is to import data from external files, typically stored in text (csv) or Excel (xls) format, into a tibble.

Find Current Working Directory

R works in a directory (folder) of your operating system, so when you save a file, it gets saved in that folder. To see where R will save your files, try ‘getwd()’. You can change the working directory with ‘setwd()’.

getwd()
## [1] "C:/Users/Student/Documents"
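
A minimal sketch of changing the working directory and switching back. I use R's temporary folder (‘tempdir()’) as the target only because it exists on every computer; in practice you would give the path of your own project folder.

```r
old = getwd()        # remember the current working directory

setwd(tempdir())     # switch to R's temporary folder (any existing folder works)
getwd()              # now prints the temporary folder

setwd(old)           # switch back
getwd() == old       # TRUE
```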


Reading Excel Files

R provides at least two ways to read data from an Excel file. The first method uses the external library ‘readxl’. To use functions from the library, load it in R with the command ‘library(readxl)’.

library(readxl)

After that, the function ‘read_excel’ reads sheet 1 of the Excel file into a data frame (e.g. ‘mydata’).

mydata <- read_excel("/full/path/to/excel/file/file.xls", sheet=1)

Reading csv Files

The other approach is to use Excel's ‘Save As’ function to save the spreadsheet in csv format. The csv file can then be read directly into R with the ‘read_csv’ function, which is part of tidyverse and does not require any additional library.

mydata <- read_csv("/full/path/to/csv/file/file.csv")

Writing Data Frame or Tibble in a File

The command ‘write_csv’ will save a data frame or tibble in a file in csv format.

mydata %>% write_csv("file.csv")
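
To check that writing and reading work together, here is a small round-trip sketch. The data frame is made up, and the file goes into R's temporary folder so that nothing in your working directory is overwritten. The functions come from ‘readr’, which is loaded by tidyverse.

```r
library(readr)

# Hypothetical data frame
mydata = data.frame(gene = c("gene1", "gene2"), liver = c(7, 10))

path = file.path(tempdir(), "file.csv")  # temporary location for the example
write_csv(mydata, path)

# col_types = cols() lets readr guess the column types without printing a message
back = read_csv(path, col_types = cols())
back$gene    # "gene1" "gene2"
back$liver   # 7 10
```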

10. Bioconductor for DNA Data
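
The commands in this section need the Bioconductor package ‘Biostrings’ and operate on an object named ‘dna’, whose definition is not shown. The sketch below assumes a 12-letter sequence that I picked so that it agrees with the printed results for ‘translate’ and ‘subseq’; the original sequence may have been different.

```r
library(Biostrings)

# Assumed sequence, chosen only to agree with the printed results in this section
dna = DNAString("TGCTAATCGTCA")

length(dna)                           # 12
as.character(reverseComplement(dna))  # "TGACGATTAGCA"
as.character(translate(dna))          # "C*SS"
as.character(subseq(dna, 3, 7))       # "CTAAT"
```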

dna %>% reverseComplement
##   12-letter "DNAString" instance
dna %>% translate
##   4-letter "AAString" instance
## seq: C*SS
dna %>% subseq(3,7)
##   5-letter "DNAString" instance
## seq: CTAAT
## Global PairwiseAlignmentsSingleSubject (1 of 1)
## score: 13.79057

Reading FASTA Files

# Read the E. coli K-12 MG1655 genome from a FASTA file
genome = readDNAStringSet("C:/Users/Manoj/Desktop/R/web-tables/Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna.toplevel.fa")
# Extract positions 2038152 to 2038823, take the reverse complement and translate it
genome %>% subseq(2038152,2038823) %>% reverseComplement %>% translate %>% toString

Extracting Sequence

genome <- DNAStringSet( list(chr1=DNAString("AATGGTCCGTG"), chr2=DNAString("TGGGTGGGTGG")) )
gr <- GRanges("chr1", IRanges(start=3,end=5), strand="+")
BSgenome::getSeq(genome, gr)
gr <- GRanges(c("chr1","chr2"), IRanges(start=c(1,3),end=c(4,5)), strand=c("+","-"))
BSgenome::getSeq(genome, gr)

Genomic Segment

ir = IRanges(start=c(1,100), width=c(10,10))
#genome <- readDNAStringSet("C://Users/Manoj/Desktop/R/web-tables/e")
#annot <- rtracklayer::import("C://Users/Manoj/Desktop/R/web-tables/Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.39.chromosome.Chromosome.gff3")
#seq <- BSgenome::getSeq(genome, annot[100,])

11. Changing Data Format to Switch Between Bioconductor and Tidyverse

gene = c('gene1', 'gene2', 'gene3', 'gene4', 'gene5', 'gene6', 'gene7', 'gene8', 'gene9', 'gene10')

expt=data.frame(gene, heart1,kidney1,brain1,heart2, kidney2, brain2)

Going from Tidyverse Style to Bioconductor Style

Tidyverse likes the above style, whereas Bioconductor wants the gene names as the names of the rows.

expt_bioc=expt %>% column_to_rownames("gene") %>% as.matrix
expt_bioc
##        heart1 kidney1 brain1 heart2 kidney2 brain2
## gene1      10       3      4      2      10      8
## gene2       3       4      5      5       3      2
## gene3       4       5      8      1       4      7
## gene4       5       8      9      9       5      2
## gene5       8       9      1      1       1      1
## gene6       9       1      2      2       2      2
## gene7       1       2      4      4       4      4
## gene8       2       4      5      5       5      5
## gene9       4       5      3      1       4      3
## gene10      5      10      2     12       9      2

Going back from Bioconductor Style to Tidyverse Style

If you stick to the Bioconductor format and apply tidyverse functions, your gene names will disappear. Therefore, you need to turn them into a column first.

expt=expt_bioc %>% as.data.frame %>% rownames_to_column("gene")

Adding Row Number as Another Column

There are times you may also want to add the row number as a column. That task is simple, because you can apply “rownames_to_column” again.

expt=expt %>% rownames_to_column("id")
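
The whole round trip can be sketched on a tiny made-up table. The functions ‘column_to_rownames’ and ‘rownames_to_column’ come from the ‘tibble’ package, which is loaded by tidyverse; the table ‘tidy’ and its values are my own.

```r
library(dplyr)
library(tibble)   # provides column_to_rownames() and rownames_to_column()

# Hypothetical two-gene table
tidy = data.frame(gene = c("gene1", "gene2"), heart1 = c(10, 3), brain1 = c(4, 5))

# Tidyverse style -> Bioconductor style: gene names become row names
bioc = tidy %>% column_to_rownames("gene") %>% as.matrix
rownames(bioc)   # "gene1" "gene2"

# Bioconductor style -> tidyverse style: row names become a column again
back = bioc %>% as.data.frame %>% rownames_to_column("gene")
colnames(back)   # "gene" "heart1" "brain1"

# Row numbers as an extra column named 'id'
with_id = back %>% rownames_to_column("id")
colnames(with_id)   # "id" "gene" "heart1" "brain1"
```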