When I teach R to biologists, a common complaint I hear is that “there are too many functions”. Therefore, I try to take a minimalist approach and not introduce students to new functions unless necessary. Reusing existing functions has two benefits: (i) it keeps the brain free from too many function names, and (ii) it gives students more practice with the ones they already know.
Here is a minimalist cheatsheet for NGS data analysis. It will make the most sense if you join our remotely taught bioinformatics classes. I am actively working on the file with the goal of removing material where possible. Also, for statistical analysis and plots, this file includes only the simple functions. I plan to post two separate minimalist cheatsheets on those topics later.
R packages can be installed in (at least) three ways.
Package Source | Method |
---|---|
CRAN | install.packages("tidyverse") |
Bioconductor | install.packages("BiocManager"); BiocManager::install("Biostrings") |
GitHub | install.packages("devtools"); library(devtools); install_github("homologus/rnaseq.work") |
Bioconductor installations often ask whether to update previously installed packages (“Update all/some/none? [a/s/n]”). Answering “n” will help you maintain your sanity.
Use the “library” command to load an installed package. You install a package once, but run “library” every time you open a new R window.
library("tidyverse")
Name | Action | Example |
---|---|---|
“+” | Add | 5+2 |
“-” | Subtract | 5-2 |
“*” | Multiply | 5*2 |
“/” | Divide | 5/2 |
“^” | Power | 5^2 |
“%%” | Remainder in integer division | 5%%2 |
“%/%” | Quotient in integer division | 5%/%2 |
“pi” | Constant “pi” | 2*pi |
“exp(1)” | Constant “e” | exp(pi) |
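The two integer-division operators are easy to mix up, so here is a small check. The numbers 17 and 5 are arbitrary values chosen for illustration.

```r
# Integer division of 17 by 5: 17 = 3 * 5 + 2
17 %/% 5   # integer quotient
## [1] 3
17 %% 5    # remainder
## [1] 2
# Quotient and remainder always recombine into the original number
(17 %/% 5) * 5 + (17 %% 5)
## [1] 17
```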
Now that you have R installed, let us use the software for data analysis. This section covers simple mathematical operations so that R can replace your scientific calculator.
Name | Action | Example |
---|---|---|
abs | Absolute value | x=-7; abs(x) |
sqrt | Square root of a number | sqrt(2) |
exp | Exponential function | exp(2.7) |
log | Natural logarithm | log(2.7) |
log10 | Logarithm with base 10 | log10(2.7) |
The piping operator comes from “tidyverse”. Make sure you load the package.
Name | Action | Example |
---|---|---|
%>% | Passes the value on its left as the first argument of the function on its right | 2 %>% sqrt %>% log |
The following two commands are equivalent.
sqrt(2)
## [1] 1.414214
2 %>% sqrt
## [1] 1.414214
Multistep piping -
22 %>% exp %>% log
## [1] 22
You can read it as ‘take 22 and do exponential and do log’. These two steps, one after another, should give you back the original number.
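Piping also works when the function needs more than one argument. The sketch below (with numbers picked only for illustration) shows that the piped value fills the first argument, while the remaining arguments go inside the parentheses as usual.

```r
library(tidyverse)  # provides %>%

# The value on the left becomes the first argument of the function on the right
16 %>% log(base = 2)           # same as log(16, base = 2)
## [1] 4
c(1.234, 5.678) %>% round(1)   # same as round(c(1.234, 5.678), 1)
## [1] 1.2 5.7
```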
The R programming language is built on top of vectors. Here we show five ways to create them.
Method | Description |
---|---|
Function ‘c’ | Vector with given values |
Function ‘:’ | Vector with a range of numbers |
Function ‘seq’ | Vector with equal spacing |
Function ‘rep’ | Vector with identical numbers |
Function ‘sample’ | Random vector |
Use the function ‘c’ to create an arbitrary vector. After a vector is created, it can be accessed entirely or by position. Remember that the index of the first position is 1, not zero as in many other programming languages (C, Java, Python).
x=c(22,33,44,54,1,2,97)
x[1]
## [1] 22
c("John", "Juan", "Jason")
## [1] "John" "Juan" "Jason"
c(TRUE, FALSE, TRUE)
## [1] TRUE FALSE TRUE
Increasing integers -
2:10
## [1] 2 3 4 5 6 7 8 9 10
Decreasing integers -
7:3
## [1] 7 6 5 4 3
seq(3,10,2)
## [1] 3 5 7 9
rep(5,10)
## [1] 5 5 5 5 5 5 5 5 5 5
sample(c('H','T'),10,replace=TRUE)
## [1] "T" "T" "H" "H" "T" "T" "H" "T" "H" "H"
There are additional functions to create random vectors with binomial, Gaussian (bell curve) and other distributions.
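As a minimal sketch of those additional functions, base R has ‘rnorm’ for Gaussian and ‘rbinom’ for binomial random numbers. The mean, standard deviation and seed used here are arbitrary choices for illustration.

```r
set.seed(1)  # arbitrary seed, so the same random numbers come back on every run

# Five draws from a Gaussian (bell curve) distribution
heights = rnorm(5, mean = 170, sd = 10)
# Number of heads in 10 fair coin tosses, repeated 5 times
heads = rbinom(5, size = 10, prob = 0.5)

heights
heads
```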
You can also combine vectors generated by the above methods. The function ‘c’ merges vectors created in different ways into one vector.
v1=1:10
v2=c(1,2,7,11)
c(v1,v2)
## [1] 1 2 3 4 5 6 7 8 9 10 1 2 7 11
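One thing to keep in mind is that a vector holds values of a single type. If you give ‘c’ a mix of numbers, text and logical values, R quietly converts everything to one common type, as this small illustration shows.

```r
c(1, "a", TRUE)       # with any text present, everything becomes text
## [1] "1"    "a"    "TRUE"
c(1.5, TRUE, FALSE)   # logical values become the numbers 1 and 0
## [1] 1.5 1.0 0.0
```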
A number of functions do not exist on scientific calculators, because they apply only to vectors. The sum of a vector is a good example.
Name | Action |
---|---|
head | First few elements of a vector |
tail | Last few elements of a vector |
sum | Sum of elements of a number vector |
mean | Mean of elements of a number vector |
median | Median |
sd | Standard Deviation |
var | Variance |
summary | Summary statistics |
table | Counts the elements of a vector |
vec = c(1, 22,33,44,54,1,2,97, 22)
vec %>% head
## [1] 1 22 33 44 54 1
vec %>% head(1)
## [1] 1
vec %>% tail(3)
## [1] 2 97 22
vec %>% sum
## [1] 276
vec %>% mean
## [1] 30.66667
vec %>% median
## [1] 22
vec %>% sd
## [1] 31.34486
vec %>% var
## [1] 982.5
vec %>% summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 2.00 22.00 30.67 44.00 97.00
vec %>% table
## .
## 1 2 22 33 44 54 97
## 2 1 2 1 1 1 1
The following functions are described in this section -
Name | Action | Example |
---|---|---|
data.frame | Combines vectors into a data frame | df=data.frame(age,name,weight) |
colnames | Names of columns of data frame | colnames(df) |
rownames | Names of rows of data frame | rownames(df) |
dim | Dimension of data frame | dim(df) |
head | First few lines of data frame | head(df) |
tail | Last few lines of data frame | tail(df) |
data | Check which in-built data sets are available | data() |
data | Load an in-built data set | data("darwin") |
Spreadsheets are called ‘data frames’ in R. A data frame holds a table of data, where different columns can be of different types. For that reason, a data frame is not the same as a matrix in mathematics.
Usually you read your data from Excel or csv files into data frames. Here we manually create one to show the commands.
Each column of an R data frame is a vector. So, a data frame can be created by combining a group of equal-sized vectors with the command ‘data.frame’.
age=c(20,21,20,17,19)
name=c("Alex", "Ada", "Chen", "Kim", "John")
weight=c(80.2, 70.1, 92.3, 77.7, 68.2)
df=data.frame(name,weight,age)
df
## name weight age
## 1 Alex 80.2 20
## 2 Ada 70.1 21
## 3 Chen 92.3 20
## 4 Kim 77.7 17
## 5 John 68.2 19
The size of the data frame can be obtained with the ‘dim()’ command. The function ‘colnames()’ gives the names of the columns, and the function ‘head()’ provides a snapshot.
df %>% dim
## [1] 5 3
df %>% colnames
## [1] "name" "weight" "age"
df %>% head
## name weight age
## 1 Alex 80.2 20
## 2 Ada 70.1 21
## 3 Chen 92.3 20
## 4 Kim 77.7 17
## 5 John 68.2 19
It is possible to access the individual columns of a data frame in a number of ways.
df$weight
## [1] 80.2 70.1 92.3 77.7 68.2
df[2]
## weight
## 1 80.2
## 2 70.1
## 3 92.3
## 4 77.7
## 5 68.2
df[,2]
## [1] 80.2 70.1 92.3 77.7 68.2
A row can be accessed using the following command.
df[3,]
## name weight age
## 3 Chen 92.3 20
It is also possible to access multiple rows or columns.
df[2:4,]
## name weight age
## 2 Ada 70.1 21
## 3 Chen 92.3 20
## 4 Kim 77.7 17
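Rows and columns can be picked in the same command: the part before the comma selects rows and the part after it selects columns. A short sketch using the same ‘df’ as above (recreated here so that the block runs on its own):

```r
age = c(20, 21, 20, 17, 19)
name = c("Alex", "Ada", "Chen", "Kim", "John")
weight = c(80.2, 70.1, 92.3, 77.7, 68.2)
df = data.frame(name, weight, age)

# Rows 2 to 4, and only the columns 'name' and 'age'
df[2:4, c("name", "age")]
##   name age
## 2  Ada  21
## 3 Chen  20
## 4  Kim  17
```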
Core R comes with a number of plotting functions, but here we cover only two - hist (to draw histograms) and plot (to draw scatterplots). For more extensive plotting tasks, we recommend that readers learn and use the powerful ggplot2 package.
Name | Action |
---|---|
hist | Draw histogram |
plot | Draw scatterplot |
x=c(rep(1,10),rep(2,10),rep(3,10))
hist(x)
The plot() function can be used to draw scatterplots. It takes two equal-sized vectors as input and draws each pair of corresponding elements as a point (x,y).
v1=c(1,3,8,9,12)
v2=c(3,4,5,1,2)
plot(v1,v2)
The dplyr package is used to find information in data frames. Functions discussed here -
Name | Action |
---|---|
select | selects a subset of columns |
mutate | creates new columns based on some rule |
arrange | sorts rows by the values in one or more columns |
filter | picks rows based on some rule |
Our simple data set has experimental data for ten genes from two sets of experiments on heart, kidney and brain. The columns in the spreadsheet are gene (the gene ID), annot (the annotation), heart1, kidney1, brain1, heart2, kidney2 and brain2. The spreadsheet has ten rows, each with information on one gene. Let us create a data frame for that experiment with totally nonsensical data -
gene = c('gene1', 'gene2', 'gene3', 'gene4', 'gene5', 'gene6', 'gene7', 'gene8', 'gene9', 'gene10')
annot=c('transcription', 'metabolism', 'translation', 'transcription', 'cell-cycle', 'cell-cycle', 'translation', 'receptor', 'transcription', 'metabolism')
heart1=c(10,3,4,5,8,9,1,2,4,5)
kidney1=c(3,4,5,8,9,1,2,4,5,10)
brain1=c(4,5,8,9,1,2,4,5,3,2)
heart2=c(2,5,1,9,1,2,4,5,1,12)
kidney2=c(10,3,4,5,1,2,4,5,4,9)
brain2=c(8,2,7,2,1,2,4,5,3,2)
expt=data.frame(gene, annot, heart1,kidney1,brain1,heart2, kidney2, brain2)
expt
## gene annot heart1 kidney1 brain1 heart2 kidney2 brain2
## 1 gene1 transcription 10 3 4 2 10 8
## 2 gene2 metabolism 3 4 5 5 3 2
## 3 gene3 translation 4 5 8 1 4 7
## 4 gene4 transcription 5 8 9 9 5 2
## 5 gene5 cell-cycle 8 9 1 1 1 1
## 6 gene6 cell-cycle 9 1 2 2 2 2
## 7 gene7 translation 1 2 4 4 4 4
## 8 gene8 receptor 2 4 5 5 5 5
## 9 gene9 transcription 4 5 3 1 4 3
## 10 gene10 metabolism 5 10 2 12 9 2
You will learn two functions - select() and mutate().
select() creates a second spreadsheet, where the columns of original are removed/rearranged.
mutate() creates a new spreadsheet, where an extra column is added by combining existing columns.
Task 1. Create a New Spreadsheet by Rearranging Columns
expt %>% select(gene, brain1, brain2, heart1, heart2, kidney1, kidney2)
## gene brain1 brain2 heart1 heart2 kidney1 kidney2
## 1 gene1 4 8 10 2 3 10
## 2 gene2 5 2 3 5 4 3
## 3 gene3 8 7 4 1 5 4
## 4 gene4 9 2 5 9 8 5
## 5 gene5 1 1 8 1 9 1
## 6 gene6 2 2 9 2 1 2
## 7 gene7 4 4 1 4 2 4
## 8 gene8 5 5 2 5 4 5
## 9 gene9 3 3 4 1 5 4
## 10 gene10 2 2 5 12 10 9
Note that the above command displays the new spreadsheet on the screen, but does not store it. To save it, assign the result to a new variable.
expt2=expt %>% select(gene, brain1, brain2, heart1, heart2, kidney1, kidney2)
Task 2. Create a New Spreadsheet by Choosing a Subset of Columns
expt %>% select(gene, brain1, heart1, kidney1)
## gene brain1 heart1 kidney1
## 1 gene1 4 10 3
## 2 gene2 5 3 4
## 3 gene3 8 4 5
## 4 gene4 9 5 8
## 5 gene5 1 8 9
## 6 gene6 2 9 1
## 7 gene7 4 1 2
## 8 gene8 5 2 4
## 9 gene9 3 4 5
## 10 gene10 2 5 10
expt %>% select(gene, brain2, heart2, kidney2)
## gene brain2 heart2 kidney2
## 1 gene1 8 2 10
## 2 gene2 2 5 3
## 3 gene3 7 1 4
## 4 gene4 2 9 5
## 5 gene5 1 1 1
## 6 gene6 2 2 2
## 7 gene7 4 4 4
## 8 gene8 5 5 5
## 9 gene9 3 1 4
## 10 gene10 2 12 9
Task 3. Select Columns with ‘starts_with’ and ‘ends_with’
expt %>% select(gene, starts_with("heart"))
## gene heart1 heart2
## 1 gene1 10 2
## 2 gene2 3 5
## 3 gene3 4 1
## 4 gene4 5 9
## 5 gene5 8 1
## 6 gene6 9 2
## 7 gene7 1 4
## 8 gene8 2 5
## 9 gene9 4 1
## 10 gene10 5 12
expt %>% select(gene, ends_with("2"))
## gene heart2 kidney2 brain2
## 1 gene1 2 10 8
## 2 gene2 5 3 2
## 3 gene3 1 4 7
## 4 gene4 9 5 2
## 5 gene5 1 1 1
## 6 gene6 2 2 2
## 7 gene7 4 4 4
## 8 gene8 5 5 5
## 9 gene9 1 4 3
## 10 gene10 12 9 2
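select() can also remove columns: put a minus sign in front of the column you do not want. The sketch below uses a three-gene toy table (called ‘toy’ here, made up for illustration) so that it runs on its own.

```r
library(tidyverse)

# A three-gene toy table, made up for illustration
toy = data.frame(gene = c("gene1", "gene2", "gene3"),
                 annot = c("transcription", "metabolism", "translation"),
                 heart1 = c(10, 3, 4),
                 brain1 = c(4, 5, 8))

# Drop one column, keep everything else
toy %>% select(-annot) %>% colnames
## [1] "gene"   "heart1" "brain1"
# Drop several columns
toy %>% select(-annot, -heart1) %>% colnames
## [1] "gene"   "brain1"
```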
Task 4. Replace a Column by its Half
expt %>% mutate(brain1=brain1/2)
## gene annot heart1 kidney1 brain1 heart2 kidney2 brain2
## 1 gene1 transcription 10 3 2.0 2 10 8
## 2 gene2 metabolism 3 4 2.5 5 3 2
## 3 gene3 translation 4 5 4.0 1 4 7
## 4 gene4 transcription 5 8 4.5 9 5 2
## 5 gene5 cell-cycle 8 9 0.5 1 1 1
## 6 gene6 cell-cycle 9 1 1.0 2 2 2
## 7 gene7 translation 1 2 2.0 4 4 4
## 8 gene8 receptor 2 4 2.5 5 5 5
## 9 gene9 transcription 4 5 1.5 1 4 3
## 10 gene10 metabolism 5 10 1.0 12 9 2
Task 5. Create New Columns by Combining Multiple Existing Columns
expt %>% mutate(sum1=brain1+kidney1+heart1, sum2=brain2+kidney2+heart2)
## gene annot heart1 kidney1 brain1 heart2 kidney2 brain2 sum1
## 1 gene1 transcription 10 3 4 2 10 8 17
## 2 gene2 metabolism 3 4 5 5 3 2 12
## 3 gene3 translation 4 5 8 1 4 7 17
## 4 gene4 transcription 5 8 9 9 5 2 22
## 5 gene5 cell-cycle 8 9 1 1 1 1 18
## 6 gene6 cell-cycle 9 1 2 2 2 2 12
## 7 gene7 translation 1 2 4 4 4 4 7
## 8 gene8 receptor 2 4 5 5 5 5 11
## 9 gene9 transcription 4 5 3 1 4 3 12
## 10 gene10 metabolism 5 10 2 12 9 2 17
## sum2
## 1 20
## 2 10
## 3 12
## 4 16
## 5 3
## 6 6
## 7 12
## 8 15
## 9 8
## 10 23
**Increasing**
expt %>% arrange(brain1)
## gene annot heart1 kidney1 brain1 heart2 kidney2 brain2
## 1 gene5 cell-cycle 8 9 1 1 1 1
## 2 gene6 cell-cycle 9 1 2 2 2 2
## 3 gene10 metabolism 5 10 2 12 9 2
## 4 gene9 transcription 4 5 3 1 4 3
## 5 gene1 transcription 10 3 4 2 10 8
## 6 gene7 translation 1 2 4 4 4 4
## 7 gene2 metabolism 3 4 5 5 3 2
## 8 gene8 receptor 2 4 5 5 5 5
## 9 gene3 translation 4 5 8 1 4 7
## 10 gene4 transcription 5 8 9 9 5 2
**Decreasing**
expt %>% arrange(-brain1)
## gene annot heart1 kidney1 brain1 heart2 kidney2 brain2
## 1 gene4 transcription 5 8 9 9 5 2
## 2 gene3 translation 4 5 8 1 4 7
## 3 gene2 metabolism 3 4 5 5 3 2
## 4 gene8 receptor 2 4 5 5 5 5
## 5 gene1 transcription 10 3 4 2 10 8
## 6 gene7 translation 1 2 4 4 4 4
## 7 gene9 transcription 4 5 3 1 4 3
## 8 gene6 cell-cycle 9 1 2 2 2 2
## 9 gene10 metabolism 5 10 2 12 9 2
## 10 gene5 cell-cycle 8 9 1 1 1 1
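Two variations on sorting are worth knowing. The minus sign only works on number columns, whereas dplyr's ‘desc()’ also works on text columns. You can also give arrange() more than one column, in which case ties in the first column are broken by the second. The four-gene table ‘toy’ below is made up for illustration.

```r
library(tidyverse)

# A four-gene toy table, made up for illustration
toy = data.frame(gene = c("gene1", "gene2", "gene3", "gene4"),
                 brain1 = c(4, 5, 4, 9),
                 heart1 = c(10, 3, 1, 5))

# Decreasing order with desc(); gives the same result as arrange(-brain1)
toy %>% arrange(desc(brain1))

# Sort by brain1 first; gene1 and gene3 tie at 4, so heart1 decides
# and gene3 (heart1 = 1) comes before gene1 (heart1 = 10)
toy %>% arrange(brain1, heart1)
```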
Task 1. Extract Rows above a Cutoff
expt %>% filter(brain1>5)
## gene annot heart1 kidney1 brain1 heart2 kidney2 brain2
## 1 gene3 translation 4 5 8 1 4 7
## 2 gene4 transcription 5 8 9 9 5 2
Task 2. Extract Rows with Certain Annotation
expt %>% filter(annot=="cell-cycle")
## gene annot heart1 kidney1 brain1 heart2 kidney2 brain2
## 1 gene5 cell-cycle 8 9 1 1 1 1
## 2 gene6 cell-cycle 9 1 2 2 2 2
Task 3. Extract Rows Based on a Cutoff Involving Multiple Columns
expt %>% filter(brain1+brain2>4)
## gene annot heart1 kidney1 brain1 heart2 kidney2 brain2
## 1 gene1 transcription 10 3 4 2 10 8
## 2 gene2 metabolism 3 4 5 5 3 2
## 3 gene3 translation 4 5 8 1 4 7
## 4 gene4 transcription 5 8 9 9 5 2
## 5 gene7 translation 1 2 4 4 4 4
## 6 gene8 receptor 2 4 5 5 5 5
## 7 gene9 transcription 4 5 3 1 4 3
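Rules can be combined inside a single filter(). As a sketch on a made-up four-gene table (‘toy’): conditions separated by a comma must all be true, ‘|’ means "or", and ‘%in%’ checks whether a value belongs to a list of allowed values.

```r
library(tidyverse)

# A four-gene toy table, made up for illustration
toy = data.frame(gene = c("gene1", "gene2", "gene3", "gene4"),
                 annot = c("transcription", "metabolism", "cell-cycle", "receptor"),
                 heart1 = c(10, 3, 1, 5),
                 brain1 = c(4, 5, 4, 9))

# Both conditions must hold (comma means "and")
toy %>% filter(brain1 > 3, heart1 < 5) %>% select(gene)
##    gene
## 1 gene2
## 2 gene3

# At least one condition must hold
toy %>% filter(brain1 > 8 | heart1 > 8) %>% select(gene)
##    gene
## 1 gene1
## 2 gene4

# Annotation is one of several allowed values
toy %>% filter(annot %in% c("cell-cycle", "receptor")) %>% select(gene)
##    gene
## 1 gene3
## 2 gene4
```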
**Example 1**
expt %>% filter(brain1>kidney1/2) %>% filter(brain2>kidney2/2) %>% select(gene,annot)
## gene annot
## 1 gene1 transcription
## 2 gene2 metabolism
## 3 gene3 translation
## 4 gene6 cell-cycle
## 5 gene7 translation
## 6 gene8 receptor
## 7 gene9 transcription
**Example 2**
expt %>% filter(brain1>kidney1/2) %>% filter(brain2>kidney2/2) %>% filter(annot=="transcription") %>% select(gene)
## gene
## 1 gene1
## 2 gene9
Earlier we created the data frame ‘expt’ by combining gene names, annotations and experimental data from six samples (three tissues in two experiments). The data set has 10 genes, their annotations and their expression levels.
gene = c('gene1', 'gene2', 'gene3', 'gene4', 'gene5', 'gene6', 'gene7', 'gene8', 'gene9', 'gene10')
annot=c('transcription', 'metabolism', 'translation', 'transcription', 'cell-cycle', 'cell-cycle', 'translation', 'receptor', 'transcription', 'metabolism')
heart1=c(10,3,4,5,8,9,1,2,4,5)
kidney1=c(3,4,5,8,9,1,2,4,5,10)
brain1=c(4,5,8,9,1,2,4,5,3,2)
heart2=c(2,5,1,9,1,2,4,5,1,12)
kidney2=c(10,3,4,5,1,2,4,5,4,9)
brain2=c(8,2,7,2,1,2,4,5,3,2)
expt=data.frame(gene, annot, heart1,kidney1,brain1,heart2, kidney2, brain2)
Let’s say someone sends you another Excel spreadsheet with data from liver. What will you do? We will first create that new data frame with more mock data.
gene = c('gene1', 'gene2', 'gene3', 'gene4', 'gene5', 'gene6', 'gene7', 'gene8', 'gene9', 'gene10')
liver=c(7,10,1,4,5,7,8,9,1,11)
expt2 = data.frame(gene,liver)
We want a combined spreadsheet from expt and expt2. The dplyr package has a function named ‘inner_join’ to accomplish that task.
inner_join(expt,expt2)
## Joining, by = "gene"
## gene annot heart1 kidney1 brain1 heart2 kidney2 brain2 liver
## 1 gene1 transcription 10 3 4 2 10 8 7
## 2 gene2 metabolism 3 4 5 5 3 2 10
## 3 gene3 translation 4 5 8 1 4 7 1
## 4 gene4 transcription 5 8 9 9 5 2 4
## 5 gene5 cell-cycle 8 9 1 1 1 1 5
## 6 gene6 cell-cycle 9 1 2 2 2 2 7
## 7 gene7 translation 1 2 4 4 4 4 8
## 8 gene8 receptor 2 4 5 5 5 5 9
## 9 gene9 transcription 4 5 3 1 4 3 1
## 10 gene10 metabolism 5 10 2 12 9 2 11
combined=inner_join(expt,expt2)
## Joining, by = "gene"
The first command displays the combined data frame on the screen, whereas the second command saves it in the variable named ‘combined’. You can then use ‘select’, ‘filter’, etc. on the combined data frame to find patterns.
Please note -
Even if two data frames have rows in different orders, the function takes care of ordering them properly before joining data.
If one gene is missing from one data frame or another, ‘inner_join’ will remove all information for the gene from the combined data frame.
If you would like to keep information on the missing genes, dplyr has a number of other functions. For example, ‘full_join’ will keep all genes and fill the missing places with ‘NA’. See the dplyr documentation on joins for other options.
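The difference between the two joins is easiest to see on a tiny example. The data frames ‘a’ and ‘b’ below are made up for illustration: gene2 is missing from ‘b’, gene4 is missing from ‘a’, and the rows are in different orders. Writing by = "gene" states the matching column explicitly instead of letting dplyr guess it.

```r
library(tidyverse)

# Two small tables, made up for illustration
a = data.frame(gene = c("gene1", "gene2", "gene3"), heart = c(10, 3, 4))
b = data.frame(gene = c("gene3", "gene1", "gene4"), liver = c(1, 7, 4))

# inner_join keeps only the genes present in both tables
inner_join(a, b, by = "gene")
##    gene heart liver
## 1 gene1    10     7
## 2 gene3     4     1

# full_join keeps every gene and fills the gaps with NA
full_join(a, b, by = "gene")
##    gene heart liver
## 1 gene1    10     7
## 2 gene2     3    NA
## 3 gene3     4     1
## 4 gene4    NA     4
```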
Functions discussed here -
Name | Action |
---|---|
read_excel | Read excel file |
read_csv | Read csv file |
read_tsv | Read tsv file |
write_csv | Write in csv format |
getwd | Get current working directory |
setwd | Change working directory |
Now that we have mastered simple calculations in R, let us take a step forward and start analyzing real data. The first step is to import data into a tibble from external files, typically stored in text (csv) or Excel (xls) format.
R works in a directory (folder) of your operating system, so when you save a file, it is saved in that folder. To see where R will save your files, try ‘getwd()’. You can change the working directory with ‘setwd()’.
getwd()
## [1] "C:/Users/Student/Documents"
setwd("C:/Users/Student/Documents/R")
R provides at least two ways to read data from an Excel file. The first method uses an external library ‘readxl’. In order to use functions from the library, it needs to be loaded in R by using the command ‘library(readxl)’.
library(readxl)
After that, the function ‘read_excel’ is used to read sheet 1 of the Excel file into a data frame (e.g. ‘mydata’). The concept of is explained below.
mydata <- read_excel("/full/path/to/excel/file/file.xls", sheet=1)
The other approach is to use Excel ‘save as’ function to save the spreadsheet in csv format. Then the csv file can be directly read from R using the ‘read_csv’ function. This function is already in tidyverse and does not require any additional library.
mydata <- read_csv("/full/path/to/csv/file/file.csv")
The command ‘write_csv’ will save a data frame or tibble in a file in csv format.
mydata %>% write_csv("file.csv")
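To convince yourself that writing and reading work together, you can save a small data frame and read it straight back. This is only a sketch: the data are made up, and ‘tempfile()’ is used so that the example does not leave a file in your working directory.

```r
library(tidyverse)

# A small table, made up for illustration
mydata = data.frame(gene = c("gene1", "gene2"), liver = c(7, 10))

# tempfile() gives a throwaway file name; use your own path for real work
csv_file = tempfile(fileext = ".csv")

mydata %>% write_csv(csv_file)   # save to disk
mydata2 = read_csv(csv_file)     # read it back as a tibble
mydata2
```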
library(Biostrings)
dna = DNAString("TGCTAATCCTCT")
dna %>% reverseComplement
## 12-letter "DNAString" instance
## seq: AGAGGATTAGCA
dna %>% translate
## 4-letter "AAString" instance
## seq: C*SS
dna %>% subseq(3,7)
## 5-letter "DNAString" instance
## seq: CTAAT
pairwiseAlignment("ATCCCTTAAAAGGTTGGGT","ATCCCTAAAAGGTTGGTT")
## Global PairwiseAlignmentsSingleSubject (1 of 1)
## pattern: ATCCCTTAAAAGGTTGGGT
## subject: ATCCCT-AAAAGGTTGGTT
## score: 13.79057
genome = readDNAStringSet("C:/Users/Manoj/Desktop/R/web-tables/Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna.toplevel.fa")
#genome <- readDNAStringSet("C://Users/Manoj/Desktop/R/web-tables/Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna.toplevel.fa")
genome %>% subseq(2038152,2038823) %>% reverseComplement %>% translate %>% toString
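library(GenomicRanges)  # provides GRanges() and IRanges()
# A toy genome with two short chromosomes
genome <- DNAStringSet( list(chr1=DNAString("AATGGTCCGTG"), chr2=DNAString("TGGGTGGGTGG")) )
# Extract positions 3 to 5 of chr1 on the plus strand
gr <- GRanges("chr1", IRanges(start=3,end=5), strand="+")
BSgenome::getSeq(genome, gr)
# Extract one range from each chromosome; the minus-strand range is reverse-complemented
gr <- GRanges(c("chr1","chr2"), IRanges(start=c(1,3),end=c(4,5)), strand=c("+","-"))
BSgenome::getSeq(genome, gr)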
ir = IRanges(start=c(1,100), width=c(10,10))
ir
#library(rtracklayer)
#genome <- readDNAStringSet("C://Users/Manoj/Desktop/R/web-tables/e")
#annot <- rtracklayer::import("C://Users/Manoj/Desktop/R/web-tables/Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.39.chromosome.Chromosome.gff3")
#seq <- BSgenome::getSeq(genome, annot[100,])
gene = c('gene1', 'gene2', 'gene3', 'gene4', 'gene5', 'gene6', 'gene7', 'gene8', 'gene9', 'gene10')
heart1=c(10,3,4,5,8,9,1,2,4,5)
kidney1=c(3,4,5,8,9,1,2,4,5,10)
brain1=c(4,5,8,9,1,2,4,5,3,2)
heart2=c(2,5,1,9,1,2,4,5,1,12)
kidney2=c(10,3,4,5,1,2,4,5,4,9)
brain2=c(8,2,7,2,1,2,4,5,3,2)
expt=data.frame(gene, heart1,kidney1,brain1,heart2, kidney2, brain2)
Tidyverse likes the above style, whereas Bioconductor wants a matrix with the gene names as the names of the rows.
expt_bioc=expt %>% select(-gene) %>% as.matrix
row.names(expt_bioc)=expt$gene
expt_bioc
## heart1 kidney1 brain1 heart2 kidney2 brain2
## gene1 10 3 4 2 10 8
## gene2 3 4 5 5 3 2
## gene3 4 5 8 1 4 7
## gene4 5 8 9 9 5 2
## gene5 8 9 1 1 1 1
## gene6 9 1 2 2 2 2
## gene7 1 2 4 4 4 4
## gene8 2 4 5 5 5 5
## gene9 4 5 3 1 4 3
## gene10 5 10 2 12 9 2
If you stick to the Bioconductor format and apply tidyverse functions, your gene names will disappear. Therefore, you need to move them into a column first.
expt=expt_bioc %>% as.data.frame %>% rownames_to_column("gene")
There are times you may also want to add the row number as a column. That task is simple, because you can apply “rownames_to_column” again.
expt=expt %>% rownames_to_column("id")
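For the opposite direction (tidyverse style to Bioconductor style), the tibble package, which is loaded with tidyverse, also has ‘column_to_rownames’. The sketch below does the round trip on a three-gene table (‘toy’) made up for illustration. It is an alternative to setting ‘row.names’ by hand as shown earlier.

```r
library(tidyverse)

# A three-gene toy table, made up for illustration
toy = data.frame(gene = c("gene1", "gene2", "gene3"),
                 heart1 = c(10, 3, 4),
                 brain1 = c(4, 5, 8))

# Tidyverse style -> Bioconductor style: gene names become row names
toy_bioc = toy %>% column_to_rownames("gene") %>% as.matrix
toy_bioc
##       heart1 brain1
## gene1     10      4
## gene2      3      5
## gene3      4      8

# Bioconductor style -> tidyverse style: row names become a column again
toy_back = toy_bioc %>% as.data.frame %>% rownames_to_column("gene")
```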