Chapter 19 Reading FASTA Files
In this chapter, we will write a script to read a FASTA file containing nucleotides.
The FASTA format is shown below -
>ID1
ATGTGGGAGG
AAGGTGGGTG
AAA
>ID2
AAAATGTGTGTGG
AAAT
>ID3 (gene pd1, discovered 2001)
ATGGTGATA
TTTTTTTTTTTTTTT
AAAATGTGTGT
The format reports the IDs and sequences of each gene. The ID line starts with ‘>’ and may have additional text following the identifier. The sequence can be split into multiple lines. There can be empty sentences in between two genes.
How do we write a script to read a FASTA file and use it for other analysis? This is a fascinating example covering many different aspects of programming.
19.1 Data structure
The first question is what type of data structure we should use to store the FASTA file. Data structure is the method to store data within your program. For individual data, you can hold in string, integer or decimal data structure. For collection of data, you need something like array or associative array. Graph is a possible data structure for data on Airplanes flying between cities.
As you continue to write more and more programs, you will realize that the data structure question becomes very important in your work. In fact, it is one of the two major components of programming, the other one being algorithm. After you master a language, you will find the language to be of less importance in day to day work.
Getting back to data structure question, it is clear that you can use string format to hold each gene or protein sequence. For the collection of genes, you need to use something like an array. In fact, you will need two arrays – one to store the ID and the other to hold the sequence. So, the format is ID_array[0]=‘ID1’ and seq_array[0]=‘ATGTGGGAGGAAGGTGGGTGAAA,’ and so on.
Is that the best data structure? You can answer only by thinking about what kind of questions you will have to answer in your program. In bioinformatics, often you will come across questions like ‘what is the sequence of gene ID2?’ To answer that, you will have to first go over the entire ID_array to find the coordinate of gene ID2 and then you go to seq_array to find its sequence. That entire process is time-consuming and you need a better data structure to answer the question very rapidly.
An associative array becomes a better answer in this case. In case of associative array, you will need only one array, whose identifiers are FASTA Ids and values are sequences. It will look like fasta_array[‘ID1’]=‘ATGTGGGAGGAAGGTGGGTGAAA,’ and so on for the rest of the genes.
In our program, we will use associative arrays.
19.2 Python code to read the FASTA file
#####################
# code to read FASTA file
#####################
import sys, re
f=open('genes.fasta','r')
lines=f.readlines()
hre=re.compile('>(\S+)')
lre=re.compile('^(\S+)$')
gene={}
for line in lines:
outh = hre.search(line)
if outh:
id=outh.group(1)
else:
outl=lre.search(line)
if(id in gene.keys()):
gene[id] += outl.group(1)
else:
gene[id] =outl.group(1)
19.3 Describing How to Code (Based on Code for Reading a FASTA File)
Today we will do something different. I will explain how a program for a given problem is written from scratch, and all the trial-and-error that goes behind it. The example we choose is the one I sent out in the previous hand-out, namely the code for reading a FASTA file. I will explain the python version, but the thought process behind the PERL version follows similar logic.
The task at hand is to read a FASTA file. You know that FASTA format has the ID and sequence for a number of genes. The ID line starts with ‘>’ and may have additional text following the identifier. The sequence can be split into multiple lines. There can be empty sentences in between two genes.
19.4 Step 1. Reading each line
First, we will write a code that reads the lines from a given file. As you write more and more codes, you will find that you are dealing with this question almost always. Therefore, it is prudent to have a couple of templates for reading files ready somewhere in your directory. We will evaluate those templates and explain the pros and cons of each.
Template 1.
f=open('genes.fasta','r')
lines=f.readlines()
Template 2.
f=open('genes.fasta','r')
line1=f.readline()
line2=f.readline()
In both templates, ‘f’ is the identifier for the file ‘genes.fasta.’ That means wherever the symbol ‘f’ occurs in the rest of the code, the computer will know that it has to process the ‘genes.fasta’ file.
The second thing you notice is that the ‘open’ statement contains the option ‘r.’ That means the file is opened for read only, which is appropriate for our task here. You may have to choose other options for different tasks.
Why does python use this method of creating a file identifier (f) instead of using the actual name of the file in the entire code? There are two advantages. First, the file name can be very long, and the programmer may not like to clutter the program with that long text. The second and more important reason for using file identifier is that the filename may not be known before one runs the program. For example, it is very common to specify the filename as a command line option.
Next, let us discuss the differences between template 1 and template 2. In template 1, the entire file ‘genes.fasta’ is read in an array named ‘lines.’ Each element of the array is a string that holds a line from the FASTA file. In template 2, the file needs to be read line-by-line. In the example given above, we read only two lines and store them in strings line1 and line2.
You can see that template 1 is really convenient for programming, because the entire file is read in an array within the code and that array can be processed using ‘for,’ ‘, we can process the array within our code. However, there are some disadvantages of doing so, when it comes to bioinformatics. Some FASTA files can be very big. You certainly do not always want to load the entire human genome using a ’readlines’ statement.
Here is how the programmers write code. They keep the above constraint at the back of the mind and then start with the simple option. If that fails, then they take a step back and modify their code with the more difficult option. Therefore, let us first go with Template 1 for our code. Here is our code at the end of step 1.
f=open('genes.fasta','r')
lines=f.readlines()
After completion of every step, it is prudent to test and make sure the code works. Let us do that after the above step and add a print statement. Since lines array reads the ‘genes.fasta’ files, a statement line ‘print(lines[3])’ will print the third line. You can try it out.
f=open('genes.fasta','r')
lines=f.readlines()
print(lines[3])
Oops, there is a problem.
Traceback (most recent call last):
File "./c.py", line 2, in <module>
f=open('genes.fasta','r')
IOError: [Errno 2] No such file or directory: 'genes.fasta'
You realize that you either do not have a file named ‘genes.fasta’ or the file is not in the same directory as the python script.
Once you handle the above problem, you will notice two things - (i) The program is printing the fourth line instead of the third line. What is going on? By playing with different indices in the print statement, you find out that the array ‘lines’ stores the lines with indices 0,1,2,3,….. (ii) The program is printing an extra newline after the third line. Why so? The program is printing lines[3] from the file including the newline, and if you remember, python print command adds an extra newline after each statement. Hence you have one extra empty line. The best option is to remove newline from the end of lines[3]. The command for that is lines[3]=lines[3].rstrip(‘’). This command removes the newline from the end of lines[3] and saves it as the new lines[3]. Run the following code and you see it working fine.
f=open('genes.fasta','r')
lines=f.readlines()
lines[3]=lines[3].rstrip('\n')
print(lines[3])
Why not remove the newline from all lines? You try this code -
f=open('genes.fasta','r')
lines=f.readlines()
lines=lines.rstrip('\n')
print(lines[3])
and get the error -
Traceback (most recent call last):
File "./c.py", line 4, in <module>
lines=lines.rstrip('\n')
AttributeError: 'list' object has no attribute 'rstrip'
Oops, seems like trouble. What is going on? As it happens, the command rstrip(‘’) can only work on strings, or each element of the lines array, but not on the entire array at the same time. Let us keep this at the back of the mind and revisit later, if needed.
19.5 Step 2. Processing the ID
Let us now move on to processing the lines of the file, which are now stored in ‘lines’ array. We will first look for the FASTA ID. You know that the first line has a FASTA ID after ‘>.’ So, let us see, whether we can write code to extract the ID.
The method to do that is to create a regular expression. So, it is good time to review the unix ‘grep’ commands and python regular expression command, which came from ‘grep.’ The syntax for regular expression is as follows.
import re
f=open('genes.fasta','r')
lines=f.readlines()
hre=re.compile('>(\S+)')
outh = hre.search(lines[0])
if outh:
print(outh.group(1))
When you compare with previous program, you will notice that we added a few extra lines.
The first extra line says ‘import re.’ That is to tell python that we are using the regular expression library.
The second extra line is ‘hre=re.compile(’>(+)‘). This line defines a regular expression named ’hre.’ The regular expression looks for the presence of ‘>+’ in a line, or the presence of ‘>’ followed by some non-whitespace text. What are the brackets for? If you put bracket in a regular expression, that is useful for returning the exact block that matches the text. Therefore, the above command will return the text that comes after ‘>’ symbol or it is the ID in the context of a FASTA format.
The third extra line is ‘outh = hre.search(lines[0]).’ This line searches for ‘hre’ in lines[0]. If the search finds a match, the variable ‘outh’ is 1. If not, it is 0.
The fourth part should be clear from the above description. If there is a match with lines[0], the program goes inside the ‘if’ section and prints ‘outh.group(1),’ which is the matched part inside the bracket.
I encourage you to change the regular expression and try it on various lines to see what you get.
19.6 Step 3. Processing the non-ID lines from FASTA file
You know that the FASTA file has other lines, which contain nucleotide or protein sequences. How do we extract data from them? The answer is regular expression again. We create another regular expression for the non-ID lines. Our code grows to -
import re
f=open('genes.fasta','r')
lines=f.readlines()
hre=re.compile('>(\S+)')
lre=re.compile('^(\S+)$')
outl = lre.search(lines[1])
if outl:
print(outl.group(1))
Here we are trying a second regular expression ‘lre.’ This regular expression looks for ‘^(+)\('. What that means is it is searching for a line that is entirely non-whitespace characters (\S+). The '^' at the beginning and '\)’ at the end make sure the search string matches the entire line, excluding the terminal newline.
To understand how it works, you need to try the code yourself. No matter of description can make this clear.
19.7 Step 4. Data structure
Alright. Now that we know how to process each line of the file, the task is to process the entire file and store into something within the code. What is that ‘something’ going to be? We need to decide about a data structure. Data structure is the method to store data within your program. For individual data, you can hold in string, integer or decimal data structure. For collection of data, you need something like array or associative array.
As you continue to write more and more programs, you will realize that the data structure question becomes very important in your work. In fact, it is one of the two major components of programming, the other one being algorithm. After you master a language, you will find the language to be of less importance in day to day work.
Getting back to data structure question, it is clear that you can use string format to hold each gene or protein sequence. For the collection of genes, you need to use something like an array. In fact, you will need two arrays – one to store the ID and the other to hold the sequence. So, the format is ID_array[0]=‘ID1’ and seq_array[0]=‘ATGTGGGAGGAAGGTGGGTGAAA,’ and so on.
Is that the best data structure? You can answer only by thinking about what kind of questions you will have to answer in your program. In bioinformatics, often you will come across questions like ‘what is the sequence of gene ID2?’ To answer that, you will have to first go over the entire ID_array to find the coordinate of gene ID2 and then you go to seq_array to find its sequence. That entire process is time-consuming and you need a better data structure to answer the question very rapidly.
An associative array becomes a better answer in this case. In case of associative array, you will need only one array, whose identifiers are FASTA Ids and values are sequences. It will look like fasta_array[‘ID1’]=‘ATGTGGGAGGAAGGTGGGTGAAA,’ and so on for the rest of the genes.
In our program, we will use associative arrays. What does that mean for the actual code? Let us open a different file and try a simple example.
gene={}
gene['ID1']='AAAAA'
gene['ID2']='ATGGTG'
print(gene)
You will see that the program prints {‘ID2’: ‘ATGGTG,’ ‘ID1’: ‘AAAAA’}. We have the syntax for an associative array. The next step is to combine it with the regular expression blocks from the other steps.
19.8 Step 5. For loop and final code
The associative array ‘gene’ is the perfect data structure for our code, because it saves FASTA IDs and sequences as pairs. However, instead of telling the program that ‘gene[’ID1’]=’AAAA’ and so on, we need to enter values into ‘gene’ based on processed FASTA file. That means you need a for loop to go over the elements of the array ‘lines,’ process each line to find the ID or the sequence and keep adding information to the associative array ‘gene.’ The program turns into -
#####################
# code to read FASTA file
#####################
import sys, re
f=open('genes.fasta','r')
lines=f.readlines()
hre=re.compile('>(\S+)')
lre=re.compile('^(\S+)$')
gene={}
for line in lines:
outh = hre.search(line)
if outh:
id=outh.group(1)
else:
outl=lre.search(line)
if(id in gene.keys()):
gene[id] += outl.group(1)
else:
gene[id] =outl.group(1)
19.9 Line by line analysis
import re
f=open('MG1655-K12.fasta','r')
lines=f.readlines()
hre=re.compile('>(\S+)')
lre=re.compile('^(\S+)$')
gene={}
i=0
for line in lines:
i=i+1
if (i>66000):
print(i)
outh = hre.search(line)
if outh:
id=outh.group(1)
else:
outl=lre.search(line)
if(id in gene.keys()):
gene[id] += outl.group(1)
else:
gene[id] =outl.group(1)
19.10 General guidelines
For understanding the code, you need to keep a notebook, where you will write down the state of the program. I will explain what that means. In the notebook, create three columns - line number, variables and libraries. In the line number, the computer is at 2.
19.10.1 line 1
A computer runs program linearly, unless told otherwise. That means it starts the execution from line 1 and ends at the last line. In case of above script, the computer comes in, reads the first line and runs the python interpretator to execute remaining lines.
19.10.2 line 2
Python interpretator reads the second line and learns to import a library re
. So far so good. On your notebook, add re
The variables column is empty. The libraries column has re
.
19.10.3 line 3
Now the computer goes to line 3 and sees f=open('MG1655-K12.fasta','r')
It creates a variable f
and links to the file MG1655-K12.fasta
in the physical directory (disk). Now you have f
in the variables column.
19.10.4 line 4
The computer sees lines=f.readlines()
. What is f
? It is the E coli file. What is lines
? It is a new variable. So, you have f
and lines
in the variables column.
Also, at this line, the computer is asked to read the file f
(readlines
function) and store in lines
. That means lines
is a variable of type array. I also forgot to mention that f
is a variable of type file.
19.10.5 lines 5,6
The computer creates two new variables hre
and lre
using the compile function from re
library. Those two variables are of regular
expression type. How many variables do we have in the variables column? - f
(file), lines
(array of string), hre
(reg exp),
lre
(reg exp).
19.10.6 lines 7, 8
Two new variables - gene
(and empty array) and i
(integer with value 0).
19.10.7 line 9
Now the fun begins. Remember - the computer runs a program from the first to the last line, unless told otherwise.
for
command is one way to tell the computer otherwise. The for
command asks the computer to keep looping through the next set of lines, until a condition is satisfied. That condition is line in lines
.
That means the computer will keep looping through the next few lines for 66284 times. Why? Because the array lines
has that many entries.
Question for everyone - what do you have in the variables column now?
anyone with me?
So our variables are now:
f
(file), lines
(array of string), hre
(reg exp), lre
(reg exp), and gene(and empty array) and
i(integer with value 0). Great ! We do have an extra variable after line 9, and it is
line`.
The rest are as
19.10.8 line 10
The value of i
is now i+1
. That means i
has become 1 now.
19.10.9 line 11,12
The computer asks if (i>66000):
Only if the answer is yes, the computer goes to the next line. Otherwise, jumps straight to line 13.
19.10.10 line 13
outh = hre.search(line)
The computer runs the function - hre.search(line)
. This function returns 1, if variable line
matches the regular expression hre
.
because we declare it as
hre =re.compile('>(\S+)')
in the start so it will be looking for the fasta header.
That is correct about hre
, but what is line
in line 13 now? (and why)?
Is it the next line down from the > line? in the fasta file, i mean. the following line?
No, it is the first line in the FASTA file. Let us go back to see where the line
variable comes from. The variable is defined in 9th line of the code, which starts the for
loop.
What if the file is multi-fasta?
The variable is defined as the first element of the lines
array, which means the first line of the FASTA file. That is because the computer read the entire FASTA file and stored in multiple entries of the lines
variable.
would that change if there is a new fasta header on line 12?
the command f.readlines
reads a file with text and saves in an array. Whatever is in the first line of the file will be in lines[0]
of the array. Whatever is in the second line, will be lines[1]
and so on.
If there is a new fasta header on line 12, that will be lines[12]
in the array.
The entire file (in the disk) is being read into an array, which is a variable within the program.
Maybe expanding on what X said, and oi may not be understanding, but it seems like the program is having a difficult time treating the entire e. coli genome as a single array, and adding line after line
So if we reset the array, the program would be able to process it without having start at the beginning of the fasta file each time
The program is going through the entire file in the disk only once. That is minimum of what one can do.
The entire for
loop from l line 9 to the end runs once for every line in the FASTA file.
When it runs the first time, the variable line
has the first line of the file. So, the rest of the code checks for regular expression, finds a match with hre
and determines that it is a FASTA header. The program acts accodingly.
What does act accordingly mean? The program works through these two lines -
(edited)
if outh:
id=outh.group(1)
That means, if outh
is 1, the program creates another new variable named id
and stores the value defined by outh.group(1)
.
What is outh.group(1)
? To find that out, we need to go back to the definition of the regular expression hre
, because outh
is the outcome of that regular expression search. What is hre
? It is hre =re.compile('>(\S+)')
.
You see \S+
in first bracket. That means any text in line
that matches as \S+
will be returned by the code as outh.group(1)
.
What does that mean? To trace back, we need to look at the line
variable. That variable has the first line of the FASTA file, which happens to have the FASTA header. The regular expression searches through the FASTA header and finds out the text right after >
.
That is returned as id=outh.group(1)
.
import re
f=open('MG1655-K12.fasta','r')
lines=f.readlines()
hre=re.compile('>(\S+)')
lre=re.compile('^(\S+)$')
gene={}
i=0
for line in lines:
i=i+1
if (i>66000):
print(i)
outh = hre.search(line)
if outh:
id=outh.group(1)
print(id)
with the accompanying indents - they did not copy over unfortunately
you may add the line exit(0)
after the print command. That way it will exit right after executing the command and you will not see the rest of the output from the other lines.
exit(0)
the slowdown comes from this line gene[id] += outl.group(1)
What is the line doing? The command +=
means it adds a string at the end of the existing string.
Think about doing the same on your notepad. What will happen?
19.11 Debugging
Each time you are adding the string at the end, your variable gets a bit bigger on the paper, right? So, if you want to keep track of it, after first few lines, you realize that you do not have enough space. So, you copy that particular variable in a new page and then erase the old copy. Again you are adding more and more text at the end of the variable. Once again you run out of space. So, copy everything in an even bigger paper, and erase the old variable. This copy and erase part will take a while, when you have, say, 300 characters, right? The same physical process is taking place in the computer except that it gets tired with bigger string than you doing manually. (edited) Regarding the error in the code, it fails at the absolutely last line of the FASTA file.