Chapter 16 Biology Problems

16.1 Counting Nucleotides

Count each kind of nucleotide (A, T, G, C) in a sequence and report it as a percentage of the total number of nucleotides.

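x = "ATTGCTGAGAGCCGATTAAA"
count = {'A': 0, 'C': 0, 'G': 0, 'T': 0}
L = len(x)

# Tally each nucleotide in the sequence.
for base in x:
    count[base] = count[base] + 1

print(count)

# Report each count as a percentage of the sequence length.
for base in count:
    print(base, 100 * count[base] / L)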

16.2 Reverse Complement

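Note: for a discussion of ways to reverse a string in Python, see

http://codereview.stackexchange.com/questions/102052/reversing-a-string-in-python

The reverse complement of a DNA sequence is the sequence read backwards, with each base replaced by its complement (A with T, G with C). We will use our Python skills to build it in two steps.

The first task is to reverse a string. A slice with a step of -1 does this.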

print("flash"[::-1])

hsalf

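The second task, taking the complement, can be combined with the reversal. The following is a minimal sketch, not a definitive implementation: the names COMPLEMENT and reverse_complement are chosen here for illustration, and the code assumes an uppercase sequence containing only A, C, G, and T.

```python
# Complement of each base; assumes an uppercase sequence containing only A, C, G, T.
COMPLEMENT = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    # Walk the sequence backwards and swap each base for its partner.
    return ''.join(COMPLEMENT[base] for base in reversed(seq))

print(reverse_complement("ATTGC"))  # GCAAT
```

A sequence containing other symbols, such as N, would raise a KeyError with this dictionary; adding an entry such as 'N': 'N' is one way to handle that case.
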
16.3 Translating Nucleotide Sequences into Protein Sequences

T={'TTT':'F', 'TTC':'F', 'TTA':'L', 'TTG':'L', 'TCT':'S', 'TCC':'S', 'TCA':'S', 'TCG':'S',
'TAT':'Y', 'TAC':'Y', 'TAA':'STOP', 'TAG':'STOP', 'TGT':'C', 'TGC':'C', 'TGA':'STOP', 'TGG':'W',
'CTT':'L', 'CTC':'L', 'CTA':'L', 'CTG':'L', 'CCT':'P', 'CCC':'P', 'CCA':'P', 'CCG':'P', 'CAT':'H',
'CAC':'H', 'CAA':'Q', 'CAG':'Q', 'CGT':'R', 'CGC':'R', 'CGA':'R','CGG':'R',
'ATT':'I','ATC':'I','ATA':'I','ATG':'M','ACT':'T','ACC':'T','ACA':'T','ACG':'T',
'AAT':'N','AAC':'N','AAA':'K','AAG':'K','AGT':'S','AGC':'S','AGA':'R','AGG':'R',
'GTT':'V','GTC':'V','GTA':'V','GTG':'V','GCT':'A','GCC':'A','GCA':'A','GCG':'A',
'GAT':'D','GAC':'D','GAA':'E','GAG':'E','GGT':'G','GGC':'G','GGA':'G','GGG':'G'}

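# Read the first line of the file 'gene', which holds the DNA sequence.
with open('gene', 'r') as f:
    line = f.readline().strip()

# Translate up to the first 1000 bases, one codon (3 bases) at a time.
# Stopping at len(line) - 2 skips a trailing partial codon.
for i in range(0, min(len(line) - 2, 1000), 3):
    x = line[i:i+3]
    print(T[x], end=' ')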

16.4 Computing GC-Content of a Genome

A genome sequence consists of four types of nucleotides (A, T, G, C). Often it also includes N to represent unsequenced bases. The GC-content of a genome is the ratio of the number of G and C bases to the total number of A, T, G, and C bases; N is not counted.

Write a program to compute the GC-content of the E. coli genome.
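
One possible starting point is sketched below. It is an illustration under stated assumptions, not a reference solution: the function name gc_content is made up here, and the genome is assumed to arrive as lines of text in FASTA style, where header lines begin with '>'. The example runs on a small in-memory list rather than the real E. coli file.

```python
def gc_content(lines):
    """Return (G + C) / (A + T + G + C) over all sequence lines.

    Assumes FASTA-style input: lines starting with '>' are headers and are skipped.
    """
    gc = 0
    total = 0
    for line in lines:
        if line.startswith('>'):
            continue
        line = line.strip().upper()
        g_and_c = line.count('G') + line.count('C')
        gc += g_and_c
        # N and any other symbols are left out of the denominator.
        total += g_and_c + line.count('A') + line.count('T')
    return gc / total if total else 0.0

# Small in-memory example; an open file object can be passed in the same way.
example = [">toy sequence", "ATTGCTGAGA", "GCCGATTAAA", "NNNN"]
print(gc_content(example))  # 0.4
```

Because the function processes one line at a time, it never needs to hold the whole genome in memory as a single string.
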

16.5 Counting K-mers in the E. coli Genome

In this problem, you need to identify k-mer statistics of the E. coli genome for k=21. The concept of k-mer statistics is explained below.

You can split a genome sequence or any other long nucleotide sequence into short overlapping ‘k-mers’. For example, the sequence ‘ATGCTGGAAAT’ can be split into the following 3-mers: ATG, TGC, GCT, CTG, TGG, GGA, GAA, AAA, AAT. Similarly, if you choose k=4, you get the following 4-mers: ATGC, TGCT, GCTG, CTGG, TGGA, GGAA, GAAA, AAAT. In each case, you also need to consider the reverse strand by taking the reverse complement of the entire sequence, which gives you another set of k-mers.
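
The splitting described above can be sketched with a sliding window of width k. The function name kmers is chosen here for illustration, and the sketch covers one strand only.

```python
def kmers(seq, k):
    """Return the overlapping k-mers of seq, in order, for one strand."""
    # A sequence of length n has n - k + 1 windows of width k.
    return [seq[i:i+k] for i in range(len(seq) - k + 1)]

print(kmers("ATGCTGGAAAT", 3))
# ['ATG', 'TGC', 'GCT', 'CTG', 'TGG', 'GGA', 'GAA', 'AAA', 'AAT']
print(kmers("ATGCTGGAAAT", 4))
# ['ATGC', 'TGCT', 'GCTG', 'CTGG', 'TGGA', 'GGAA', 'GAAA', 'AAAT']
```
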

Some k-mers may appear multiple times within a genome or a long sequence. Therefore, the ‘k-mer statistics’ of a genome is a table listing the sequence of each k-mer and its corresponding count.

As an example, the k-mer statistics (k=3) of AAAAAAAA are AAA 6 times and TTT 6 times, after including both strands.
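
The small example above can be reproduced with a short sketch. The names COMPLEMENT and kmer_counts are made up for illustration, and the code assumes the sequence contains only A, C, G, and T. For the full E. coli genome with k=21 the same idea applies, but memory use will grow with the number of distinct k-mers.

```python
from collections import Counter

# Assumes an uppercase sequence containing only A, C, G, T.
COMPLEMENT = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}

def kmer_counts(seq, k):
    """Count k-mers on seq and on its reverse complement."""
    reverse_strand = ''.join(COMPLEMENT[b] for b in reversed(seq))
    counts = Counter()
    for strand in (seq, reverse_strand):
        # Slide a window of width k along the strand.
        for i in range(len(strand) - k + 1):
            counts[strand[i:i+k]] += 1
    return counts

print(dict(kmer_counts("AAAAAAAA", 3)))  # {'AAA': 6, 'TTT': 6}
```
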