Session 25 – Biopython: Sequence Parsing with Plots
25.1 Histograms
Histogram of Sequence Lengths in Fasta File (Genbank (full) > Export CDS)
Dataset is in this link
Install matplotlib: $pip install matplotlib (for macOS use: \(pip3 install matplotlib). The `\)` indicates that the install in terminal.
from Bio import SeqIO= [len(rec) for rec in SeqIO.parse("XL_1L_fasta.txt", "fasta")]
sizes
import pylabpylab.hist(sizes, bins=20)
pylab.title(
"%i Xenopus laevis sequences\nLengths %i to %i" % (len(sizes), min(sizes), max(sizes))
)pylab.xlabel("Sequence length (bp)")
pylab.ylabel("Count")
#pylab.show()
pylab.savefig("pylab_fig1.pdf")
25.2 GC Content
GC content of Xenopus laevis 1L genome fasta file.
from Bio import SeqIO
from Bio.SeqUtils import GC
= sorted(GC(rec.seq) for rec in SeqIO.parse("XL_1L_fasta.txt", "fasta"))
gc_values
import pylab
pylab.plot(gc_values)
pylab.title(
"%i orchid sequences\nGC%% %0.1f to %0.1f"
len(gc_values), min(gc_values), max(gc_values))
% (
)pylab.xlabel("Genes")
pylab.ylabel("GC%")
pylab.show()
25.3 Nt Alignment Dot Plot
I will use the previous fasta file from Xenopus laevis genome to provide 2 sequences for the activity below, but you are welcome to align any two genes of your choice as well.
from Bio import SeqIO
open("XL_1L_fasta.txt") as in_handle:
with = SeqIO.parse(in_handle, "fasta")
record_iterator = next(record_iterator)
rec_one = next(record_iterator)
rec_two
= 7
window = str(rec_one.seq).upper()
seq_one = str(rec_two.seq).upper()
seq_two = [
data
[: i + window] != seq_two[j : j + window])
(seq_one[i for j in range(len(seq_one) - window)
]for i in range(len(seq_two) - window)
]
import pylab
pylab.gray()
pylab.imshow(data)
pylab.xlabel("%s (length %i bp)" % (rec_one.id, len(rec_one)))
pylab.ylabel("%s (length %i bp)" % (rec_two.id, len(rec_two)))
pylab.title("Dot plot using window size %i\n(allowing no mis-matches)" % window)
pylab.show()
25.4 Bigger Dots for the Dot Plot
Bigger dots may be better in some scenarios, thus the code below will increase the size of the dots.
from Bio import SeqIO
open("XL_1L_fasta.txt") as in_handle:
with = SeqIO.parse(in_handle, "fasta")
record_iterator = next(record_iterator)
rec_one = next(record_iterator)
rec_two
= 7
window = {}
dict_one = {}
dict_two for (seq, section_dict) in [
str(rec_one.seq).upper(), dict_one),
(str(rec_two.seq).upper(), dict_two),
(:
]for i in range(len(seq) - window):
= seq[i : i + window]
section :
try.append(i)
section_dict[section]:
except KeyError= [i]
section_dict[section]
= set(dict_one).intersection(dict_two)
matches print("%i unique matches" % len(matches))
= []
x = []
y for section in matches:
for i in dict_one[section]:
for j in dict_two[section]:
x.append(i)
y.append(j)
pylab.cla() # clear any prior graph
pylab.gray()
pylab.scatter(x, y)
pylab.xlim(0, len(rec_one) - window)
pylab.ylim(0, len(rec_two) - window)
pylab.xlabel("%s (length %i bp)" % (rec_one.id, len(rec_one)))
pylab.ylabel("%s (length %i bp)" % (rec_two.id, len(rec_two)))
pylab.title("Dot plot using window size %i\n(allowing no mis-matches)" % window)
pylab.show()
25.5 FASTQ we did not forget you
We have spent an extensive time manipulating FASTA files with little mention to its bigger cousin FASTQ. As mentioned before, FASTQ is simply a FASTA with per base quality scores. We can use this quality score data to assess average read reliability or even file reliability
This is a very large file (500MB), you can skip this activity is its not applicable to your project.
Dataset: https://www.ebi.ac.uk/ena/data/view/SRR001666
I will utilize a different SRR that is smaller for the below example. You will need to change the filename SRR number if you wish to utilize the code for your own files. Also note that you will have to use files that are not compressed, thus you must uncompress the file before you can analyze it.
import pylab
from Bio import SeqIO
for subfigure in [1, 2]:
= "SRR11880886_%i.fastq" % subfigure
filename pylab.subplot(1, 2, subfigure)
for i, record in enumerate(SeqIO.parse(filename, "fastq")):
if i >= 50:
break
pylab.plot(record.letter_annotations["phred_quality"])
pylab.ylim(0, 45)
pylab.ylabel("PHRED quality score")
pylab.xlabel("Position")
pylab.show()