Session 25 – Biopython: Sequence Parsing with Plots

25.1 Histograms

Histogram of Sequence Lengths in Fasta File (Genbank (full) > Export CDS)

Dataset is in this link

Install matplotlib: $pip install matplotlib (for macOS use: $pip3 install matplotlib). The `$` indicates that the install in terminal.

from Bio import SeqIO
sizes = [len(rec) for rec in SeqIO.parse("XL_1L_fasta.txt", "fasta")]

import pylab
pylab.hist(sizes, bins=20)
pylab.title(
    "%i Xenopus laevis sequences\nLengths %i to %i" % (len(sizes), min(sizes), max(sizes))
)
pylab.xlabel("Sequence length (bp)")
pylab.ylabel("Count")
#pylab.show()
pylab.savefig("pylab_fig1.pdf")

25.2 GC Content

GC content of Xenopus laevis 1L genome fasta file.

from Bio import SeqIO
from Bio.SeqUtils import GC

gc_values = sorted(GC(rec.seq) for rec in SeqIO.parse("XL_1L_fasta.txt", "fasta"))

import pylab

pylab.plot(gc_values)
pylab.title(
    "%i orchid sequences\nGC%% %0.1f to %0.1f"
    % (len(gc_values), min(gc_values), max(gc_values))
)
pylab.xlabel("Genes")
pylab.ylabel("GC%")
pylab.show()

25.3 Nt Alignment Dot Plot

I will use the previous fasta file from Xenopus laevis genome to provide 2 sequences for the activity below, but you are welcome to align any two genes of your choice as well.

from Bio import SeqIO

with open("XL_1L_fasta.txt") as in_handle:
    record_iterator = SeqIO.parse(in_handle, "fasta")
    rec_one = next(record_iterator)
    rec_two = next(record_iterator)

window = 7
seq_one = str(rec_one.seq).upper()
seq_two = str(rec_two.seq).upper()
data = [
    [
        (seq_one[i : i + window] != seq_two[j : j + window])
        for j in range(len(seq_one) - window)
    ]
    for i in range(len(seq_two) - window)
]

import pylab

pylab.gray()
pylab.imshow(data)
pylab.xlabel("%s (length %i bp)" % (rec_one.id, len(rec_one)))
pylab.ylabel("%s (length %i bp)" % (rec_two.id, len(rec_two)))
pylab.title("Dot plot using window size %i\n(allowing no mis-matches)" % window)
pylab.show()

25.4 Bigger Dots for the Dot Plot

Bigger dots may be better in some scenarios, thus the code below will increase the size of the dots.

from Bio import SeqIO

with open("XL_1L_fasta.txt") as in_handle:
    record_iterator = SeqIO.parse(in_handle, "fasta")
    rec_one = next(record_iterator)
    rec_two = next(record_iterator)

window = 7
dict_one = {}
dict_two = {}
for (seq, section_dict) in [
    (str(rec_one.seq).upper(), dict_one),
    (str(rec_two.seq).upper(), dict_two),
]:
    for i in range(len(seq) - window):
        section = seq[i : i + window]
        try:
            section_dict[section].append(i)
        except KeyError:
            section_dict[section] = [i]

matches = set(dict_one).intersection(dict_two)
print("%i unique matches" % len(matches))

x = []
y = []
for section in matches:
    for i in dict_one[section]:
        for j in dict_two[section]:
            x.append(i)
            y.append(j)

pylab.cla()  # clear any prior graph
pylab.gray()
pylab.scatter(x, y)
pylab.xlim(0, len(rec_one) - window)
pylab.ylim(0, len(rec_two) - window)
pylab.xlabel("%s (length %i bp)" % (rec_one.id, len(rec_one)))
pylab.ylabel("%s (length %i bp)" % (rec_two.id, len(rec_two)))
pylab.title("Dot plot using window size %i\n(allowing no mis-matches)" % window)
pylab.show()

25.5 FASTQ we did not forget you

We have spent an extensive time manipulating FASTA files with little mention to its bigger cousin FASTQ. As mentioned before, FASTQ is simply a FASTA with per base quality scores. We can use this quality score data to assess average read reliability or even file reliability

This is a very large file (500MB), you can skip this activity is its not applicable to your project.

Dataset: https://www.ebi.ac.uk/ena/data/view/SRR001666

I will utilize a different SRR that is smaller for the below example. You will need to change the filename SRR number if you wish to utilize the code for your own files. Also note that you will have to use files that are not compressed, thus you must uncompress the file before you can analyze it.

import pylab
from Bio import SeqIO

for subfigure in [1, 2]:
    filename = "SRR11880886_%i.fastq" % subfigure
    pylab.subplot(1, 2, subfigure)
    for i, record in enumerate(SeqIO.parse(filename, "fastq")):
        if i >= 50:
            break
        pylab.plot(record.letter_annotations["phred_quality"])
    pylab.ylim(0, 45)
    pylab.ylabel("PHRED quality score")
    pylab.xlabel("Position")

pylab.show()