Session 20 – Biopython

20.1 Introduction to Biopython

Guide adapted from: http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec3

All credit is due to Biopython distribution

Imagine freely available tools for computational molecular biology. That is the essense of Biopython. For an online repository of all things related to Biopython feel free to visit the official webpage: http://www.biopython.org. Biopython includes various bioinformatic parsers, remote database and datatool access, custom access to popular aligners and much more. In summary, Biopython provides a toolset of the most common tools in molecular biology.

20.2 Biopython Installation

Installing Biopython. In Mac open terminal by typing ‘terminal’ in finder and selecting it. In Windows open terminal by typing ‘terminal’ in the desktop search bar and selecting it.

Next type the following into the terminal.

pip install biopython
pip install --upgrade biopython

The first command will install Biopython. The second command will upgrade Biopython.

20.3 Getting Started

In order to use Biopython effectively you need to have coding experience and it also helps to have a introduction to python. We provded a short introduction to python so that biopython would be easier to understand and utilize.

When we design an experiment to test a hypothesis, we often refer to past methodology that could help structure our own. In a similar way, in programming one would look for past approaches to solving a particular problem as well as looking for if those tools are available for reuse. We try in data science to reuse as many tools as necessary as reinventing the wheel, starting from scratch, is often time consuming and completely unnecessary.

Sometimes you will find numerous ways of completing a task. In this case you will need to find the most efficient tool to use. Comparing the outputs of each particular program and deciding for yourself which one is best for your particular task.

Lets start working with sequences with the ‘Seq’ object via the SeqRecord class. First to utilize these features we must import Seq from Bio.Seq module toolkit.

# Open Bio.Seq Module, Import what you need only: Seq
from Bio.Seq import Seq

# Create your own sequence inside the Seq() object
my_first_seq = Seq("TGCATGCATCGATCG")

# Print command mediated print function
print(my_first_seq)

# find the complement sequence easily with: complement()
my_first_seq.complement()

# find the reverse complement easily with reverse_complement()
my_first_seq.reverse_complement()

20.4 Parsing my FASTA

FASTA, one line header followed by one line sequence (or multi-line sequence). FASTA is the simiplest fashion to represent sequences. If we wanted to get more complex we could add quality scores for each base in the sequence, which would make it a FASTQ.

If we wanted all details pertaining to these sequences we could throw in everything but the kitchen sink and call it a GenBank file. This would include everything from who it is from, where it was collected, which experiment it was part of, etc.

Lets go to NCBI and select a gene for our next activity: https://www.ncbi.nlm.nih.gov/protein

For this activity I will focus on Buforin I but you are free to choose any protein that interests you and proceed by downloading the complete fasta file with all correlating sequences.

Open the file that you download to check and make sure it is in FASTA format. Never assume that the files downloaded match the description.

# configure file location
fasta_file = PATH TO FASTA FILE

# import SeqIO toolkit
from Bio import SeqIO

# for X in fasta file, print ID SEQUENCE and LENGTH of SEQ
for seq_record in SeqIO.parse(fasta_file, "fasta"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))

You can see how easy it is to pull information from the FASTA file with a couple lines of code. We can print that information into the console, we can also save that data in python to filter it even further and in -*/e cases it may be better to save that data. Lets see a few of those commands.

# import module for changing directories
import os

# Change Directory
os.chdir("Your Python Work Folder For This Class")

# Save the data!
for seq_record in SeqIO.parse(fasta_file, "fasta"):
    with open('parsed_fasta.txt', 'a') as f:
        f.write(seq_record.id)
        f.write("\n")
        f.close()

20.5 Parsing my GenBank or GenPept

If the input is not in FASTA format, you can make some minor tweaks and use a GenBank or GenPept file instead.

Lets go back to NCBI and select a gene for our next activity but this time lets select the genpept download: https://www.ncbi.nlm.nih.gov/protein

Move this file into your python working directory, then proceed with the commands below:

from Bio import SeqIO
for seq_record in SeqIO.parse("sequence.gp", "genbank"):
     print(seq_record.id)
     print(repr(seq_record.seq))
     print(len(seq_record))

There are way more parsers in Biopython that we do not have time to go over all of them. Feel free to explore them on your own after class.