Session 31 – Working with Proteins
Most bioinformatics uses sequences from proteins derived from protein sequencing or translation (DNA/RNA to protein) from nucleotide sequences. Such protein analyses usually include sequence analyses (e.g., amino acid composition), protein 2D-3D structure analyses, and comparison between organisms or species (evolution analyses). At the most basic stage, we can compare an amino acid sequence of an unknown protein to those in the ever-growing databases of proteins with known functions. This approach will help us to identify and possible infer protein function, location of expression (e.g., cell region or tissue), and new/novel changes on the amino acid sequence. Protein structure analyses usually involve complex mathematical calculations and biophysical studies to infer how protein looks in a living organism. Using bioinformatics can help us to infer the folding and stability of proteins and to predict the secondary structures from amino acid sequences. Lastly, as we explore the diversity of life as more and more species have their genomes/transcriptomes sequenced, we might want to infer the connection between the phenotype (how an organism looks like or its observable characteristics) and the underlying genotype (genome and gene sequences) as there are expressed in response to the environment. For this purpose, we can use phylogenetic methods to find homologous proteins and similar protein regions that might perform similar enzymatic functions. In this session, we will explore how to work with amino acid sequences.
31.1 Sequences in FASTA format
Nucleotide and amino sequences can be encoded in FASTA that is a text-based format where a single letter represent a nucleotide or amino acid. FASTA has a standard configuration that includes two main components: (1) A single-line description that starts with “>” symbol. This usually includes information for the user to identify the gene or protein sequences that might follow in the next line; and (2) A single-line or wrapped text that includes the nucleotide or amino acid sequence. This second line includes very specific codes for nucleotides and amino acids in the standard IUB/IUPAC and additional codes.
The nucleic acid codes supported are:
Nucleic Acid Code | Meaning | Mnemonic |
---|---|---|
A | A | Adenine |
C | C | Cytosine |
G | G | Guanine |
T | T | Thymine |
U | U | Uracil |
(i) | i | inosine (non-standard) |
R | A or G (I) | puRine |
Y | C, T or U | pYrimidines |
K | G, T or U | bases which are Ketones |
M | A or C | bases with aMino groups |
S | C or G | Strong interaction |
W | A, T or U | Weak interaction |
B | not A (i.e. C, G, T or U) | B comes after A |
D | not C (i.e. A, G, T or U) | D comes after C |
H | not G (i.e., A, C, T or U) | H comes after G |
V | neither T nor U (i.e. A, C or G) | V comes after U |
N | A C G T U | Nucleic acid |
- | gap of indeterminate length |
The amino acid codes supported (22 amino acids and 3 special codes) are:
Amino Acid Code | Symbol | Meaning |
---|---|---|
A | ALA | alanine |
B | ASX | aspartate or asparagine |
C | CYS | cystine |
D | ASP | aspartate |
E | GLU | glutamate |
F | PHE | phenylalanine |
G | GLY | glycine |
H | HIS | histidine |
I | ILE | isoleucine |
K | LYS | lysine |
L | LEU | leucine |
M | MET | methionine |
N | ASN | asparagine |
P | PRO | proline |
Q | GLN | glutamine |
R | ARG | arginine |
S | SER | serine |
T | THR | threonine |
U | selenocysteine | |
V | VAL | valine |
W | TRP | tryptophan |
Y | TYR | tyrosine |
Z | GLX | glutamate or glutamine |
X | any | |
* | translation stop | |
- | gap of indeterminate length |
More information can be found here.
31.2 Fetching AA sequences from NCBI
1) To import amino acid (AA) sequences into R, we want to read a file with AA sequences to create a character vector contains text (e.g., sequence of letters for nucleotide or amino acid sequence). In R, you can easily interact with NCBI database and fetch sequences as vectors without the need to save those sequences into a file. For example, we will download protein sequences of a COX1 mitochondrial gene of Epipedobates anthonyi.
We will follow the example from previous sessions using the R-package rentrez.
library(rentrez)
# if you need to install R-package 'rentrez', use install.packages("rentrez")
# Download some amino sequences from NCBI for Epipedobates anthonyi[Organism] that correspond to the gene COI
<- "Epipedobates anthonyi[Organism] AND COI[Gene]"
froggy_name_COX1_prot <- entrez_search(db="protein", term= froggy_name_COX1_prot)
froggy_seq_IDs_prot
froggy_seq_IDs_prot#Entrez search result with 2 hits (object contains 2 IDs and no web_history object)
# Search term (as translated): "Epipedobates anthonyi"[Organism] AND COI[Gene]
# revising the structure of 'froggy_seq_IDs_prot' that there are 2 sequence in NCBI protein database
str(froggy_seq_IDs_prot)
#List of 5
# $ ids : chr [1:2] "1952638368" "110332819"
# $ count : int 2
# $ retmax : int 2
# $ QueryTranslation: chr "\"Epipedobates anthonyi\"[Organism] AND COI[Gene]"
# $ file :Classes 'XMLInternalDocument', 'XMLAbstractDocument' <externalptr>
# - attr(*, "class")= chr [1:2] "esearch" "list"
<- entrez_fetch(db="protein", id=froggy_seq_IDs_prot$ids, rettype="fasta")
froggy_seqs_prot_fasta cat(froggy_seqs_prot_fasta)
#>QQL01627.1 cytochrome oxidase subunit I, partial (mitochondrion) [Epipedobates anthonyi]
#VMITRWLFSTNHKDIGTLYLVFGAWAGMVGTALSLLIRAELSQPGSLLGDDQIYNVIVTAHAFVMIFFMV
#MPILIGGFGNWLVPLMIGAPDMAFPRMNNMSFWLLPPSFLLLLASAGVEAGAGTGWTVYPPLAGNLAHAG
#PSVDLTIFSLHLAGVSSILGAINFITTTLNMKPPSLTQYQTPLFVWSVLITAVLLLLSLPVLAAGITMLL
#TDRNLNTTFFDPAGGGDPILYQHLFWFFGHPEVYILILPGFGIISHVVTFYSSKKEPFGYMGMVWAMMSI
#GLLGFIVWAHHMFTTDLNVDTRAYFTSATMIIAIPTGVKVFSWLATMHGGIIKWDAAMLWALGFIFLFTV
#GGLTGIVLANSSLDIVLHDTYYVVAHFHYVLSMGAVFAIMAGFVHWFPLFTGFTLHEAWTKIHFGVMFAG
#VNLTFFPQHFLGLAGMPRRYSDYPDAYTLWNTVSSVGSLISLVAVIIMMFIIWEAFSSKRLFLNAEMTPT
#NVEWLYGSPPPYHTFEEAVYSKV
#
#>ABG67395.1 cytochrome oxidase subunit I, partial (mitochondrion) [Epipedobates anthonyi]
#TLYLVFGAWAGMVGTALSLLIRAELSQPGSLLGDDQIYNVIVTAHAFVMIFFMVMPILIGGFGNWLVPLM
#IGAPDMAFPRMNNMSFWLLPPSFLLLLASTGVEAGAGTGWTVYPPLAGNLAHAGPSVDLTIFSLHLAGVS
#SILGAINFITTTLNMKPPSLTQYQTPLFVWSVLITAVLLLLSLPVLAAGITMLLTDRNLNTTFFDPAGGG
#DPILYQHLF
We can save these sequences in a text file in your working directory.
# this is exclusive to your OWN COMPUTER change it accordingly
setwd("~/Desktop/Teach_R/A_Class_powerpoints/Spring_2024/session_12/working_with_proteins")
write(froggy_seqs_prot_fasta, "Eanthonyi_COI_prot_seqs.fasta")
2) Alternatively, you can import nucleotide sequences into R and then translate them to protein. As above, we want to read a file with nucleotide sequences to create a character vector contains text (e.g., sequence of letters for nucleotide sequences). In R, you can easily interact with NCBI database and fetch sequences as vectors without the need to save those sequences into a file.
We will follow the example from previous sessions using the R-package rentrez. However, we will work with a more complicated sequence for the mitochondrial ND2 gene.
library(rentrez)
# if you need to install R-package 'rentrez', use install.packages("rentrez")
# Download some amino sequences from NCBI for Epipedobates anthonyi[Organism] that correspond to the gene ND2
<- "Epipedobates anthonyi[Organism] AND ND2[Gene]"
froggy_name_ND2_nuc <- entrez_search(db="nuccore", term= froggy_name_ND2_nuc)
froggy_seq_IDs_nuc
froggy_seq_IDs_nuc#Entrez search result with 1 hits (object contains 1 IDs and no web_history object)
# Search term (as translated): "Epipedobates anthonyi"[Organism] AND ND2[Gene]
# revising the structure of 'froggy_seq_IDs_nuc' that there are 1 sequence in NCBI protein database
str(froggy_seq_IDs_nuc)
#List of 5
# $ ids : chr "328728262"
# $ count : int 1
# $ retmax : int 1
# $ QueryTranslation: chr "\"Epipedobates anthonyi\"[Organism] AND ND2[Gene]"
# $ file :Classes 'XMLInternalDocument', 'XMLAbstractDocument' <externalptr>
# - attr(*, "class")= chr [1:2] "esearch" "list"
<- entrez_fetch(db="nuccore", id=froggy_seq_IDs_nuc$ids, rettype="fasta")
froggy_seqs_nuc_fasta cat(froggy_seqs_nuc_fasta)
#>HQ290995.1 Epipedobates anthonyi 12S ribosomal RNA gene, partial sequence; tRNA-Val, 16S ribosomal RNA, and tRNA-Leu genes, complete sequence; NADH dehydrogenase subunit 1 (ND1) gene, partial cds; tRNA-Ile, tRNA-Gln, and tRNA-Met genes, complete sequence; and NADH dehydrogenase subunit 2 (ND2) gene, complete cds; mitochondrial
#AAGGTTTGGTCCTAGCCTTGAAATCAGTTATTATTTAATATACACATGCAAGTCTCCGCACCCCTGTGAA
#AACGCCCTTATTCTCCCCTACAGGGAGCAGGAGCCGGTATCAGGCACAAATTTTTGCCCATAACACCTAG
#CATCGCCACACCCACAAGGGACTTCAGCAGTGATTAACATTGAGCATAAGCGATAGCTTGACTCAGTTAA
#ACTAAACAGAGCCGGCCAATCTGGTGCCAGCCGCCGCGGTTACACCACTGTGGCTCAAATTGATTTCTAT
#CGGCGTAAAGCGTGATTAAAGATTAATCTTTTGGAGTTAAACTAAAATTAAGCTGTGACACGCTTATTAC
#CTAAGAAAATCATAAACGAAAGCTACTCCAATATTACCAACTTGAATTCACGACAACTGAGGAACAAACT
#GGGATTAGATACCCCACTATGCTCAGTCGTAAACTTTAACTTACATCTTTCTACTCGCCAGGGAACTACG
#AGCTAAGCTTAAAACCCAAAGGACTTGACGGTACCCCATATCCCCCTAGAGGAGCCTGTCCTATAATCGA
#TAACCCCCGTTTTACCTCACCATTTTTTGTTAATCAGCCTGTATACCTCCGTCGTCAGCTTACCACGTGA
#GCGTGTGAGCTAAATGTTTTTTTCAACCACACGTCAGGTCAAGGTGCAGCAAATAAAATGGAAAGAGATG
#GGCTACACTCTCTATTCTAGAATAAACAAAAGACTAAATGAAACCCGGTCAGAAGGCGGATTTAGCAGTA
#AAATGAAACTAGAGCGTTCATTTAAACATGGCACTGGGGTGTGTACACACCGCCCGTCACCCTCCTCAAT
#GCTTCAAAACAGTATTTCACACTTTTTAGCTCCACAGAAGAGGCAAGTCGTAACATGGTAAGTATACCGG
#AAGGTATGCTTGGAATCAAAATGTAACTTAAAATAAAGTATTTCGCTTACACCGAAAAAACATCAGTGTA
#ACCCTGATCATTTTGAGCTAAAAATTTAGCCCCACCCATCCGCATATTCTTTCCGTATTTATTCCTATAA
#ACTAAAACATTTTTTATATTTAGTAAAGGCGATTAAAAAATTTTTATGGAGCAATAACCACAGTACCGCA
#AGGGAAAAATGAAATAAAAATGAAACATTCTCAAGCATAAAAAAGTAGAGCTAAATCCTCGTACCTTTTG
#CATCATGATTTAACTAGTCTAATAAAGCAAAAAGATTTTTAGTTTTAATTCCCGAACCTAGACGAGCTAT
#...
This sequence contains both coding (i.e., translatable to AA sequences) and non-coding sequences (i.e., ribosomal and tRNAs). We can save these sequences in a text file in your working directory.
We can append a nucleotide sequence to this vector that can be directly translated to AA sequences
library(rentrez)
# Download some COI sequences from NCBI for Epipedobates anthonyi[Organism] that correspond to the gene COI
<- "Epipedobates anthonyi[Organism] AND COI[Gene]"
COX1_nuc <- entrez_search(db="nuccore", term= COX1_nuc)
COX1_IDs <- entrez_fetch(db="nuccore", id=COX1_IDs$ids, rettype="fasta")
COX1_nuc_fasta
## append to ND2 vector
<- c(froggy_seqs_nuc_fasta,COX1_nuc_fasta)
more_nuc_fasta cat(more_nuc_fasta)
We can save these sequences in a text file in your working directory.
# this is exclusive to your OWN COMPUTER change it accordingly
setwd("~/Desktop/Teach_R/A_Class_powerpoints/Spring_2024/session_12/working_with_proteins")
write(more_nuc_fasta, "Eanthonyi_ND2_nuc_seqs.fasta")
31.3 Importing AA sequences in R
We can import the amino acid sequences of from a file in FASTA format. We need to load the R-package Biostrings. The file can also be downloaded from the GitHub repository for this class (i.e., Eanthonyi_COI_prot_seqs.fasta
and Eanthonyi_ND2_nuc_seqs.fasta
).
## load biostring
library(Biostrings)
## we can import if we have sequences in FASTA
<- readAAStringSet(filepath = "~/Desktop/Teach_R/A_Class_powerpoints/Spring_2024/session_12/working_with_proteins/Eanthonyi_COI_prot_seqs.fasta",
my_AA_Biostrings_set format = "fasta")
my_AA_Biostrings_set#AAStringSet object of length 2:
# width seq names
#[1] 513 VMITRWLFSTNHKDIGTLYLVFGAWAGMVGTALSLLIRAELSQPGSLLGDDQIYNV...SLISLVAVIIMMFIIWEAFSSKRLFLNAEMTPTNVEWLYGSPPPYHTFEEAVYSKV QQL01627.1 cytoch...
#[2] 219 TLYLVFGAWAGMVGTALSLLIRAELSQPGSLLGDDQIYNVIVTAHAFVMIFFMVMP...QTPLFVWSVLITAVLLLLSLPVLAAGITMLLTDRNLNTTFFDPAGGGDPILYQHLF ABG67395.1 cytoch...
31.4 Importing nucleotide sequences in R
We can import nucleotide sequences of from a file in FASTA or FASTQ formats. The last sequence format is similat to FASTA but include sequence (nucleotides) and its corresponding quality scores from the sequencing experiment for more information see here. The file can also be downloaded from the GitHub repository for this class (i.e., Eanthonyi_COI_prot_seqs.fasta
and Eanthonyi_ND2_nuc_seqs.fasta
).
<- readDNAStringSet(filepath = "~/Desktop/Teach_R/A_Class_powerpoints/Spring_2024/session_12/working_with_proteins/Eanthonyi_ND2_nuc_seqs.fasta",
my_nuc_Biostrings_set format = "fasta")
my_nuc_Biostrings_set#DNAStringSet object of length 3:
# width seq names
#[1] 4886 AAGGTTTGGTCCTAGCCTTGAAATCAGTTATTATTTAATATACACATGCAAGTCTC...TCACTCTCTGCCTTCTTCCTATTTCCCCATTTTTTTTACTATTTATTAGAAACTTA HQ290995.1 Epiped...
#[2] 1539 GTGATAATTACCCGATGATTATTCTCCACAAACCATAAAGATATCGGAACCCTATA...TATATGGCTCCCCTCCTCCTTATCACACATTTGAGGAAGCTGTTTATTCTAAAGTA MW042033.1 Epiped...
#[3] 658 AACCCTATATTTAGTATTTGGGGCATGAGCCGGAATAGTAGGAACAGCCCTAAGCC...CCACTTTCTTTGACCCAGCAGGTGGAGGTGACCCCATTCTTTATCAACATCTCTTC DQ502853.1 Epiped...
31.5 Translating a nucleotides to AA
We need to translate nucleotide to AA sequences to further explore protein properties. For this purpose, we use the function translate()
from Biostrings.
Yet, you need to make sure that you know the genetic code. The most common are the “SGC0” for the standard genetic code or “SGC1” for the vertebrate mitochondrial code.
In our example, the sequeces were derived from a mitochondrial genome of a frog (so, we should use “SGC1”). Notice that we need to define the genetic code with the function getGeneticCode()
.
## Vertebrate Mitochondrial code
<- getGeneticCode("SGC1")
SGC1
SGC1#TTT TTC TTA TTG TCT TCC TCA TCG TAT TAC TAA TAG TGT TGC TGA TGG CTT CTC CTA CTG CCT CCC CCA CCG CAT CAC CAA CAG CGT CGC CGA CGG ATT ATC ATA ATG
#"F" "F" "L" "L" "S" "S" "S" "S" "Y" "Y" "*" "*" "C" "C" "W" "W" "L" "L" "L" "L" "P" "P" "P" "P" "H" "H" "Q" "Q" "R" "R" "R" "R" "I" "I" "M" "M"
#ACT ACC ACA ACG AAT AAC AAA AAG AGT AGC AGA AGG GTT GTC GTA GTG GCT GCC GCA GCG GAT GAC GAA GAG GGT GGC GGA GGG
#"T" "T" "T" "T" "N" "N" "K" "K" "S" "S" "*" "*" "V" "V" "V" "V" "A" "A" "A" "A" "D" "D" "E" "E" "G" "G" "G" "G"
#attr(,"alt_init_codons")
#[1] "ATT" "ATC" "GTG"
## Translating DNA/RNA:
<- Biostrings::translate(my_nuc_Biostrings_set, genetic.code= SGC1, no.init.codon=FALSE)
my_aa_translated_Biostrings_set
my_aa_translated_Biostrings_set#AAStringSet object of length 3:
# width seq names
#[1] 1628 KVWS*PWNQLLFNMHMQVSAPLWKRPYSPLQGAGAGI*HKFLPMTPSIATPT*DFS...FFIFA*HMLYP*H*HLTHPSPHHPEILTKKMIYLFLPSSLSAFFLFPHFFYYLLET HQ290995.1 Epiped...
#[2] 513 MMITRWLFSTNHKDIGTLYLVFGAWAGMVGTALSLLIRAELSQPGSLLGDDQIYNV...SLISLVAVIIMMFIIWEAFSSKRLFLNAEMTPTNVEWLYGSPPPYHTFEEAVYSKV MW042033.1 Epiped...
#[3] 219 NPMFSIWGMSRNS*NSPKPLNSS*IKSTWVLTGRWPDL*CNRNCPCFRHNLLHGYA...PNSFICLISSNHRSTSSSFSSCPSC*NYNAFNWSKP*HHFLWPS*W*WPHSLSTSL DQ502853.1 Epiped...
If we check each element in the AAstringset object we notice something particular
<- as.character(unlist(my_aa_translated_Biostrings_set[1]))
seqs_as_vector_1
seqs_as_vector_1#[1] "KVWS*PWNQLLFNMHMQVSAPLWKRPYSPLQGAGAGI*HKFLPMTPSIATPT*DFSSD*HWA*AMAWLS*...
<- as.character(unlist(my_aa_translated_Biostrings_set[2]))
seqs_as_vector_2
seqs_as_vector_2#[1] "MMITRWLFSTNHKDIGTLYLVFGAWAGMVGTALSLLIRAELSQPGSLLGDDQIYNVIVTAHAFVMIFFMVM...
<- as.character(unlist(my_aa_translated_Biostrings_set[3]))
seqs_as_vector_3
seqs_as_vector_3#[1] "NPMFSIWGMSRNS*NSPKPLNSS*IKSTWVLTGRWPDL*CNRNCPCFRHNLLHGYAYSNWWIWKLTGTFNN...
Only sequence seqs_as_vector_2
or the second line has no *
or stop codons. The other two: seqs_as_vector_1
and seqs_as_vector_3
have lots of *
or stop codons.
The reason is that seqs_as_vector_1
has ribosomal and tRNAs in the sequence and we have to isolate the coding regions (CDS).
For seqs_as_vector_3
the reason is that the starting base is not the first codon position and we have to shift to the correct one to do the translation.
To fix the translation in seqs_as_vector_3
is easier, we need to indicate that start codon is in position .
## third sequence
<- as.character(unlist(my_nuc_Biostrings_set[3]))
nuc_seq_3
## remove first character
<- sub('.', '', nuc_seq_3)
nuc_seq_3_red
## original versus removed
nuc_seq_3#[1] "AACCCTATATTTAGTATTTGGGGCATGAGCCGGAATAGTAGGAACAGCCCTAAGCCTCTTAATTCGAGCAGA...
nuc_seq_3_red #[1] "ACCCTATATTTAGTATTTGGGGCATGAGCCGGAATAGTAGGAACAGCCCTAAGCCTCTTAATTCGAGCAGA...
## create a new string set and translate
<- getGeneticCode("SGC1")
SGC1 <- DNAStringSet(nuc_seq_3_red)
nuc_seq_3_red <- Biostrings::translate(nuc_seq_3_red, genetic.code= SGC1, no.init.codon=FALSE)
nuc_seq_3_red_translated
nuc_seq_3_red_translated#AAStringSet object of length 1:
# width seq
#[1] 219 TLYLVFGAWAGMVGTALSLLIRAELSQPGSLLGDDQIYNVIVTAHAFVMIFFMVMPILIGGFGNWLV...NMKPPSLTQYQTPLFVWSVLITAVLLLLSLPVLAAGITMLLTDRNLNTTFFDPAGGGDPILYQHLF
# save as vector and replace in original AAStringSet
<- as.character(unlist(nuc_seq_3_red_translated))
AA_seq_3
AA_seq_3 #[1] "TLYLVFGAWAGMVGTALSLLIRAELSQPGSLLGDDQIYNVIVTAHAFVMIFFMVMP....
3] <- AA_seq_3
my_aa_translated_Biostrings_set[
my_aa_translated_Biostrings_set#AAStringSet object of length 3:
# width seq names
#[1] 1628 KVWS*PWNQLLFNMHMQVSAPLWKRPYSPLQGAGAGI*HKFLPMTPSIATPT*DFS...FFIFA*HMLYP*H*HLTHPSPHHPEILTKKMIYLFLPSSLSAFFLFPHFFYYLLET HQ290995.1 Epiped...
#[2] 513 MMITRWLFSTNHKDIGTLYLVFGAWAGMVGTALSLLIRAELSQPGSLLGDDQIYNV...SLISLVAVIIMMFIIWEAFSSKRLFLNAEMTPTNVEWLYGSPPPYHTFEEAVYSKV MW042033.1 Epiped...
#[3] 219 TLYLVFGAWAGMVGTALSLLIRAELSQPGSLLGDDQIYNVIVTAHAFVMIFFMVMP...QTPLFVWSVLITAVLLLLSLPVLAAGITMLLTDRNLNTTFFDPAGGGDPILYQHLF DQ502853.1 Epiped...
We can repeat this process with the first sequence that also has stop codons.
## first sequence
<- as.character(unlist(my_nuc_Biostrings_set[1]))
nuc_seq_1
## remove first character
<- sub('.', '', nuc_seq_1)
nuc_seq_1_red
## original
nuc_seq_1#[1] "AAGGTTTGGTCCTAGCCTTGAAATCAGTTATTATTTAATATACACATGCAAGTCTCCGCACCCCTGTGAAA...
## removed
nuc_seq_1_red#[1] "AGGTTTGGTCCTAGCCTTGAAATCAGTTATTATTTAATATACACATGCAAGTCTCCGCACCCCTGTGAAAA...
## create a new string set and translate
<- DNAStringSet(nuc_seq_1_red)
nuc_seq_1_red <- Biostrings::translate(nuc_seq_1_red, genetic.code= SGC1, no.init.codon=FALSE)
nuc_seq_1_red_translated
nuc_seq_1_red_translated#AAStringSet object of length 1:
# width seq
#[1] 1628 *FGPSLEISYYLMYTCKSPHPCENALILPY*EQEPVSGTNFCP*HLASPHPQGTSAVINIEHKR*LDSVKLN*AGQSGASRRGYTTVAQIDF...INWLFTQMTYCS*NN*AKSHYIYFFYPYFLPSKPIFLSSPNMYYIPNTST*LILLPIILKF*QKK*FIYSCPPHSLPSSYFPIFFTIY*KL
## how many stop codons
<- as.character(unlist(nuc_seq_1_red_translated[1]))
nuc_seq_AA_1
## require stringr package
require(stringr)
<-str_count(nuc_seq_AA_1, fixed("*"))
total_stop_codons <- paste0("Count for character stop codons -- * -- in nuc_seq_AA_1 is this number: ", total_stop_codons)
total_stop_codons_report
total_stop_codons_report#[1] "Count for character stop codons -- * -- in nuc_seq_AA_1 is this number: 115"
## still bad and lots stops codons, remove another one base
<- sub('.', '', nuc_seq_1_red)
nuc_seq_1_red2
## original
nuc_seq_1#[1] "AAGGTTTGGTCCTAGCCTTGAAATCAGTTATTATTTAATATACACATGCAAGTCTCCGCACCCCTGTGAAA...
## removed
nuc_seq_1_red2#[1] "GGTTTGGTCCTAGCCTTGAAATCAGTTATTATTTAATATACACATGCAAGTCTCCGCACCCCTGTGAAAAC...
## create a new string set and translate
<- DNAStringSet(nuc_seq_1_red2)
nuc_seq_1_red2 <- Biostrings::translate(nuc_seq_1_red2, genetic.code= SGC1, no.init.codon=FALSE)
nuc_seq_1_red2_translated
nuc_seq_1_red2_translated#AAStringSet object of length 1:
# width seq
#[1] 1628 GLVLALKSVII*YTHASLRTPVKTPLFSPTGS*SRYQAQIFAHNT*HRHTHKGLQQWLTLSMSDSLTQLN*TEPANLVPAAAVTPLWLKLIS...LTGFLPKWLIAQEMIKQNLTMFTFFILISSLLSLFFYLRLTYTMSLTLAPNSSFSPSSWNFNKKNNLSIPALLTLCLLPISPFFLLFI*NL
## how many stop codons
<- as.character(unlist(nuc_seq_1_red2_translated[1]))
nuc_seq_AA_2
## require stringr package
<-str_count(nuc_seq_AA_2, fixed("*"))
total_stop_codons2 <- paste0("Count for character stop codons -- * -- in nuc_seq_AA_2 is this number: ", total_stop_codons2)
total_stop_codons_report2
total_stop_codons_report2#[1] "Count for character stop codons -- * -- in nuc_seq_AA_2 is this number: 94"
## still bad and lots stops codons, remove another one base
<- sub('.', '', nuc_seq_1_red2)
nuc_seq_1_red3
## original
nuc_seq_1#[1] "AAGGTTTGGTCCTAGCCTTGAAATCAGTTATTATTTAATATACACATGCAAGTCTCCGCACCCCTGTGAAA...
## removed
nuc_seq_1_red3#[1] "GTTTGGTCCTAGCCTTGAAATCAGTTATTATTTAATATACACATGCAAGTCTCCGCACCCCTGTGAAAACG...
## create a new string set and translate
<- DNAStringSet(nuc_seq_1_red3)
nuc_seq_1_red3 <- Biostrings::translate(nuc_seq_1_red3, genetic.code= SGC1, no.init.codon=FALSE)
nuc_seq_1_red3_translated
nuc_seq_1_red3_translated#AAStringSet object of length 1:
# width seq
#[1] 1627 VWS*PWNQLLFNMHMQVSAPLWKRPYSPLQGAGAGI*HKFLPMTPSIATPT*DFSSD*HWA*AMAWLS*TKQSRPIWCQPPRLHHCGSNWFL...H*LAFYPNDLLLKK*LSKISLYLLFLSLFPPF*AYFFIFA*HMLYP*H*HLTHPSPHHPEILTKKMIYLFLPSSLSAFFLFPHFFYYLLET
## how many stop codons
<- as.character(unlist(nuc_seq_1_red3_translated[1]))
nuc_seq_AA_3
## require stringr package
<-str_count(nuc_seq_AA_3, fixed("*"))
total_stop_codons3 <- paste0("Count for character stop codons -- * -- in nuc_seq_AA_3 is this number: ", total_stop_codons3)
total_stop_codons_report3
total_stop_codons_report3#[1] "Count for character stop codons -- * -- in nuc_seq_AA_3 is this number: 131"
## the best option seems to be nuc_seq_1_red2 with the translated nuc_seq_1_red2_translated
<- as.character(unlist(nuc_seq_1_red2_translated))
AA_seq_1
AA_seq_1 #[1] "GLVLALKSVII*YTHASLRTPVKTPLFSPTGS*SRYQAQIFAHNT*HRHTHKGLQQW....
1] <- AA_seq_1
my_aa_translated_Biostrings_set[
my_aa_translated_Biostrings_set#AAStringSet object of length 3:
# width seq names
#[1] 1628 GLVLALKSVII*YTHASLRTPVKTPLFSPTGS*SRYQAQIFAHNT*HRHTHKGLQQWLTLSMSDSLTQLN*TEPANLVPAA...AQEMIKQNLTMFTFFILISSLLSLFFYLRLTYTMSLTLAPNSSFSPSSWNFNKKNNLSIPALLTLCLLPISPFFLLFI*NL HQ290995.1 Epiped...
#[2] 513 MMITRWLFSTNHKDIGTLYLVFGAWAGMVGTALSLLIRAELSQPGSLLGDDQIYNVIVTAHAFVMIFFMVMPILIGGFGNW...LAGMPRRYSDYPDAYTLWNTVSSVGSLISLVAVIIMMFIIWEAFSSKRLFLNAEMTPTNVEWLYGSPPPYHTFEEAVYSKV MW042033.1 Epiped...
#[3] 219 TLYLVFGAWAGMVGTALSLLIRAELSQPGSLLGDDQIYNVIVTAHAFVMIFFMVMPILIGGFGNWLVPLMIGAPDMAFPRM...VSSILGAINFITTTLNMKPPSLTQYQTPLFVWSVLITAVLLLLSLPVLAAGITMLLTDRNLNTTFFDPAGGGDPILYQHLF DQ502853.1 Epiped...
31.6 Finding ORFs in a nucleotide sequence
In most instances, only segments of sequences contain information that can be translated to long string of AA without stop codons (e.g., functional proteins). Those segments are called Open Reading Frames (ORFs) as defined here and here which is the part of a reading frame that has the ability to be translated to protein (it has not stop codons). In general, you most likely want to find the longest ORFs in a sequence, yet not every ORF has the potential to be translated to a functional protein or be an exon of larger gene. Finding ORFs is crucial to discover novel genes, annotate unknown sequences of DNA, find exons among others. However, you can select to find a CoDing Sequence or CDS which is a continuous reading frame that do not need to begin or end with start or stop codons. These CDSs are defined as sequences of nucleotides (e.g., DNA or RNA) that determine a corresponding one od amino acids in a protein. You should consider that an ORFs and CDSs are related, yet ORFs are a continuous stretch of DNA codons that begins with a start codon and ends at a STOP codon.
There are many tools to find such ORFs or CDSs, we will use the R-packages systemPipeR and it has several vignettes for you to explore.
# install and load systemPipeR
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
::install("systemPipeR")
BiocManager
# load library
library(systemPipeR)
library(rentrez)
setwd("~/Desktop/Teach_R/A_Class_powerpoints/Spring_2024/session_12/working_with_proteins")
# we will work with nuclear coding segment for gene like a hemoglobin of a frog
<- "BT081428"
hb_nuc <- entrez_search(db="nuccore", term= hb_nuc)
hb_IDs <- entrez_fetch(db="nuccore", id=hb_IDs$ids, rettype="fasta")
hb_nuc_fasta
hb_nuc_fasta#[1] ">BT081428.1 Rana catesbeiana clone rcat-evr-539-243 Hemoglobin subunit alpha-3 putative mRNA,
# complete cds\nGGGGGGGACTTTACTTGAGTTGCCAAGCTAATCTAACGTGTACTACACCAACTCCCGATCATCACAATGA\nGTCTATCTGCA...
write(hb_nuc_fasta, "Rana_hb_nuc_seqs.fasta")
# define path to file with sequences in fasta
<- "~/Desktop/Teach_R/A_Class_powerpoints/Spring_2024/session_12/working_with_proteins/Rana_hb_nuc_seqs.fasta"
my_nuc_fasta_file_path
# To store ORF sequences (you need indexed genome .fai file):
<- readDNAStringSet(filepath = my_nuc_fasta_file_path,
my_base_nuc_Biostrings_set format = "fasta")
my_base_nuc_Biostrings_set#DNAStringSet object of length 1:
# width seq names
#[1] 677 GGGGGGGACTTTACTTGAGTTGCCAAGCTAATCTAACGTGTACTACACCAACTCCC...AAATTGTCCATCAAATATAAATCTGATGTACTCTTAAATAAAACTGAAAATAATTT BT081428.1 Rana c...
# get longest CDS with option mode = "CDS"
<- predORF(my_base_nuc_Biostrings_set,
my_CDSs n = 1,
type = "grl",
mode = "CDS",
strand = "sense",
longest_disjoint = FALSE,
startcodon = "ATG",
stopcodon = c("TAA", "TAG", "TGA"))
my_CDSs#GRangesList object of length 1:
#$`BT081428.1 Rana catesbeiana clone rcat-evr-539-243 Hemoglobin subunit alpha-3 putative mRNA, complete cds`
#GRanges object with 1 range and 2 metadata columns:
# seqnames ranges strand | subject_id inframe2end
# <Rle> <IRanges> <Rle> | <integer> <numeric>
# [1] BT081428.1 Rana cate.. 67-495 + | 1 3
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths
# get CDS sequences
<- getSeq(my_base_nuc_Biostrings_set, my_CDSs)
cds_seqs <- cds_seqs[[1]]
my_CDS_long
my_CDS_long#DNAStringSet object of length 1:
# width seq
#[1] 429 ATGAGTCTATCTGCAAGTGAGAAGGCTGCTGTGCTTTCCATAGTGGGAAAGATTGGATCTCAGGCCA...GTTCAGGCTGCCTGGGACAAATTCCTTGCTCTAGTTTCTGCAGTCCTCACCTCCAAGTACAGATAA
# name of fasta seq
<- names(cds_seqs)[1]
name_seq <- paste0(">Long_CDS_",name_seq,"\n")
name_seq
# build vector to write
<- as.character(unlist(my_CDS_long))
my_CDS_long_vec <- paste0(name_seq,my_CDS_long_vec)
final_my_CDS_long_vec cat(final_my_CDS_long_vec)
#>Long_CDS_BT081428.1 Rana catesbeiana clone rcat-evr-539-243 Hemoglobin subunit alpha-3 putative mRNA, complete cds
#ATGAGTCTATCTGCAAGTGAGAAGGCTGCTGTGCTTTCCATAGTGGGAAAGATTGGATCTCAGGCCAGTGCTCTGGGCTCTGAGGCTTTGACCAGGCTTTTCCTTAGCTTCCCCCA...
# write fasta file
write(final_my_CDS_long_vec, "Rana_hb_CDS_seqs.fasta")
Another approach is with ORFik and it has several vignettes for you to explore.
# install and load ORFik
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
::install("ORFik")
BiocManager
# load library ORFik
library(ORFik)
# define path to file with sequences in fasta
<- "~/Desktop/Teach_R/my_working_directory/Rana_hb_nuc_seqs.fasta"
my_nuc_fasta_file_path
# identify ORFs in fasta file
<- findORFsFasta(my_nuc_fasta_file_path)
orfs
orfs#GRanges object with 31 ranges and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# [1] BT081428.1 67-495 +
# [2] BT081428.1 20-31 +
# [3] BT081428.1 77-109 +
# [4] BT081428.1 119-157 +
# [5] BT081428.1 206-214 +
# ... ... ... ...
# [27] BT081428.1 192-281 -
# [28] BT081428.1 282-320 -
# [29] BT081428.1 333-380 -
# [30] BT081428.1 432-587 -
# [31] BT081428.1 612-635 -
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths
# To store ORF sequences (you need indexed genome .fai file):
<- FaFile(my_nuc_fasta_file_path)
my_nuc_fasta <- getSeq(my_nuc_fasta, orfs)
orf_seqs
orf_seqs#DNAStringSet object of length 31:
# width seq names
# [1] 429 ATGAGTCTATCTGCAAGTGAGAAGGCTGCTGTGCTTTCCATAGTGGGAAAGATTGG...CTGGGACAAATTCCTTGCTCTAGTTTCTGCAGTCCTCACCTCCAAGTACAGATAA BT081428.1
# [2] 12 TTGCCAAGCTAA BT081428.1
# [3] 33 CTGCAAGTGAGAAGGCTGCTGTGCTTTCCATAG BT081428.1
# [4] 39 TTGGATCTCAGGCCAGTGCTCTGGGCTCTGAGGCTTTGA BT081428.1
# [5] 9 TTGACCTGA BT081428.1
# ... ... ...
#[27] 90 TTGGCTGCACCTGCAAGAGCATTAATAATCTTTCCACCATGTTTGTTGAGGTCAGCAGAGCCAGGGGTCAGGTCAAAGTGGGGGAAATAA BT081428.1
#[28] 39 CTGAGGGAGGACAGGCTGCCAGCAAGGTCATCCAGATGA BT081428.1
#[29] 48 ATGCGAGCCAGCAGAGGAAAGTTTCCTGGATCCACTCTTAGGTTGTAG BT081428.1
#[30] 156 TTGGTTGTAACTGAAATGGGCCCAAAGAACACTTCATTGCACGTTGTATTACTTGC...CTTGGAGGTGAGGACTGCAGAAACTAGAGCAAGGAATTTGTCCCAGGCAGCCTGA BT081428.1
#[31] 24 TTGATGGACAATTTACACAAGTAA BT081428.1
# order by the longest to shortest ORF
<- orf_seqs[order(width(orf_seqs), decreasing = T),]
orf_seqs_ordered
orf_seqs_ordered #DNAStringSet object of length 31:
# width seq names
# [1] 429 ATGAGTCTATCTGCAAGTGAGAAGGCTGCTGTGCTTTCCATAGTGGGAAAGATTGG...CTGGGACAAATTCCTTGCTCTAGTTTCTGCAGTCCTCACCTCCAAGTACAGATAA BT081428.1
# [2] 258 ATGCAGGCCACTGAGGGAGGACAGGCTGCCAGCAAGGTCATCCAGATGATTGGCTG...GAGATCCAATCTTTCCCACTATGGAAAGCACAGCAGCCTTCTCACTTGCAGATAG BT081428.1
# [3] 156 TTGGTTGTAACTGAAATGGGCCCAAAGAACACTTCATTGCACGTTGTATTACTTGC...CTTGGAGGTGAGGACTGCAGAAACTAGAGCAAGGAATTTGTCCCAGGCAGCCTGA BT081428.1
# [4] 138 CTGCAGTCCTCACCTCCAAGTACAGATAAATCCACCCTGTATAACAGGCTACATCA...AATGAAGTGTTCTTTGGGCCCATTTCAGTTACAACCAAGTTGCAATGTTTTTTGA BT081428.1
# [5] 126 CTGGCTCTGCTGACCTCAACAAACATGGTGGAAAGATTATTAATGCTCTTGCAGGT...GATGACCTTGCTGGCAGCCTGTCCTCCCTCAGTGGCCTGCATGCCTACAACCTAA BT081428.1
# ... ... ...
#[27] 12 TTGCCAAGCTAA BT081428.1
#[28] 9 TTGACCTGA BT081428.1
#[29] 9 TTGGTGTAG BT081428.1
#[30] 9 CTGATGTAG BT081428.1
#[31] 6 TTGTAA BT081428.1
# name of fasta seq
<- names(orf_seqs_ordered)[1]
name_seq_orfik <- paste0(">Long_ORF_orfik_",name_seq_orfik,"\n")
name_seq_orfik
# build vector to write
<- as.character(unlist(orf_seqs_ordered[1]))
my_CDS_long_orfik_vec <- paste0(name_seq_orfik,my_CDS_long_orfik_vec)
final_my_CDS_long_orfik_vec cat(final_my_CDS_long_orfik_vec)
#>Long_ORF_orfik_BT081428.1
#ATGAGTCTATCTGCAAGTGAGAAGGCTGCTGTGCTTTCCATAGTGGGAAAGATTGGATCTCAGGCCAGTGCTCTGGGCTCTGAGGCTTTGACCAGGCTTTTCCTTAGCTTCCCCCAGACCAAGACTTA...
# write fasta file
write(final_my_CDS_long_orfik_vec, "Rana_hb_ORF_orfik_seqs.fasta")
#tools:
#http://csbg.cnb.csic.es/PB/
31.7 Physochemical properties of an amino acid sequence
The physicochemical properties of amino acids are crucial to determine sequence profiles, folding and function. We can get some properties given a continuous amino acid sequence from the R package Peptides and it manual for you to explore particular properties.
# load or install relevant packages
install.packages("seqinr")
install.packages("Peptides")
require(seqinr)
require(Peptides)
## path to fasta files
<- "~/Desktop/Teach_R/A_Class_powerpoints/Spring_2024/session_12/working_with_proteins/Rana_hb_CDS_seqs.fasta"
hb_rana_path
## read fasta and translate
<- seqinr::read.fasta(file = hb_rana_path, seqtype = "AA")
nucleotide_hb_rana_fasta
nucleotide_hb_rana_fasta#$Long_CDS_BT081428.1
# [1] "A" "T" "G" "A" "G" "T" "C" "T" "A" "T" "C" "T" "G" "C" "A" "A" "G" "T" "G" "A" "G" "A" "A" "G" "G" "C" "T" "G" "C" "T" "G" "T" "G" "C" "T" "T" "T" "C" "C" "A" "T" "A" "G" "T" "G" "G" "G"
# [48] "A" "A" "A" "G" "A" "T" "T" "G" "G" "A" "T" "C" "T" "C" "A" "G" "G" "C" "C" "A" "G" "T" "G" "C" "T" "C" "T" "G" "G" "G" "C" "T" "C" "T" "G" "A" "G" "G" "C" "T" "T" "T" "G" "A" "C" "C" "A"
# [95] "G" "G" "C" "T" "T" "T" "T" "C" "C" "T" "T" "A" "G" "C" "T" "T" "C" "C" "C" "C" "C" "A" "G" "A" "C" "C" "A" "A" "G" "A" "C" "T" "T" "A" "T" "T" "T" "C" "C" "C" "C" "C" "A" "C" "T" "T" "T"
#[142] "G" "A" "C" "C" "T" "G" "A" "C" "C" "C" "C" "T" "G" "G" "C" "T" "C" "T" "G" "C" "T" "G" "A" "C" "C" "T" "C" "A" "A" "C" "A" "A" "A" "C" "A" "T" "G" "G" "T" "G" "G" "A" "A" "A" "G" "A" "T"
#[189] "T" "A" "T" "T" "A" "A" "T" "G" "C" "T" "C" "T" "T" "G" "C" "A" "G" "G" "T" "G" "C" "A" "G" "C" "C" "A" "A" "T" "C" "A" "T" "C" "T" "G" "G" "A" "T" "G" "A" "C" "C" "T" "T" "G" "C" "T" "G"
#[236] "G" "C" "A" "G" "C" "C" "T" "G" "T" "C" "C" "T" "C" "C" "C" "T" "C" "A" "G" "T" "G" "G" "C" "C" "T" "G" "C" "A" "T" "G" "C" "C" "T" "A" "C" "A" "A" "C" "C" "T" "A" "A" "G" "A" "G" "T" "G"
#[283] "G" "A" "T" "C" "C" "A" "G" "G" "A" "A" "A" "C" "T" "T" "T" "C" "C" "T" "C" "T" "G" "C" "T" "G" "G" "C" "T" "C" "G" "C" "A" "T" "T" "A" "T" "C" "C" "A" "G" "G" "T" "G" "G" "T" "C" "T" "T"
#[330] "G" "G" "C" "T" "A" "C" "T" "C" "A" "C" "T" "T" "C" "C" "C" "T" "G" "G" "T" "G" "A" "T" "T" "T" "C" "A" "C" "T" "G" "C" "T" "G" "A" "G" "G" "T" "T" "C" "A" "G" "G" "C" "T" "G" "C" "C" "T"
#[377] "G" "G" "G" "A" "C" "A" "A" "A" "T" "T" "C" "C" "T" "T" "G" "C" "T" "C" "T" "A" "G" "T" "T" "T" "C" "T" "G" "C" "A" "G" "T" "C" "C" "T" "C" "A" "C" "C" "T" "C" "C" "A" "A" "G" "T" "A" "C"
#[424] "A" "G" "A" "T" "A" "A"
#attr(,"name")
#[1] "Long_CDS_BT081428.1"
#attr(,"Annot")
#[1] ">Long_CDS_BT081428.1 Rana catesbeiana clone rcat-evr-539-243 Hemoglobin subunit alpha-3 putative mRNA, complete cds"
#attr(,"class")
#[1] "SeqFastaAA"
## some squence manipulation
<- paste0(unlist(nucleotide_hb_rana_fasta), collapse ="")
nucleotide_hb_rana_fasta_vec <- tolower(nucleotide_hb_rana_fasta_vec)
nucleotide_hb_rana_fasta_vec <- seqinr::s2c(nucleotide_hb_rana_fasta_vec)
nucleotide_hb_rana_fasta_vec
nucleotide_hb_rana_fasta_vec# [1] "a" "t" "g" "a" "g" "t" "c" "t" "a" "t" "c" "t" "g" "c" "a" "a" "g" "t" "g" "a" "g" "a" "a" "g" "g" "c" "t" "g" "c" "t" "g" "t" "g" "c" "t" "t" "t" "c" "c" "a" "t" "a" "g" "t" "g" "g" "g"
# [48] "a" "a" "a" "g" "a" "t" "t" "g" "g" "a" "t" "c" "t" "c" "a" "g" "g" "c" "c" "a" "g" "t" "g" "c" "t" "c" "t" "g" "g" "g" "c" "t" "c" "t" "g" "a" "g" "g" "c" "t" "t" "t" "g" "a" "c" "c" "a"
# [95] "g" "g" "c" "t" "t" "t" "t" "c" "c" "t" "t" "a" "g" "c" "t" "t" "c" "c" "c" "c" "c" "a" "g" "a" "c" "c" "a" "a" "g" "a" "c" "t" "t" "a" "t" "t" "t" "c" "c" "c" "c" "c" "a" "c" "t" "t" "t"
#[142] "g" "a" "c" "c" "t" "g" "a" "c" "c" "c" "c" "t" "g" "g" "c" "t" "c" "t" "g" "c" "t" "g" "a" "c" "c" "t" "c" "a" "a" "c" "a" "a" "a" "c" "a" "t" "g" "g" "t" "g" "g" "a" "a" "a" "g" "a" "t"
#[189] "t" "a" "t" "t" "a" "a" "t" "g" "c" "t" "c" "t" "t" "g" "c" "a" "g" "g" "t" "g" "c" "a" "g" "c" "c" "a" "a" "t" "c" "a" "t" "c" "t" "g" "g" "a" "t" "g" "a" "c" "c" "t" "t" "g" "c" "t" "g"
#[236] "g" "c" "a" "g" "c" "c" "t" "g" "t" "c" "c" "t" "c" "c" "c" "t" "c" "a" "g" "t" "g" "g" "c" "c" "t" "g" "c" "a" "t" "g" "c" "c" "t" "a" "c" "a" "a" "c" "c" "t" "a" "a" "g" "a" "g" "t" "g"
#[283] "g" "a" "t" "c" "c" "a" "g" "g" "a" "a" "a" "c" "t" "t" "t" "c" "c" "t" "c" "t" "g" "c" "t" "g" "g" "c" "t" "c" "g" "c" "a" "t" "t" "a" "t" "c" "c" "a" "g" "g" "t" "g" "g" "t" "c" "t" "t"
#[330] "g" "g" "c" "t" "a" "c" "t" "c" "a" "c" "t" "t" "c" "c" "c" "t" "g" "g" "t" "g" "a" "t" "t" "t" "c" "a" "c" "t" "g" "c" "t" "g" "a" "g" "g" "t" "t" "c" "a" "g" "g" "c" "t" "g" "c" "c" "t"
#[377] "g" "g" "g" "a" "c" "a" "a" "a" "t" "t" "c" "c" "t" "t" "g" "c" "t" "c" "t" "a" "g" "t" "t" "t" "c" "t" "g" "c" "a" "g" "t" "c" "c" "t" "c" "a" "c" "c" "t" "c" "c" "a" "a" "g" "t" "a" "c"
#[424] "a" "g" "a" "t" "a" "a"
## tranlation is in the defaul nucleotide codon code: numcode = 1
<- seqinr::translate(seq = nucleotide_hb_rana_fasta_vec, frame = 0, sens = "F", numcode = 1, NAstring = "X", ambiguous = FALSE)
AA_hb_rana_fasta
AA_hb_rana_fasta# [1] "M" "S" "L" "S" "A" "S" "E" "K" "A" "A" "V" "L" "S" "I" "V" "G" "K" "I" "G" "S" "Q" "A" "S" "A" "L" "G" "S" "E" "A" "L" "T" "R" "L" "F" "L" "S" "F" "P" "Q" "T" "K" "T" "Y" "F" "P" "H" "F"
# [48] "D" "L" "T" "P" "G" "S" "A" "D" "L" "N" "K" "H" "G" "G" "K" "I" "I" "N" "A" "L" "A" "G" "A" "A" "N" "H" "L" "D" "D" "L" "A" "G" "S" "L" "S" "S" "L" "S" "G" "L" "H" "A" "Y" "N" "L" "R" "V"
# [95] "D" "P" "G" "N" "F" "P" "L" "L" "A" "R" "I" "I" "Q" "V" "V" "L" "A" "T" "H" "F" "P" "G" "D" "F" "T" "A" "E" "V" "Q" "A" "A" "W" "D" "K" "F" "L" "A" "L" "V" "S" "A" "V" "L" "T" "S" "K" "Y"
#[142] "R" "*"
<- seqinr::c2s(AA_hb_rana_fasta)
AA_hb_rana_fasta_string
AA_hb_rana_fasta_string#[1] "MSLSASEKAAVLSIVGKIGSQASALGSEALTRLFLSFPQTKTYFPHFDLTPGSADLNKHGGKIINALAGAANHLDDLAGSLSSLSGLHAYNLRVDPGNFPLLARIIQVVLATHFPGDFTAEVQAAWDKFLALVSAVLTSKYR*"
## remove stop codons
<- gsub("*","",AA_hb_rana_fasta_string, fixed = TRUE)
AA_hb_rana_fasta_string_not_stop
AA_hb_rana_fasta_string_not_stop#[1] "MSLSASEKAAVLSIVGKIGSQASALGSEALTRLFLSFPQTKTYFPHFDLTPGSADLNKHGGKIINALAGAANHLDDLAGSLSSLSGLHAYNLRVDPGNFPLLARIIQVVLATHFPGDFTAEVQAAWDKFLALVSAVLTSKYR"
## 'Peptides' properties
::boman(AA_hb_rana_fasta_string_not_stop)
Peptides#[1] 0.618662
::charge(AA_hb_rana_fasta_string_not_stop, pH= 7.8, pKscale = "EMBOSS")
Peptides#[1] 1.081336
::charge(AA_hb_rana_fasta_string_not_stop, pH= 7, pKscale = "EMBOSS")
Peptides#[1] 2.182985
::hydrophobicity(AA_hb_rana_fasta_string_not_stop,"KyteDoolittle")
Peptides#[1] 0.2619718
::pI(AA_hb_rana_fasta_string_not_stop,"EMBOSS")
Peptides#[1] 8.974793
For the COX1 subunit that already was encoded as an amino acid sequence.
# load relevant packages
require(seqinr)
require(Peptides)
## path to fasta files
<- "~/Desktop/Teach_R/A_Class_powerpoints/Spring_2024/session_12/working_with_proteins/Eanthonyi_COI_prot_seqs.fasta"
COI_Eanthonyi_path
## read fasta and translate only the first sequence
<- seqinr::read.fasta(file = COI_Eanthonyi_path, seqtype = "AA", as.string = TRUE)
AA_COI_Eanthonyi_fasta <- AA_COI_Eanthonyi_fasta[1]
AA_COI_Eanthonyi_fasta_use
AA_COI_Eanthonyi_fasta_use#$QQL01627.1
#[1] "VMITRWLFSTNHKDIGTLYLVFGAWAGMVGTALSLLIRAELSQPGSLLGDDQIYNVIVTAHAFVMIFFMVMPILIGGFGNWLVPLMIGAPDMAFPRMNNMSFWLLPPSFLLLLASAGVEAGAGTGWTVYPPLAGNLAHAGPSVDLTIFSLHLAGVSSILGAINFITTTLNMKPPSLTQYQTPLFVWSVLITAVLLLLSLPVLAAGITMLLTDRNLNTTFFDPAGGGDPILYQHLFWFFGHPEVYILILPGFGIISHVVTFYSSKKEPFGYMGMVWAMMSIGLLGFIVWAHHMFTTDLNVDTRAYFTSATMIIAIPTGVKVFSWLATMHGGIIKWDAAMLWALGFIFLFTVGGLTGIVLANSSLDIVLHDTYYVVAHFHYVLSMGAVFAIMAGFVHWFPLFTGFTLHEAWTKIHFGVMFAGVNLTFFPQHFLGLAGMPRRYSDYPDAYTLWNTVSSVGSLISLVAVIIMMFIIWEAFSSKRLFLNAEMTPTNVEWLYGSPPPYHTFEEAVYSKV"
#attr(,"name")
#[1] "QQL01627.1"
#attr(,"Annot")
#[1] ">QQL01627.1 cytochrome oxidase subunit I, partial (mitochondrion) [Epipedobates anthonyi]"
#attr(,"class")
#[1] "SeqFastaAA"
## some squence manipulation
<- seqinr::c2s(AA_COI_Eanthonyi_fasta_use)
A_COI_Eanthonyi_fasta_vec
A_COI_Eanthonyi_fasta_vec# [1] "VMITRWLFSTNHKDIGTLYLVFGAWAGMVGTALSLLIRAELSQPGSLLGDDQIYNVIVTAHAFVMIFFMVMPILIGGFGNWLVPLMIGA...
## remove stop codons
<- gsub("*","",A_COI_Eanthonyi_fasta_vec, fixed = TRUE)
A_COI_Eanthonyi_fasta_vec_not_stop
A_COI_Eanthonyi_fasta_vec_not_stop# [1] "VMITRWLFSTNHKDIGTLYLVFGAWAGMVGTALSLLIRAELSQPGSLLGDDQIYNVIVTAHAFVMIFFMVMPILIGGFGNWLVPLMIGA...
## 'Peptides' properties
::boman(A_COI_Eanthonyi_fasta_vec_not_stop)
Peptides#[1] -0.4054776
::charge(A_COI_Eanthonyi_fasta_vec_not_stop, pH= 7.8, pKscale = "EMBOSS")
Peptides#[1] -7.372694
::charge(A_COI_Eanthonyi_fasta_vec_not_stop, pH= 7, pKscale = "EMBOSS")
Peptides#[1] -3.690809
::hydrophobicity(A_COI_Eanthonyi_fasta_vec_not_stop,"KyteDoolittle")
Peptides#[1] 0.7395712
::pI(A_COI_Eanthonyi_fasta_vec_not_stop,"EMBOSS")
Peptides#[1] 6.601388