Session 31 – Working with Proteins

Most bioinformatics uses sequences from proteins derived from protein sequencing or translation (DNA/RNA to protein) from nucleotide sequences. Such protein analyses usually include sequence analyses (e.g., amino acid composition), protein 2D-3D structure analyses, and comparison between organisms or species (evolution analyses). At the most basic stage, we can compare an amino acid sequence of an unknown protein to those in the ever-growing databases of proteins with known functions. This approach will help us to identify and possible infer protein function, location of expression (e.g., cell region or tissue), and new/novel changes on the amino acid sequence. Protein structure analyses usually involve complex mathematical calculations and biophysical studies to infer how protein looks in a living organism. Using bioinformatics can help us to infer the folding and stability of proteins and to predict the secondary structures from amino acid sequences. Lastly, as we explore the diversity of life as more and more species have their genomes/transcriptomes sequenced, we might want to infer the connection between the phenotype (how an organism looks like or its observable characteristics) and the underlying genotype (genome and gene sequences) as there are expressed in response to the environment. For this purpose, we can use phylogenetic methods to find homologous proteins and similar protein regions that might perform similar enzymatic functions. In this session, we will explore how to work with amino acid sequences.

31.1 Sequences in FASTA format

Nucleotide and amino sequences can be encoded in FASTA that is a text-based format where a single letter represent a nucleotide or amino acid. FASTA has a standard configuration that includes two main components: (1) A single-line description that starts with “>” symbol. This usually includes information for the user to identify the gene or protein sequences that might follow in the next line; and (2) A single-line or wrapped text that includes the nucleotide or amino acid sequence. This second line includes very specific codes for nucleotides and amino acids in the standard IUB/IUPAC and additional codes.

The nucleic acid codes supported are:

Nucleic Acid Code Meaning Mnemonic
A A Adenine
C C Cytosine
G G Guanine
T T Thymine
U U Uracil
(i) i inosine (non-standard)
R A or G (I) puRine
Y C, T or U pYrimidines
K G, T or U bases which are Ketones
M A or C bases with aMino groups
S C or G Strong interaction
W A, T or U Weak interaction
B not A (i.e. C, G, T or U) B comes after A
D not C (i.e. A, G, T or U) D comes after C
H not G (i.e., A, C, T or U) H comes after G
V neither T nor U (i.e. A, C or G) V comes after U
N A C G T U Nucleic acid
- gap of indeterminate length

The amino acid codes supported (22 amino acids and 3 special codes) are:

Amino Acid Code Symbol Meaning
A ALA alanine
B ASX aspartate or asparagine
C CYS cystine
D ASP aspartate
E GLU glutamate
F PHE phenylalanine
G GLY glycine
H HIS histidine
I ILE isoleucine
K LYS lysine
L LEU leucine
M MET methionine
N ASN asparagine
P PRO proline
Q GLN glutamine
R ARG arginine
S SER serine
T THR threonine
U selenocysteine
V VAL valine
W TRP tryptophan
Y TYR tyrosine
Z GLX glutamate or glutamine
X any
* translation stop
- gap of indeterminate length

More information can be found here.

31.2 Fetching AA sequences from NCBI

1) To import amino acid (AA) sequences into R, we want to read a file with AA sequences to create a character vector contains text (e.g., sequence of letters for nucleotide or amino acid sequence). In R, you can easily interact with NCBI database and fetch sequences as vectors without the need to save those sequences into a file. For example, we will download protein sequences of a COX1 mitochondrial gene of Epipedobates anthonyi.

We will follow the example from previous sessions using the R-package rentrez.

library(rentrez)
# if you need to install R-package 'rentrez', use install.packages("rentrez")

# Download some amino sequences from NCBI for Epipedobates anthonyi[Organism] that correspond to the gene COI
froggy_name_COX1_prot <- "Epipedobates anthonyi[Organism] AND COI[Gene]"
froggy_seq_IDs_prot <- entrez_search(db="protein", term= froggy_name_COX1_prot)
froggy_seq_IDs_prot
#Entrez search result with 2 hits (object contains 2 IDs and no web_history object)
# Search term (as translated):  "Epipedobates anthonyi"[Organism] AND COI[Gene] 
 

# revising the structure of 'froggy_seq_IDs_prot' that there are 2 sequence in NCBI protein database
str(froggy_seq_IDs_prot)
#List of 5
# $ ids             : chr [1:2] "1952638368" "110332819"
# $ count           : int 2
# $ retmax          : int 2
# $ QueryTranslation: chr "\"Epipedobates anthonyi\"[Organism] AND COI[Gene]"
# $ file            :Classes 'XMLInternalDocument', 'XMLAbstractDocument' <externalptr> 
# - attr(*, "class")= chr [1:2] "esearch" "list"

froggy_seqs_prot_fasta <- entrez_fetch(db="protein", id=froggy_seq_IDs_prot$ids, rettype="fasta")
cat(froggy_seqs_prot_fasta)
#>QQL01627.1 cytochrome oxidase subunit I, partial (mitochondrion) [Epipedobates anthonyi]
#VMITRWLFSTNHKDIGTLYLVFGAWAGMVGTALSLLIRAELSQPGSLLGDDQIYNVIVTAHAFVMIFFMV
#MPILIGGFGNWLVPLMIGAPDMAFPRMNNMSFWLLPPSFLLLLASAGVEAGAGTGWTVYPPLAGNLAHAG
#PSVDLTIFSLHLAGVSSILGAINFITTTLNMKPPSLTQYQTPLFVWSVLITAVLLLLSLPVLAAGITMLL
#TDRNLNTTFFDPAGGGDPILYQHLFWFFGHPEVYILILPGFGIISHVVTFYSSKKEPFGYMGMVWAMMSI
#GLLGFIVWAHHMFTTDLNVDTRAYFTSATMIIAIPTGVKVFSWLATMHGGIIKWDAAMLWALGFIFLFTV
#GGLTGIVLANSSLDIVLHDTYYVVAHFHYVLSMGAVFAIMAGFVHWFPLFTGFTLHEAWTKIHFGVMFAG
#VNLTFFPQHFLGLAGMPRRYSDYPDAYTLWNTVSSVGSLISLVAVIIMMFIIWEAFSSKRLFLNAEMTPT
#NVEWLYGSPPPYHTFEEAVYSKV
#
#>ABG67395.1 cytochrome oxidase subunit I, partial (mitochondrion) [Epipedobates anthonyi]
#TLYLVFGAWAGMVGTALSLLIRAELSQPGSLLGDDQIYNVIVTAHAFVMIFFMVMPILIGGFGNWLVPLM
#IGAPDMAFPRMNNMSFWLLPPSFLLLLASTGVEAGAGTGWTVYPPLAGNLAHAGPSVDLTIFSLHLAGVS
#SILGAINFITTTLNMKPPSLTQYQTPLFVWSVLITAVLLLLSLPVLAAGITMLLTDRNLNTTFFDPAGGG
#DPILYQHLF

We can save these sequences in a text file in your working directory.

# this is exclusive to your OWN COMPUTER change it accordingly
setwd("~/Desktop/Teach_R/A_Class_powerpoints/Spring_2024/session_12/working_with_proteins")
write(froggy_seqs_prot_fasta, "Eanthonyi_COI_prot_seqs.fasta")

2) Alternatively, you can import nucleotide sequences into R and then translate them to protein. As above, we want to read a file with nucleotide sequences to create a character vector contains text (e.g., sequence of letters for nucleotide sequences). In R, you can easily interact with NCBI database and fetch sequences as vectors without the need to save those sequences into a file.

We will follow the example from previous sessions using the R-package rentrez. However, we will work with a more complicated sequence for the mitochondrial ND2 gene.

library(rentrez)
# if you need to install R-package 'rentrez', use install.packages("rentrez")

# Download some amino sequences from NCBI for Epipedobates anthonyi[Organism] that correspond to the gene ND2
froggy_name_ND2_nuc <- "Epipedobates anthonyi[Organism] AND ND2[Gene]"
froggy_seq_IDs_nuc <- entrez_search(db="nuccore", term= froggy_name_ND2_nuc)
froggy_seq_IDs_nuc
#Entrez search result with 1 hits (object contains 1 IDs and no web_history object)
# Search term (as translated):  "Epipedobates anthonyi"[Organism] AND ND2[Gene] 
 

# revising the structure of 'froggy_seq_IDs_nuc' that there are 1 sequence in NCBI protein database
str(froggy_seq_IDs_nuc)
#List of 5
# $ ids             : chr "328728262"
# $ count           : int 1
# $ retmax          : int 1
# $ QueryTranslation: chr "\"Epipedobates anthonyi\"[Organism] AND ND2[Gene]"
# $ file            :Classes 'XMLInternalDocument', 'XMLAbstractDocument' <externalptr> 
# - attr(*, "class")= chr [1:2] "esearch" "list"

froggy_seqs_nuc_fasta <- entrez_fetch(db="nuccore", id=froggy_seq_IDs_nuc$ids, rettype="fasta")
cat(froggy_seqs_nuc_fasta)
#>HQ290995.1 Epipedobates anthonyi 12S ribosomal RNA gene, partial sequence; tRNA-Val, 16S ribosomal RNA, and tRNA-Leu genes, complete sequence; NADH dehydrogenase subunit 1 (ND1) gene, partial cds; tRNA-Ile, tRNA-Gln, and tRNA-Met genes, complete sequence; and NADH dehydrogenase subunit 2 (ND2) gene, complete cds; mitochondrial
#AAGGTTTGGTCCTAGCCTTGAAATCAGTTATTATTTAATATACACATGCAAGTCTCCGCACCCCTGTGAA
#AACGCCCTTATTCTCCCCTACAGGGAGCAGGAGCCGGTATCAGGCACAAATTTTTGCCCATAACACCTAG
#CATCGCCACACCCACAAGGGACTTCAGCAGTGATTAACATTGAGCATAAGCGATAGCTTGACTCAGTTAA
#ACTAAACAGAGCCGGCCAATCTGGTGCCAGCCGCCGCGGTTACACCACTGTGGCTCAAATTGATTTCTAT
#CGGCGTAAAGCGTGATTAAAGATTAATCTTTTGGAGTTAAACTAAAATTAAGCTGTGACACGCTTATTAC
#CTAAGAAAATCATAAACGAAAGCTACTCCAATATTACCAACTTGAATTCACGACAACTGAGGAACAAACT
#GGGATTAGATACCCCACTATGCTCAGTCGTAAACTTTAACTTACATCTTTCTACTCGCCAGGGAACTACG
#AGCTAAGCTTAAAACCCAAAGGACTTGACGGTACCCCATATCCCCCTAGAGGAGCCTGTCCTATAATCGA
#TAACCCCCGTTTTACCTCACCATTTTTTGTTAATCAGCCTGTATACCTCCGTCGTCAGCTTACCACGTGA
#GCGTGTGAGCTAAATGTTTTTTTCAACCACACGTCAGGTCAAGGTGCAGCAAATAAAATGGAAAGAGATG
#GGCTACACTCTCTATTCTAGAATAAACAAAAGACTAAATGAAACCCGGTCAGAAGGCGGATTTAGCAGTA
#AAATGAAACTAGAGCGTTCATTTAAACATGGCACTGGGGTGTGTACACACCGCCCGTCACCCTCCTCAAT
#GCTTCAAAACAGTATTTCACACTTTTTAGCTCCACAGAAGAGGCAAGTCGTAACATGGTAAGTATACCGG
#AAGGTATGCTTGGAATCAAAATGTAACTTAAAATAAAGTATTTCGCTTACACCGAAAAAACATCAGTGTA
#ACCCTGATCATTTTGAGCTAAAAATTTAGCCCCACCCATCCGCATATTCTTTCCGTATTTATTCCTATAA
#ACTAAAACATTTTTTATATTTAGTAAAGGCGATTAAAAAATTTTTATGGAGCAATAACCACAGTACCGCA
#AGGGAAAAATGAAATAAAAATGAAACATTCTCAAGCATAAAAAAGTAGAGCTAAATCCTCGTACCTTTTG
#CATCATGATTTAACTAGTCTAATAAAGCAAAAAGATTTTTAGTTTTAATTCCCGAACCTAGACGAGCTAT
#...

This sequence contains both coding (i.e., translatable to AA sequences) and non-coding sequences (i.e., ribosomal and tRNAs). We can save these sequences in a text file in your working directory.

We can append a nucleotide sequence to this vector that can be directly translated to AA sequences

library(rentrez)

# Download some COI sequences from NCBI for Epipedobates anthonyi[Organism] that correspond to the gene COI
COX1_nuc <- "Epipedobates anthonyi[Organism] AND COI[Gene]"
COX1_IDs <- entrez_search(db="nuccore", term= COX1_nuc)
COX1_nuc_fasta <- entrez_fetch(db="nuccore", id=COX1_IDs$ids, rettype="fasta")

## append to ND2 vector

more_nuc_fasta <- c(froggy_seqs_nuc_fasta,COX1_nuc_fasta)
cat(more_nuc_fasta)

We can save these sequences in a text file in your working directory.

# this is exclusive to your OWN COMPUTER change it accordingly
setwd("~/Desktop/Teach_R/A_Class_powerpoints/Spring_2024/session_12/working_with_proteins")
write(more_nuc_fasta, "Eanthonyi_ND2_nuc_seqs.fasta")

31.3 Importing AA sequences in R

We can import the amino acid sequences of from a file in FASTA format. We need to load the R-package Biostrings. The file can also be downloaded from the GitHub repository for this class (i.e., Eanthonyi_COI_prot_seqs.fasta and Eanthonyi_ND2_nuc_seqs.fasta).

## load biostring
library(Biostrings)

## we can import if we have sequences in FASTA

my_AA_Biostrings_set <- readAAStringSet(filepath = "~/Desktop/Teach_R/A_Class_powerpoints/Spring_2024/session_12/working_with_proteins/Eanthonyi_COI_prot_seqs.fasta", 
                                         format = "fasta")
my_AA_Biostrings_set
#AAStringSet object of length 2:
#    width seq                                                                                                                 names               
#[1]   513 VMITRWLFSTNHKDIGTLYLVFGAWAGMVGTALSLLIRAELSQPGSLLGDDQIYNV...SLISLVAVIIMMFIIWEAFSSKRLFLNAEMTPTNVEWLYGSPPPYHTFEEAVYSKV QQL01627.1 cytoch...
#[2]   219 TLYLVFGAWAGMVGTALSLLIRAELSQPGSLLGDDQIYNVIVTAHAFVMIFFMVMP...QTPLFVWSVLITAVLLLLSLPVLAAGITMLLTDRNLNTTFFDPAGGGDPILYQHLF ABG67395.1 cytoch...

31.4 Importing nucleotide sequences in R

We can import nucleotide sequences of from a file in FASTA or FASTQ formats. The last sequence format is similat to FASTA but include sequence (nucleotides) and its corresponding quality scores from the sequencing experiment for more information see here. The file can also be downloaded from the GitHub repository for this class (i.e., Eanthonyi_COI_prot_seqs.fasta and Eanthonyi_ND2_nuc_seqs.fasta).

my_nuc_Biostrings_set <- readDNAStringSet(filepath = "~/Desktop/Teach_R/A_Class_powerpoints/Spring_2024/session_12/working_with_proteins/Eanthonyi_ND2_nuc_seqs.fasta", 
                                         format = "fasta")
my_nuc_Biostrings_set
#DNAStringSet object of length 3:
#    width seq                                                                                                                 names               
#[1]  4886 AAGGTTTGGTCCTAGCCTTGAAATCAGTTATTATTTAATATACACATGCAAGTCTC...TCACTCTCTGCCTTCTTCCTATTTCCCCATTTTTTTTACTATTTATTAGAAACTTA HQ290995.1 Epiped...
#[2]  1539 GTGATAATTACCCGATGATTATTCTCCACAAACCATAAAGATATCGGAACCCTATA...TATATGGCTCCCCTCCTCCTTATCACACATTTGAGGAAGCTGTTTATTCTAAAGTA MW042033.1 Epiped...
#[3]   658 AACCCTATATTTAGTATTTGGGGCATGAGCCGGAATAGTAGGAACAGCCCTAAGCC...CCACTTTCTTTGACCCAGCAGGTGGAGGTGACCCCATTCTTTATCAACATCTCTTC DQ502853.1 Epiped...

31.5 Translating a nucleotides to AA

We need to translate nucleotide to AA sequences to further explore protein properties. For this purpose, we use the function translate() from Biostrings. Yet, you need to make sure that you know the genetic code. The most common are the “SGC0” for the standard genetic code or “SGC1” for the vertebrate mitochondrial code. In our example, the sequeces were derived from a mitochondrial genome of a frog (so, we should use “SGC1”). Notice that we need to define the genetic code with the function getGeneticCode().

## Vertebrate Mitochondrial code
SGC1 <- getGeneticCode("SGC1")
SGC1
#TTT TTC TTA TTG TCT TCC TCA TCG TAT TAC TAA TAG TGT TGC TGA TGG CTT CTC CTA CTG CCT CCC CCA CCG CAT CAC CAA CAG CGT CGC CGA CGG ATT ATC ATA ATG 
#"F" "F" "L" "L" "S" "S" "S" "S" "Y" "Y" "*" "*" "C" "C" "W" "W" "L" "L" "L" "L" "P" "P" "P" "P" "H" "H" "Q" "Q" "R" "R" "R" "R" "I" "I" "M" "M" 
#ACT ACC ACA ACG AAT AAC AAA AAG AGT AGC AGA AGG GTT GTC GTA GTG GCT GCC GCA GCG GAT GAC GAA GAG GGT GGC GGA GGG 
#"T" "T" "T" "T" "N" "N" "K" "K" "S" "S" "*" "*" "V" "V" "V" "V" "A" "A" "A" "A" "D" "D" "E" "E" "G" "G" "G" "G" 
#attr(,"alt_init_codons")
#[1] "ATT" "ATC" "GTG"

## Translating DNA/RNA:
my_aa_translated_Biostrings_set <- Biostrings::translate(my_nuc_Biostrings_set, genetic.code= SGC1, no.init.codon=FALSE)
my_aa_translated_Biostrings_set
#AAStringSet object of length 3:
#    width seq                                                                                                                 names               
#[1]  1628 KVWS*PWNQLLFNMHMQVSAPLWKRPYSPLQGAGAGI*HKFLPMTPSIATPT*DFS...FFIFA*HMLYP*H*HLTHPSPHHPEILTKKMIYLFLPSSLSAFFLFPHFFYYLLET HQ290995.1 Epiped...
#[2]   513 MMITRWLFSTNHKDIGTLYLVFGAWAGMVGTALSLLIRAELSQPGSLLGDDQIYNV...SLISLVAVIIMMFIIWEAFSSKRLFLNAEMTPTNVEWLYGSPPPYHTFEEAVYSKV MW042033.1 Epiped...
#[3]   219 NPMFSIWGMSRNS*NSPKPLNSS*IKSTWVLTGRWPDL*CNRNCPCFRHNLLHGYA...PNSFICLISSNHRSTSSSFSSCPSC*NYNAFNWSKP*HHFLWPS*W*WPHSLSTSL DQ502853.1 Epiped...

If we check each element in the AAstringset object we notice something particular

seqs_as_vector_1 <- as.character(unlist(my_aa_translated_Biostrings_set[1]))
seqs_as_vector_1
#[1] "KVWS*PWNQLLFNMHMQVSAPLWKRPYSPLQGAGAGI*HKFLPMTPSIATPT*DFSSD*HWA*AMAWLS*...

seqs_as_vector_2 <- as.character(unlist(my_aa_translated_Biostrings_set[2]))
seqs_as_vector_2
#[1] "MMITRWLFSTNHKDIGTLYLVFGAWAGMVGTALSLLIRAELSQPGSLLGDDQIYNVIVTAHAFVMIFFMVM...

seqs_as_vector_3 <- as.character(unlist(my_aa_translated_Biostrings_set[3]))
seqs_as_vector_3
#[1] "NPMFSIWGMSRNS*NSPKPLNSS*IKSTWVLTGRWPDL*CNRNCPCFRHNLLHGYAYSNWWIWKLTGTFNN...

Only sequence seqs_as_vector_2 or the second line has no * or stop codons. The other two: seqs_as_vector_1 and seqs_as_vector_3 have lots of * or stop codons. The reason is that seqs_as_vector_1 has ribosomal and tRNAs in the sequence and we have to isolate the coding regions (CDS). For seqs_as_vector_3 the reason is that the starting base is not the first codon position and we have to shift to the correct one to do the translation.

To fix the translation in seqs_as_vector_3 is easier, we need to indicate that start codon is in position .

## third sequence
nuc_seq_3 <- as.character(unlist(my_nuc_Biostrings_set[3]))

## remove first character
nuc_seq_3_red <- sub('.', '', nuc_seq_3)

## original versus removed
nuc_seq_3
#[1] "AACCCTATATTTAGTATTTGGGGCATGAGCCGGAATAGTAGGAACAGCCCTAAGCCTCTTAATTCGAGCAGA...
nuc_seq_3_red 
#[1] "ACCCTATATTTAGTATTTGGGGCATGAGCCGGAATAGTAGGAACAGCCCTAAGCCTCTTAATTCGAGCAGA...

## create a new string set and translate

SGC1 <- getGeneticCode("SGC1")
nuc_seq_3_red <- DNAStringSet(nuc_seq_3_red)
nuc_seq_3_red_translated <- Biostrings::translate(nuc_seq_3_red, genetic.code= SGC1, no.init.codon=FALSE)
nuc_seq_3_red_translated
#AAStringSet object of length 1:
#    width seq
#[1]   219 TLYLVFGAWAGMVGTALSLLIRAELSQPGSLLGDDQIYNVIVTAHAFVMIFFMVMPILIGGFGNWLV...NMKPPSLTQYQTPLFVWSVLITAVLLLLSLPVLAAGITMLLTDRNLNTTFFDPAGGGDPILYQHLF

# save as vector and replace in original AAStringSet
AA_seq_3 <- as.character(unlist(nuc_seq_3_red_translated))
AA_seq_3 
#[1] "TLYLVFGAWAGMVGTALSLLIRAELSQPGSLLGDDQIYNVIVTAHAFVMIFFMVMP....

my_aa_translated_Biostrings_set[3] <- AA_seq_3
my_aa_translated_Biostrings_set
#AAStringSet object of length 3:
#    width seq                                                                                                                 names               
#[1]  1628 KVWS*PWNQLLFNMHMQVSAPLWKRPYSPLQGAGAGI*HKFLPMTPSIATPT*DFS...FFIFA*HMLYP*H*HLTHPSPHHPEILTKKMIYLFLPSSLSAFFLFPHFFYYLLET HQ290995.1 Epiped...
#[2]   513 MMITRWLFSTNHKDIGTLYLVFGAWAGMVGTALSLLIRAELSQPGSLLGDDQIYNV...SLISLVAVIIMMFIIWEAFSSKRLFLNAEMTPTNVEWLYGSPPPYHTFEEAVYSKV MW042033.1 Epiped...
#[3]   219 TLYLVFGAWAGMVGTALSLLIRAELSQPGSLLGDDQIYNVIVTAHAFVMIFFMVMP...QTPLFVWSVLITAVLLLLSLPVLAAGITMLLTDRNLNTTFFDPAGGGDPILYQHLF DQ502853.1 Epiped...

We can repeat this process with the first sequence that also has stop codons.

## first sequence
nuc_seq_1 <- as.character(unlist(my_nuc_Biostrings_set[1]))

## remove first character
nuc_seq_1_red <- sub('.', '', nuc_seq_1)

## original
nuc_seq_1
#[1] "AAGGTTTGGTCCTAGCCTTGAAATCAGTTATTATTTAATATACACATGCAAGTCTCCGCACCCCTGTGAAA...

## removed
nuc_seq_1_red
#[1] "AGGTTTGGTCCTAGCCTTGAAATCAGTTATTATTTAATATACACATGCAAGTCTCCGCACCCCTGTGAAAA...

## create a new string set and translate
nuc_seq_1_red <- DNAStringSet(nuc_seq_1_red)
nuc_seq_1_red_translated <- Biostrings::translate(nuc_seq_1_red, genetic.code= SGC1, no.init.codon=FALSE)
nuc_seq_1_red_translated
#AAStringSet object of length 1:
#    width seq
#[1]  1628 *FGPSLEISYYLMYTCKSPHPCENALILPY*EQEPVSGTNFCP*HLASPHPQGTSAVINIEHKR*LDSVKLN*AGQSGASRRGYTTVAQIDF...INWLFTQMTYCS*NN*AKSHYIYFFYPYFLPSKPIFLSSPNMYYIPNTST*LILLPIILKF*QKK*FIYSCPPHSLPSSYFPIFFTIY*KL

## how many stop codons
nuc_seq_AA_1 <- as.character(unlist(nuc_seq_1_red_translated[1]))

## require stringr package
require(stringr)
total_stop_codons <-str_count(nuc_seq_AA_1, fixed("*"))
total_stop_codons_report <- paste0("Count for character stop codons -- * -- in nuc_seq_AA_1 is this number: ", total_stop_codons)
total_stop_codons_report
#[1] "Count for character stop codons -- * -- in nuc_seq_AA_1 is this number: 115"


## still bad and lots stops codons, remove another one base
nuc_seq_1_red2 <- sub('.', '', nuc_seq_1_red)

## original
nuc_seq_1
#[1] "AAGGTTTGGTCCTAGCCTTGAAATCAGTTATTATTTAATATACACATGCAAGTCTCCGCACCCCTGTGAAA...

## removed
nuc_seq_1_red2
#[1] "GGTTTGGTCCTAGCCTTGAAATCAGTTATTATTTAATATACACATGCAAGTCTCCGCACCCCTGTGAAAAC...
     
## create a new string set and translate
nuc_seq_1_red2 <- DNAStringSet(nuc_seq_1_red2)
nuc_seq_1_red2_translated <- Biostrings::translate(nuc_seq_1_red2, genetic.code= SGC1, no.init.codon=FALSE)
nuc_seq_1_red2_translated
#AAStringSet object of length 1:
#    width seq
#[1]  1628 GLVLALKSVII*YTHASLRTPVKTPLFSPTGS*SRYQAQIFAHNT*HRHTHKGLQQWLTLSMSDSLTQLN*TEPANLVPAAAVTPLWLKLIS...LTGFLPKWLIAQEMIKQNLTMFTFFILISSLLSLFFYLRLTYTMSLTLAPNSSFSPSSWNFNKKNNLSIPALLTLCLLPISPFFLLFI*NL

## how many stop codons
nuc_seq_AA_2 <- as.character(unlist(nuc_seq_1_red2_translated[1]))

## require stringr package
total_stop_codons2 <-str_count(nuc_seq_AA_2, fixed("*"))
total_stop_codons_report2 <- paste0("Count for character stop codons -- * -- in nuc_seq_AA_2 is this number: ", total_stop_codons2)
total_stop_codons_report2
#[1] "Count for character stop codons -- * -- in nuc_seq_AA_2 is this number: 94"


## still bad and lots stops codons, remove another one base
nuc_seq_1_red3 <- sub('.', '', nuc_seq_1_red2)

## original
nuc_seq_1
#[1] "AAGGTTTGGTCCTAGCCTTGAAATCAGTTATTATTTAATATACACATGCAAGTCTCCGCACCCCTGTGAAA...

## removed
nuc_seq_1_red3
#[1] "GTTTGGTCCTAGCCTTGAAATCAGTTATTATTTAATATACACATGCAAGTCTCCGCACCCCTGTGAAAACG...
         
## create a new string set and translate
nuc_seq_1_red3 <- DNAStringSet(nuc_seq_1_red3)
nuc_seq_1_red3_translated <- Biostrings::translate(nuc_seq_1_red3, genetic.code= SGC1, no.init.codon=FALSE)
nuc_seq_1_red3_translated
#AAStringSet object of length 1:
#    width seq
#[1]  1627 VWS*PWNQLLFNMHMQVSAPLWKRPYSPLQGAGAGI*HKFLPMTPSIATPT*DFSSD*HWA*AMAWLS*TKQSRPIWCQPPRLHHCGSNWFL...H*LAFYPNDLLLKK*LSKISLYLLFLSLFPPF*AYFFIFA*HMLYP*H*HLTHPSPHHPEILTKKMIYLFLPSSLSAFFLFPHFFYYLLET

## how many stop codons
nuc_seq_AA_3 <- as.character(unlist(nuc_seq_1_red3_translated[1]))

## require stringr package
total_stop_codons3 <-str_count(nuc_seq_AA_3, fixed("*"))
total_stop_codons_report3 <- paste0("Count for character stop codons -- * -- in nuc_seq_AA_3 is this number: ", total_stop_codons3)
total_stop_codons_report3
#[1] "Count for character stop codons -- * -- in nuc_seq_AA_3 is this number: 131"

## the best option seems to be nuc_seq_1_red2 with the translated nuc_seq_1_red2_translated

AA_seq_1 <- as.character(unlist(nuc_seq_1_red2_translated))
AA_seq_1 
#[1] "GLVLALKSVII*YTHASLRTPVKTPLFSPTGS*SRYQAQIFAHNT*HRHTHKGLQQW....

my_aa_translated_Biostrings_set[1] <- AA_seq_1
my_aa_translated_Biostrings_set
#AAStringSet object of length 3:
#    width seq                                                                                                                                                                   names               
#[1]  1628 GLVLALKSVII*YTHASLRTPVKTPLFSPTGS*SRYQAQIFAHNT*HRHTHKGLQQWLTLSMSDSLTQLN*TEPANLVPAA...AQEMIKQNLTMFTFFILISSLLSLFFYLRLTYTMSLTLAPNSSFSPSSWNFNKKNNLSIPALLTLCLLPISPFFLLFI*NL HQ290995.1 Epiped...
#[2]   513 MMITRWLFSTNHKDIGTLYLVFGAWAGMVGTALSLLIRAELSQPGSLLGDDQIYNVIVTAHAFVMIFFMVMPILIGGFGNW...LAGMPRRYSDYPDAYTLWNTVSSVGSLISLVAVIIMMFIIWEAFSSKRLFLNAEMTPTNVEWLYGSPPPYHTFEEAVYSKV MW042033.1 Epiped...
#[3]   219 TLYLVFGAWAGMVGTALSLLIRAELSQPGSLLGDDQIYNVIVTAHAFVMIFFMVMPILIGGFGNWLVPLMIGAPDMAFPRM...VSSILGAINFITTTLNMKPPSLTQYQTPLFVWSVLITAVLLLLSLPVLAAGITMLLTDRNLNTTFFDPAGGGDPILYQHLF DQ502853.1 Epiped...

31.6 Finding ORFs in a nucleotide sequence

In most instances, only segments of sequences contain information that can be translated to long string of AA without stop codons (e.g., functional proteins). Those segments are called Open Reading Frames (ORFs) as defined here and here which is the part of a reading frame that has the ability to be translated to protein (it has not stop codons). In general, you most likely want to find the longest ORFs in a sequence, yet not every ORF has the potential to be translated to a functional protein or be an exon of larger gene. Finding ORFs is crucial to discover novel genes, annotate unknown sequences of DNA, find exons among others. However, you can select to find a CoDing Sequence or CDS which is a continuous reading frame that do not need to begin or end with start or stop codons. These CDSs are defined as sequences of nucleotides (e.g., DNA or RNA) that determine a corresponding one od amino acids in a protein. You should consider that an ORFs and CDSs are related, yet ORFs are a continuous stretch of DNA codons that begins with a start codon and ends at a STOP codon.

There are many tools to find such ORFs or CDSs, we will use the R-packages systemPipeR and it has several vignettes for you to explore.

# install and load systemPipeR
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("systemPipeR")

# load library
library(systemPipeR)
library(rentrez)
setwd("~/Desktop/Teach_R/A_Class_powerpoints/Spring_2024/session_12/working_with_proteins")

# we will work with nuclear coding segment for gene like a hemoglobin of a frog
hb_nuc <- "BT081428"
hb_IDs <- entrez_search(db="nuccore", term= hb_nuc)
hb_nuc_fasta <- entrez_fetch(db="nuccore", id=hb_IDs$ids, rettype="fasta")
hb_nuc_fasta
#[1] ">BT081428.1 Rana catesbeiana clone rcat-evr-539-243 Hemoglobin subunit alpha-3 putative mRNA, 
# complete cds\nGGGGGGGACTTTACTTGAGTTGCCAAGCTAATCTAACGTGTACTACACCAACTCCCGATCATCACAATGA\nGTCTATCTGCA...
write(hb_nuc_fasta, "Rana_hb_nuc_seqs.fasta")


# define path to file with sequences in fasta

my_nuc_fasta_file_path <- "~/Desktop/Teach_R/A_Class_powerpoints/Spring_2024/session_12/working_with_proteins/Rana_hb_nuc_seqs.fasta"

# To store ORF sequences (you need indexed genome .fai file):
my_base_nuc_Biostrings_set <- readDNAStringSet(filepath = my_nuc_fasta_file_path, 
                                         format = "fasta")
my_base_nuc_Biostrings_set
#DNAStringSet object of length 1:
#    width seq                                                                                                                 names               
#[1]   677 GGGGGGGACTTTACTTGAGTTGCCAAGCTAATCTAACGTGTACTACACCAACTCCC...AAATTGTCCATCAAATATAAATCTGATGTACTCTTAAATAAAACTGAAAATAATTT BT081428.1 Rana c...


# get longest CDS with option mode = "CDS"
my_CDSs <- predORF(my_base_nuc_Biostrings_set, 
                      n = 1, 
                   type = "grl", 
                   mode = "CDS", 
                 strand = "sense", 
       longest_disjoint = FALSE,
             startcodon = "ATG",
              stopcodon = c("TAA", "TAG", "TGA")) 
my_CDSs
#GRangesList object of length 1:
#$`BT081428.1 Rana catesbeiana clone rcat-evr-539-243 Hemoglobin subunit alpha-3 putative mRNA, complete cds`
#GRanges object with 1 range and 2 metadata columns:
#                    seqnames    ranges strand | subject_id inframe2end
#                       <Rle> <IRanges>  <Rle> |  <integer>   <numeric>
#  [1] BT081428.1 Rana cate..    67-495      + |          1           3
#  -------
#  seqinfo: 1 sequence from an unspecified genome; no seqlengths

# get CDS sequences 
cds_seqs <- getSeq(my_base_nuc_Biostrings_set, my_CDSs)
my_CDS_long <- cds_seqs[[1]]
my_CDS_long
#DNAStringSet object of length 1:
#    width seq
#[1]   429 ATGAGTCTATCTGCAAGTGAGAAGGCTGCTGTGCTTTCCATAGTGGGAAAGATTGGATCTCAGGCCA...GTTCAGGCTGCCTGGGACAAATTCCTTGCTCTAGTTTCTGCAGTCCTCACCTCCAAGTACAGATAA

# name of fasta seq

name_seq <- names(cds_seqs)[1]
name_seq <- paste0(">Long_CDS_",name_seq,"\n")

# build vector to write

my_CDS_long_vec <- as.character(unlist(my_CDS_long))
final_my_CDS_long_vec <- paste0(name_seq,my_CDS_long_vec)
cat(final_my_CDS_long_vec)
#>Long_CDS_BT081428.1 Rana catesbeiana clone rcat-evr-539-243 Hemoglobin subunit alpha-3 putative mRNA, complete cds
#ATGAGTCTATCTGCAAGTGAGAAGGCTGCTGTGCTTTCCATAGTGGGAAAGATTGGATCTCAGGCCAGTGCTCTGGGCTCTGAGGCTTTGACCAGGCTTTTCCTTAGCTTCCCCCA...

# write fasta file

write(final_my_CDS_long_vec, "Rana_hb_CDS_seqs.fasta")

Another approach is with ORFik and it has several vignettes for you to explore.

# install and load ORFik
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("ORFik")

# load library ORFik

library(ORFik)

# define path to file with sequences in fasta

my_nuc_fasta_file_path <- "~/Desktop/Teach_R/my_working_directory/Rana_hb_nuc_seqs.fasta"

# identify ORFs in fasta file
orfs <- findORFsFasta(my_nuc_fasta_file_path)
orfs
#GRanges object with 31 ranges and 0 metadata columns:
#         seqnames    ranges strand
#            <Rle> <IRanges>  <Rle>
#   [1] BT081428.1    67-495      +
#   [2] BT081428.1     20-31      +
#   [3] BT081428.1    77-109      +
#   [4] BT081428.1   119-157      +
#   [5] BT081428.1   206-214      +
#   ...        ...       ...    ...
#  [27] BT081428.1   192-281      -
#  [28] BT081428.1   282-320      -
#  [29] BT081428.1   333-380      -
#  [30] BT081428.1   432-587      -
#  [31] BT081428.1   612-635      -
#  -------
#  seqinfo: 1 sequence from an unspecified genome; no seqlengths

# To store ORF sequences (you need indexed genome .fai file):
my_nuc_fasta <- FaFile(my_nuc_fasta_file_path)
orf_seqs <- getSeq(my_nuc_fasta, orfs)
orf_seqs
#DNAStringSet object of length 31:
#     width seq                                                                                                                names               
# [1]   429 ATGAGTCTATCTGCAAGTGAGAAGGCTGCTGTGCTTTCCATAGTGGGAAAGATTGG...CTGGGACAAATTCCTTGCTCTAGTTTCTGCAGTCCTCACCTCCAAGTACAGATAA BT081428.1
# [2]    12 TTGCCAAGCTAA                                                                                                       BT081428.1
# [3]    33 CTGCAAGTGAGAAGGCTGCTGTGCTTTCCATAG                                                                                  BT081428.1
# [4]    39 TTGGATCTCAGGCCAGTGCTCTGGGCTCTGAGGCTTTGA                                                                            BT081428.1
# [5]     9 TTGACCTGA                                                                                                          BT081428.1
# ...   ... ...
#[27]    90 TTGGCTGCACCTGCAAGAGCATTAATAATCTTTCCACCATGTTTGTTGAGGTCAGCAGAGCCAGGGGTCAGGTCAAAGTGGGGGAAATAA                         BT081428.1
#[28]    39 CTGAGGGAGGACAGGCTGCCAGCAAGGTCATCCAGATGA                                                                            BT081428.1
#[29]    48 ATGCGAGCCAGCAGAGGAAAGTTTCCTGGATCCACTCTTAGGTTGTAG                                                                   BT081428.1
#[30]   156 TTGGTTGTAACTGAAATGGGCCCAAAGAACACTTCATTGCACGTTGTATTACTTGC...CTTGGAGGTGAGGACTGCAGAAACTAGAGCAAGGAATTTGTCCCAGGCAGCCTGA BT081428.1
#[31]    24 TTGATGGACAATTTACACAAGTAA                                                                                           BT081428.1

# order by the longest to shortest ORF

orf_seqs_ordered <- orf_seqs[order(width(orf_seqs), decreasing = T),]
orf_seqs_ordered 
#DNAStringSet object of length 31:
#     width seq                                                                                                                names               
# [1]   429 ATGAGTCTATCTGCAAGTGAGAAGGCTGCTGTGCTTTCCATAGTGGGAAAGATTGG...CTGGGACAAATTCCTTGCTCTAGTTTCTGCAGTCCTCACCTCCAAGTACAGATAA BT081428.1
# [2]   258 ATGCAGGCCACTGAGGGAGGACAGGCTGCCAGCAAGGTCATCCAGATGATTGGCTG...GAGATCCAATCTTTCCCACTATGGAAAGCACAGCAGCCTTCTCACTTGCAGATAG BT081428.1
# [3]   156 TTGGTTGTAACTGAAATGGGCCCAAAGAACACTTCATTGCACGTTGTATTACTTGC...CTTGGAGGTGAGGACTGCAGAAACTAGAGCAAGGAATTTGTCCCAGGCAGCCTGA BT081428.1
# [4]   138 CTGCAGTCCTCACCTCCAAGTACAGATAAATCCACCCTGTATAACAGGCTACATCA...AATGAAGTGTTCTTTGGGCCCATTTCAGTTACAACCAAGTTGCAATGTTTTTTGA BT081428.1
# [5]   126 CTGGCTCTGCTGACCTCAACAAACATGGTGGAAAGATTATTAATGCTCTTGCAGGT...GATGACCTTGCTGGCAGCCTGTCCTCCCTCAGTGGCCTGCATGCCTACAACCTAA BT081428.1
# ...   ... ...
#[27]    12 TTGCCAAGCTAA                                                                                                       BT081428.1
#[28]     9 TTGACCTGA                                                                                                          BT081428.1
#[29]     9 TTGGTGTAG                                                                                                          BT081428.1
#[30]     9 CTGATGTAG                                                                                                          BT081428.1
#[31]     6 TTGTAA                                                                                                             BT081428.1

# name of fasta seq

name_seq_orfik <- names(orf_seqs_ordered)[1]
name_seq_orfik <- paste0(">Long_ORF_orfik_",name_seq_orfik,"\n")

# build vector to write

my_CDS_long_orfik_vec <- as.character(unlist(orf_seqs_ordered[1]))
final_my_CDS_long_orfik_vec <- paste0(name_seq_orfik,my_CDS_long_orfik_vec)
cat(final_my_CDS_long_orfik_vec)
#>Long_ORF_orfik_BT081428.1
#ATGAGTCTATCTGCAAGTGAGAAGGCTGCTGTGCTTTCCATAGTGGGAAAGATTGGATCTCAGGCCAGTGCTCTGGGCTCTGAGGCTTTGACCAGGCTTTTCCTTAGCTTCCCCCAGACCAAGACTTA...

# write fasta file

write(final_my_CDS_long_orfik_vec, "Rana_hb_ORF_orfik_seqs.fasta")

#tools:
#http://csbg.cnb.csic.es/PB/

31.7 Physochemical properties of an amino acid sequence

The physicochemical properties of amino acids are crucial to determine sequence profiles, folding and function. We can get some properties given a continuous amino acid sequence from the R package Peptides and it manual for you to explore particular properties.

# load or install relevant packages

install.packages("seqinr")
install.packages("Peptides")

require(seqinr)
require(Peptides)

## path to fasta files

hb_rana_path <- "~/Desktop/Teach_R/A_Class_powerpoints/Spring_2024/session_12/working_with_proteins/Rana_hb_CDS_seqs.fasta"

## read fasta and translate

nucleotide_hb_rana_fasta <- seqinr::read.fasta(file = hb_rana_path, seqtype = "AA")
nucleotide_hb_rana_fasta
#$Long_CDS_BT081428.1
#  [1] "A" "T" "G" "A" "G" "T" "C" "T" "A" "T" "C" "T" "G" "C" "A" "A" "G" "T" "G" "A" "G" "A" "A" "G" "G" "C" "T" "G" "C" "T" "G" "T" "G" "C" "T" "T" "T" "C" "C" "A" "T" "A" "G" "T" "G" "G" "G"
# [48] "A" "A" "A" "G" "A" "T" "T" "G" "G" "A" "T" "C" "T" "C" "A" "G" "G" "C" "C" "A" "G" "T" "G" "C" "T" "C" "T" "G" "G" "G" "C" "T" "C" "T" "G" "A" "G" "G" "C" "T" "T" "T" "G" "A" "C" "C" "A"
# [95] "G" "G" "C" "T" "T" "T" "T" "C" "C" "T" "T" "A" "G" "C" "T" "T" "C" "C" "C" "C" "C" "A" "G" "A" "C" "C" "A" "A" "G" "A" "C" "T" "T" "A" "T" "T" "T" "C" "C" "C" "C" "C" "A" "C" "T" "T" "T"
#[142] "G" "A" "C" "C" "T" "G" "A" "C" "C" "C" "C" "T" "G" "G" "C" "T" "C" "T" "G" "C" "T" "G" "A" "C" "C" "T" "C" "A" "A" "C" "A" "A" "A" "C" "A" "T" "G" "G" "T" "G" "G" "A" "A" "A" "G" "A" "T"
#[189] "T" "A" "T" "T" "A" "A" "T" "G" "C" "T" "C" "T" "T" "G" "C" "A" "G" "G" "T" "G" "C" "A" "G" "C" "C" "A" "A" "T" "C" "A" "T" "C" "T" "G" "G" "A" "T" "G" "A" "C" "C" "T" "T" "G" "C" "T" "G"
#[236] "G" "C" "A" "G" "C" "C" "T" "G" "T" "C" "C" "T" "C" "C" "C" "T" "C" "A" "G" "T" "G" "G" "C" "C" "T" "G" "C" "A" "T" "G" "C" "C" "T" "A" "C" "A" "A" "C" "C" "T" "A" "A" "G" "A" "G" "T" "G"
#[283] "G" "A" "T" "C" "C" "A" "G" "G" "A" "A" "A" "C" "T" "T" "T" "C" "C" "T" "C" "T" "G" "C" "T" "G" "G" "C" "T" "C" "G" "C" "A" "T" "T" "A" "T" "C" "C" "A" "G" "G" "T" "G" "G" "T" "C" "T" "T"
#[330] "G" "G" "C" "T" "A" "C" "T" "C" "A" "C" "T" "T" "C" "C" "C" "T" "G" "G" "T" "G" "A" "T" "T" "T" "C" "A" "C" "T" "G" "C" "T" "G" "A" "G" "G" "T" "T" "C" "A" "G" "G" "C" "T" "G" "C" "C" "T"
#[377] "G" "G" "G" "A" "C" "A" "A" "A" "T" "T" "C" "C" "T" "T" "G" "C" "T" "C" "T" "A" "G" "T" "T" "T" "C" "T" "G" "C" "A" "G" "T" "C" "C" "T" "C" "A" "C" "C" "T" "C" "C" "A" "A" "G" "T" "A" "C"
#[424] "A" "G" "A" "T" "A" "A"
#attr(,"name")
#[1] "Long_CDS_BT081428.1"
#attr(,"Annot")
#[1] ">Long_CDS_BT081428.1 Rana catesbeiana clone rcat-evr-539-243 Hemoglobin subunit alpha-3 putative mRNA, complete cds"
#attr(,"class")
#[1] "SeqFastaAA"

## some squence manipulation

nucleotide_hb_rana_fasta_vec <- paste0(unlist(nucleotide_hb_rana_fasta), collapse ="")
nucleotide_hb_rana_fasta_vec <- tolower(nucleotide_hb_rana_fasta_vec)
nucleotide_hb_rana_fasta_vec <- seqinr::s2c(nucleotide_hb_rana_fasta_vec)
nucleotide_hb_rana_fasta_vec
#  [1] "a" "t" "g" "a" "g" "t" "c" "t" "a" "t" "c" "t" "g" "c" "a" "a" "g" "t" "g" "a" "g" "a" "a" "g" "g" "c" "t" "g" "c" "t" "g" "t" "g" "c" "t" "t" "t" "c" "c" "a" "t" "a" "g" "t" "g" "g" "g"
# [48] "a" "a" "a" "g" "a" "t" "t" "g" "g" "a" "t" "c" "t" "c" "a" "g" "g" "c" "c" "a" "g" "t" "g" "c" "t" "c" "t" "g" "g" "g" "c" "t" "c" "t" "g" "a" "g" "g" "c" "t" "t" "t" "g" "a" "c" "c" "a"
# [95] "g" "g" "c" "t" "t" "t" "t" "c" "c" "t" "t" "a" "g" "c" "t" "t" "c" "c" "c" "c" "c" "a" "g" "a" "c" "c" "a" "a" "g" "a" "c" "t" "t" "a" "t" "t" "t" "c" "c" "c" "c" "c" "a" "c" "t" "t" "t"
#[142] "g" "a" "c" "c" "t" "g" "a" "c" "c" "c" "c" "t" "g" "g" "c" "t" "c" "t" "g" "c" "t" "g" "a" "c" "c" "t" "c" "a" "a" "c" "a" "a" "a" "c" "a" "t" "g" "g" "t" "g" "g" "a" "a" "a" "g" "a" "t"
#[189] "t" "a" "t" "t" "a" "a" "t" "g" "c" "t" "c" "t" "t" "g" "c" "a" "g" "g" "t" "g" "c" "a" "g" "c" "c" "a" "a" "t" "c" "a" "t" "c" "t" "g" "g" "a" "t" "g" "a" "c" "c" "t" "t" "g" "c" "t" "g"
#[236] "g" "c" "a" "g" "c" "c" "t" "g" "t" "c" "c" "t" "c" "c" "c" "t" "c" "a" "g" "t" "g" "g" "c" "c" "t" "g" "c" "a" "t" "g" "c" "c" "t" "a" "c" "a" "a" "c" "c" "t" "a" "a" "g" "a" "g" "t" "g"
#[283] "g" "a" "t" "c" "c" "a" "g" "g" "a" "a" "a" "c" "t" "t" "t" "c" "c" "t" "c" "t" "g" "c" "t" "g" "g" "c" "t" "c" "g" "c" "a" "t" "t" "a" "t" "c" "c" "a" "g" "g" "t" "g" "g" "t" "c" "t" "t"
#[330] "g" "g" "c" "t" "a" "c" "t" "c" "a" "c" "t" "t" "c" "c" "c" "t" "g" "g" "t" "g" "a" "t" "t" "t" "c" "a" "c" "t" "g" "c" "t" "g" "a" "g" "g" "t" "t" "c" "a" "g" "g" "c" "t" "g" "c" "c" "t"
#[377] "g" "g" "g" "a" "c" "a" "a" "a" "t" "t" "c" "c" "t" "t" "g" "c" "t" "c" "t" "a" "g" "t" "t" "t" "c" "t" "g" "c" "a" "g" "t" "c" "c" "t" "c" "a" "c" "c" "t" "c" "c" "a" "a" "g" "t" "a" "c"
#[424] "a" "g" "a" "t" "a" "a"

## tranlation is in the defaul nucleotide codon code: numcode = 1

AA_hb_rana_fasta <- seqinr::translate(seq = nucleotide_hb_rana_fasta_vec, frame = 0, sens = "F", numcode = 1, NAstring = "X", ambiguous = FALSE)
AA_hb_rana_fasta
#  [1] "M" "S" "L" "S" "A" "S" "E" "K" "A" "A" "V" "L" "S" "I" "V" "G" "K" "I" "G" "S" "Q" "A" "S" "A" "L" "G" "S" "E" "A" "L" "T" "R" "L" "F" "L" "S" "F" "P" "Q" "T" "K" "T" "Y" "F" "P" "H" "F"
# [48] "D" "L" "T" "P" "G" "S" "A" "D" "L" "N" "K" "H" "G" "G" "K" "I" "I" "N" "A" "L" "A" "G" "A" "A" "N" "H" "L" "D" "D" "L" "A" "G" "S" "L" "S" "S" "L" "S" "G" "L" "H" "A" "Y" "N" "L" "R" "V"
# [95] "D" "P" "G" "N" "F" "P" "L" "L" "A" "R" "I" "I" "Q" "V" "V" "L" "A" "T" "H" "F" "P" "G" "D" "F" "T" "A" "E" "V" "Q" "A" "A" "W" "D" "K" "F" "L" "A" "L" "V" "S" "A" "V" "L" "T" "S" "K" "Y"
#[142] "R" "*"

AA_hb_rana_fasta_string <- seqinr::c2s(AA_hb_rana_fasta)
AA_hb_rana_fasta_string
#[1] "MSLSASEKAAVLSIVGKIGSQASALGSEALTRLFLSFPQTKTYFPHFDLTPGSADLNKHGGKIINALAGAANHLDDLAGSLSSLSGLHAYNLRVDPGNFPLLARIIQVVLATHFPGDFTAEVQAAWDKFLALVSAVLTSKYR*"

## remove stop codons
AA_hb_rana_fasta_string_not_stop <- gsub("*","",AA_hb_rana_fasta_string, fixed = TRUE)
AA_hb_rana_fasta_string_not_stop
#[1] "MSLSASEKAAVLSIVGKIGSQASALGSEALTRLFLSFPQTKTYFPHFDLTPGSADLNKHGGKIINALAGAANHLDDLAGSLSSLSGLHAYNLRVDPGNFPLLARIIQVVLATHFPGDFTAEVQAAWDKFLALVSAVLTSKYR"

## 'Peptides' properties

Peptides::boman(AA_hb_rana_fasta_string_not_stop)
#[1] 0.618662
Peptides::charge(AA_hb_rana_fasta_string_not_stop, pH= 7.8, pKscale = "EMBOSS")
#[1] 1.081336
Peptides::charge(AA_hb_rana_fasta_string_not_stop, pH= 7, pKscale = "EMBOSS")
#[1] 2.182985
Peptides::hydrophobicity(AA_hb_rana_fasta_string_not_stop,"KyteDoolittle")
#[1] 0.2619718
Peptides::pI(AA_hb_rana_fasta_string_not_stop,"EMBOSS")
#[1] 8.974793

For the COX1 subunit that already was encoded as an amino acid sequence.

# load relevant packages

require(seqinr)
require(Peptides)

## path to fasta files

COI_Eanthonyi_path <- "~/Desktop/Teach_R/A_Class_powerpoints/Spring_2024/session_12/working_with_proteins/Eanthonyi_COI_prot_seqs.fasta"

## read fasta and translate only the first sequence

AA_COI_Eanthonyi_fasta <- seqinr::read.fasta(file = COI_Eanthonyi_path, seqtype = "AA", as.string = TRUE)
AA_COI_Eanthonyi_fasta_use <- AA_COI_Eanthonyi_fasta[1]
AA_COI_Eanthonyi_fasta_use
#$QQL01627.1
#[1] "VMITRWLFSTNHKDIGTLYLVFGAWAGMVGTALSLLIRAELSQPGSLLGDDQIYNVIVTAHAFVMIFFMVMPILIGGFGNWLVPLMIGAPDMAFPRMNNMSFWLLPPSFLLLLASAGVEAGAGTGWTVYPPLAGNLAHAGPSVDLTIFSLHLAGVSSILGAINFITTTLNMKPPSLTQYQTPLFVWSVLITAVLLLLSLPVLAAGITMLLTDRNLNTTFFDPAGGGDPILYQHLFWFFGHPEVYILILPGFGIISHVVTFYSSKKEPFGYMGMVWAMMSIGLLGFIVWAHHMFTTDLNVDTRAYFTSATMIIAIPTGVKVFSWLATMHGGIIKWDAAMLWALGFIFLFTVGGLTGIVLANSSLDIVLHDTYYVVAHFHYVLSMGAVFAIMAGFVHWFPLFTGFTLHEAWTKIHFGVMFAGVNLTFFPQHFLGLAGMPRRYSDYPDAYTLWNTVSSVGSLISLVAVIIMMFIIWEAFSSKRLFLNAEMTPTNVEWLYGSPPPYHTFEEAVYSKV"
#attr(,"name")
#[1] "QQL01627.1"
#attr(,"Annot")
#[1] ">QQL01627.1 cytochrome oxidase subunit I, partial (mitochondrion) [Epipedobates anthonyi]"
#attr(,"class")
#[1] "SeqFastaAA"

## some squence manipulation

A_COI_Eanthonyi_fasta_vec <- seqinr::c2s(AA_COI_Eanthonyi_fasta_use)
A_COI_Eanthonyi_fasta_vec
# [1] "VMITRWLFSTNHKDIGTLYLVFGAWAGMVGTALSLLIRAELSQPGSLLGDDQIYNVIVTAHAFVMIFFMVMPILIGGFGNWLVPLMIGA...

## remove stop codons
A_COI_Eanthonyi_fasta_vec_not_stop <- gsub("*","",A_COI_Eanthonyi_fasta_vec, fixed = TRUE)
A_COI_Eanthonyi_fasta_vec_not_stop
# [1] "VMITRWLFSTNHKDIGTLYLVFGAWAGMVGTALSLLIRAELSQPGSLLGDDQIYNVIVTAHAFVMIFFMVMPILIGGFGNWLVPLMIGA...

## 'Peptides' properties
Peptides::boman(A_COI_Eanthonyi_fasta_vec_not_stop)
#[1] -0.4054776
Peptides::charge(A_COI_Eanthonyi_fasta_vec_not_stop, pH= 7.8, pKscale = "EMBOSS")
#[1] -7.372694
Peptides::charge(A_COI_Eanthonyi_fasta_vec_not_stop, pH= 7, pKscale = "EMBOSS")
#[1] -3.690809
Peptides::hydrophobicity(A_COI_Eanthonyi_fasta_vec_not_stop,"KyteDoolittle")
#[1] 0.7395712
Peptides::pI(A_COI_Eanthonyi_fasta_vec_not_stop,"EMBOSS")
#[1] 6.601388