Session 4 – Text Editing and Data Transformations

4.1 Working with character strings

1) A character vector contains text (e.g., nucleotide or amino acid sequence) that is represented by a sequence of characters (e.g., A, T, C, G, U, Met). A character vector can also contain numbers that will be considered as text (i.e., not quantities). For example, we will download sequence of a COX1 mitochondrial gene of a charismatic poison frog Ameerega bilinguis.

We will follow the example from previous sessions using the R-package rentrez.

library(rentrez)
# if you need to install R-package 'rentrez', use install.packages("rentrez")

# Download some nucleotide sequences from NCBI for Ameerega bilinguis[Organism] that correspond to the gene COI
froggy_name_COX1 <- "Ameerega bilinguis[Organism] AND COI[Gene]"
froggy_seq_IDs <- entrez_search(db="nuccore", term= froggy_name_COX1)
#Entrez search result with 1 hits (object contains 1 IDs and no web_history object)
# Search term (as translated):  "Ameerega bilinguis"[Organism] AND COI[Gene] 

# revising the structure of 'froggy_seq_IDs' that there are 1 sequence in NCBI nuccore database
str(froggy_seq_IDs)
#List of 5
# $ ids             : chr "1952638358"
# $ count           : int 1
# $ retmax          : int 1
# $ QueryTranslation: chr "\"Ameerega bilinguis\"[Organism] AND COI[Gene]"
# $ file            :Classes 'XMLInternalDocument', 'XMLAbstractDocument' <externalptr> 
# - attr(*, "class")= chr [1:2] "esearch" "list"
froggy_seqs_fasta <- entrez_fetch(db="nuccore", id=froggy_seq_IDs$ids, rettype="fasta")
froggy_seqs_fasta
#[1] ">MW042030.1 Ameerega bilinguis voucher QCAZ28835 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial\nGTGATAATTACTCGATGATTATTTTCTACCAACCACAAAGACATCGGAACTTTATACCTAGTGTTTGGGG\nCATGAGCAGGCATAGTCGGCACTGCTCTTAGCCTTTTAATTCGAGCCGAATTAAGCCAGCCCGGGTCCTT\nACTAGGCGATGACCAGATCTACAACGTTATTGTTACCGCCCATGCTTTCGTTATAATCTTTTTTATAGTA\nATGCCAATTCTAATCGGTGGCTTTGGGAATTGATTAGTGCCCCTAATAATTGGAGCCCCAGACATAGCTT\nTTCCCCGAATAAACAATATGAGCTTTTGGCTTCTTCCCCCCTCTTTCCTACTACTCCTAGCATCCGCAGG\nCGTTGAAGCAGGCGCCGGTACTGGCTGAACTGTGTACCCTCCCCTTGCAGGCAACCTAGCTCATGCTGGC\nCCATCAGTTGATTTAACTATTTTTTCACTTCATCTCGCCGGTGTTTCTTCTATTCTAGGGGCAATTAACT\nTTATTACAACAACCTTAAACATAAAACCCCCTTCATTAACACAATATCAAACCCCATTATTTGTCTGATC\nTGTATTAATTACTGCAGTCCTTCTTCTTCTCTCCCTCCCAGTTCTGGCTGCCGGAATCACTATACTCTTG\nACTGACCGAAACCTAAACACCACCTTCTTTGACCCAGCAGGTGGAGGCGACCCTGTCCTGTACCAACACC\nTGTTCTGATTCTTTGGTCACCCCGAAGTCTACATCCTTATCCTGCCTGGATTTGGTATCATCTCCCATGT\nTGTCACATTCTACTCTAGCAAAAAAGAACCCTTCGGCTATATAGGAATAGTCTGAGCTATAATATCGATT\nGGTCTCCTAGGTTTCATTGTTTGAGCTCACCACATATTCACAACAGACCTTAATGTAGACACTCGAGCCT\nACTTTACCTCAGCTACTATAATCATCGCTATCCCAACAGGTGTCAAAGTCTTTAGCTGACTTGCCACCAT\nGCACGGAGGAATTATTAAATGAGACGCCGCCATATTATGGGCTCTCGGATTCATCTTTTTATTTACAGTT\nGGAGGACTAACTGGAATCGTTTTAGCCAACTCCTCTTTAGACATTGTTTTGCATGATACATATTATGTAG\nTAGCCCACTTTCACTACGTTCTTTCTATGGGGGCAGTATTTGCCATTATAGCCGGCTTCGTACACTGATT\nTCCTCTCTTTTCCGGATTTACCCTTCATGAAGCCTGAACAAAAATTCAATTTGGCGTCATATTTACCGGC\nGTAAATTTAACATTCTTCCCCCAGCATTTCTTAGGTCTCGCAGGCATGCCTCGACGTTATTCAGACTACC\nCTGACGCCTACACATTATGAAACACCGTTTCATCAATCGGCTCTTTAATCTCTCTAGTTGCAGTAATCAT\nTATGATGTTTATCATTTGAGAAGCTTTCTCTTCCAAACGCCTACCTCTACCTGCAGAAATAACCCCAACT\nAATGTAGAATGATTATACGGATCCCCCCCACCTTACCACACTTTTGAGGAAGCCGTTTACTCCAAAATT\n\n"

The froggy_seqs_fasta object is a character vector as indicated by the sequence of characters (i.e., letters and numbers) contained between quotes " ". Likewise, we can identify that this vector contains a sequence in a fasta format. We will dissect this character vector.

# 1) A fasta header starting with '>', the NCBI accession number 'MW042030.1'
# >MW042030.1
# 2) A defition as is present in the GenBank format for this sequence 'Ameerega bilinguis voucher QCAZ28835 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial'
# Ameerega bilinguis voucher QCAZ28835 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
# 3) A new line coding indicated by '\n'
# \n
# 4) The actual nucleotide sequence in capital letters with some new line characters '\n' interspersed in the sequence
# GTGATAATTACTCGATGATTATTTTCTACCAACCACAAAGACATCGGAACTTTATACCTAGTGTTTGGGG\nCATGAGCAGGCATAGTCGGCACTGCTCTTAGCCTTTTAATTCGAGCCGAATTAAGCCAGCCCGGGTCCTT\nACTAGGCGATGACCAGATCTACAACGTTATTGTTACCGCCCATGCTTTCGTTATAATCTTTTTTATAGTA\nATGCCAATTCTAATCGGTGGCTTTGGGAATTGATTAGTGCCCCTAATAATTGGAGCCCCAGACATAGCTT\nTTCCCCGAATAAACAATATGAGCTTTTGGCTTCTTCCCCCCTCTTTCCTACTACTCCTAGCATCCGCAGG\nCGTTGAAGCAGGCGCCGGTACTGGCTGAACTGTGTACCCTCCCCTTGCAGGCAACCTAGCTCATGCTGGC\nCCATCAGTTGATTTAACTATTTTTTCACTTCATCTCGCCGGTGTTTCTTCTATTCTAGGGGCAATTAACT\nTTATTACAACAACCTTAAACATAAAACCCCCTTCATTAACACAATATCAAACCCCATTATTTGTCTGATC\nTGTATTAATTACTGCAGTCCTTCTTCTTCTCTCCCTCCCAGTTCTGGCTGCCGGAATCACTATACTCTTG\nACTGACCGAAACCTAAACACCACCTTCTTTGACCCAGCAGGTGGAGGCGACCCTGTCCTGTACCAACACC\nTGTTCTGATTCTTTGGTCACCCCGAAGTCTACATCCTTATCCTGCCTGGATTTGGTATCATCTCCCATGT\nTGTCACATTCTACTCTAGCAAAAAAGAACCCTTCGGCTATATAGGAATAGTCTGAGCTATAATATCGATT\nGGTCTCCTAGGTTTCATTGTTTGAGCTCACCACATATTCACAACAGACCTTAATGTAGACACTCGAGCCT\nACTTTACCTCAGCTACTATAATCATCGCTATCCCAACAGGTGTCAAAGTCTTTAGCTGACTTGCCACCAT\nGCACGGAGGAATTATTAAATGAGACGCCGCCATATTATGGGCTCTCGGATTCATCTTTTTATTTACAGTT\nGGAGGACTAACTGGAATCGTTTTAGCCAACTCCTCTTTAGACATTGTTTTGCATGATACATATTATGTAG\nTAGCCCACTTTCACTACGTTCTTTCTATGGGGGCAGTATTTGCCATTATAGCCGGCTTCGTACACTGATT\nTCCTCTCTTTTCCGGATTTACCCTTCATGAAGCCTGAACAAAAATTCAATTTGGCGTCATATTTACCGGC\nGTAAATTTAACATTCTTCCCCCAGCATTTCTTAGGTCTCGCAGGCATGCCTCGACGTTATTCAGACTACC\nCTGACGCCTACACATTATGAAACACCGTTTCATCAATCGGCTCTTTAATCTCTCTAGTTGCAGTAATCAT\nTATGATGTTTATCATTTGAGAAGCTTTCTCTTCCAAACGCCTACCTCTACCTGCAGAAATAACCCCAACT\nAATGTAGAATGATTATACGGATCCCCCCCACCTTACCACACTTTTGAGGAAGCCGTTTACTCCAAAATT
# 5) Final a couple of new line '\n' at the end of the vector
# \n\n

2) We can explore the character vector froggy_seqs_fasta with several functions.

We can check the length of this vector (i.e., how many elements this vector contains) with the function length(). It should just 1 element, the COX1 of Ameerega bilinguis.

length(froggy_seqs_fasta)
#[1] 1

We can explore other properties of this vector using the functions class() and str().

class(froggy_seqs_fasta)
#[1] "character"
str(froggy_seqs_fasta)
#chr ">MW042030.1 Ameerega bilinguis voucher QCAZ28835 cytochrome oxidase subunit I (COI) gene, partial cds; mitochon"| __truncated__

We can also test if this vector is character vector using the function is.character(). The result is a logical TRUE or FALSE. We expect to be TRUE.

is.character(froggy_seqs_fasta) 
#[1] TRUE

We can also determine the actual number characters in this vector (i.e., the total number of letters, symbols, spaces, numbers) using the function nchar().

nchar(froggy_seqs_fasta)
#[1] 1679

3) We see that the vector froggy_seqs_fasta has other text than the actual nucleotide sequence. We will extract this sequence using a combination of functions that work with character vectors.

First, we can split this vector by the new line ‘’ using the function strsplit() and test what type of data structure we obtain using class().

froggy_seq_split <- strsplit(froggy_seqs_fasta, split = '\n')
froggy_seq_split
#[[1]]
# [1] ">MW042030.1 Ameerega bilinguis voucher QCAZ28835 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial"
# [2] "GTGATAATTACTCGATGATTATTTTCTACCAACCACAAAGACATCGGAACTTTATACCTAGTGTTTGGGG"                                              
# [3] "CATGAGCAGGCATAGTCGGCACTGCTCTTAGCCTTTTAATTCGAGCCGAATTAAGCCAGCCCGGGTCCTT"                                              
# [4] "ACTAGGCGATGACCAGATCTACAACGTTATTGTTACCGCCCATGCTTTCGTTATAATCTTTTTTATAGTA"                                              
# [5] "ATGCCAATTCTAATCGGTGGCTTTGGGAATTGATTAGTGCCCCTAATAATTGGAGCCCCAGACATAGCTT"                                              
# [6] "TTCCCCGAATAAACAATATGAGCTTTTGGCTTCTTCCCCCCTCTTTCCTACTACTCCTAGCATCCGCAGG"                                              
# [7] "CGTTGAAGCAGGCGCCGGTACTGGCTGAACTGTGTACCCTCCCCTTGCAGGCAACCTAGCTCATGCTGGC"                                              
# [8] "CCATCAGTTGATTTAACTATTTTTTCACTTCATCTCGCCGGTGTTTCTTCTATTCTAGGGGCAATTAACT"                                              
# [9] "TTATTACAACAACCTTAAACATAAAACCCCCTTCATTAACACAATATCAAACCCCATTATTTGTCTGATC"                                              
#[10] "TGTATTAATTACTGCAGTCCTTCTTCTTCTCTCCCTCCCAGTTCTGGCTGCCGGAATCACTATACTCTTG"                                              
#[11] "ACTGACCGAAACCTAAACACCACCTTCTTTGACCCAGCAGGTGGAGGCGACCCTGTCCTGTACCAACACC"                                              
#[12] "TGTTCTGATTCTTTGGTCACCCCGAAGTCTACATCCTTATCCTGCCTGGATTTGGTATCATCTCCCATGT"                                              
#[13] "TGTCACATTCTACTCTAGCAAAAAAGAACCCTTCGGCTATATAGGAATAGTCTGAGCTATAATATCGATT"                                              
#[14] "GGTCTCCTAGGTTTCATTGTTTGAGCTCACCACATATTCACAACAGACCTTAATGTAGACACTCGAGCCT"                                              
#[15] "ACTTTACCTCAGCTACTATAATCATCGCTATCCCAACAGGTGTCAAAGTCTTTAGCTGACTTGCCACCAT"                                              
#[16] "GCACGGAGGAATTATTAAATGAGACGCCGCCATATTATGGGCTCTCGGATTCATCTTTTTATTTACAGTT"                                              
#[17] "GGAGGACTAACTGGAATCGTTTTAGCCAACTCCTCTTTAGACATTGTTTTGCATGATACATATTATGTAG"                                              
#[18] "TAGCCCACTTTCACTACGTTCTTTCTATGGGGGCAGTATTTGCCATTATAGCCGGCTTCGTACACTGATT"                                              
#[19] "TCCTCTCTTTTCCGGATTTACCCTTCATGAAGCCTGAACAAAAATTCAATTTGGCGTCATATTTACCGGC"                                              
#[20] "GTAAATTTAACATTCTTCCCCCAGCATTTCTTAGGTCTCGCAGGCATGCCTCGACGTTATTCAGACTACC"                                              
#[21] "CTGACGCCTACACATTATGAAACACCGTTTCATCAATCGGCTCTTTAATCTCTCTAGTTGCAGTAATCAT"                                              
#[22] "TATGATGTTTATCATTTGAGAAGCTTTCTCTTCCAAACGCCTACCTCTACCTGCAGAAATAACCCCAACT"                                              
#[23] "AATGTAGAATGATTATACGGATCCCCCCCACCTTACCACACTTTTGAGGAAGCCGTTTACTCCAAAATT"                                               
#[24] ""                                                                                                                    
class(froggy_seq_split)
#[1] "list"

As you can see, the object is a list and its first element is a vector with 24 elements. The sequence of COX1 is contained in 22 elements with indices between [2] and [23]. We can paste all of these in a single character vector.

For this purpose, we will use the function paste() or its variant paste0(). Notice how I indexed the elements that I want it to be collapsed in a single character vector using the argument collapse = "" (i.e., "" indicates nothing). The [[1]] indicates that we want the first element of a list and the [2:23] indicates that we want the vector elements starting from [2] to [23] (i.e., 2, 3, 4, 5, …, 23).

froggy_one_fasta <- paste(froggy_seq_split[[1]][2:23], collapse = "")
froggy_one_fasta
#[1] "GTGATAATTACTCGATGATTATTTTCTACCAACCACAAAGACATCGGAACTTTATACCTAGTGTTTGGGGCATGAGCAGGCATAGTCGGCACTGCTCTTAGCCTTTTAATTCGAG..."

We can also name this vector with the corresponding accession number using the function names().

names(froggy_one_fasta) <- froggy_seq_split[[1]][1]
froggy_one_fasta
# >MW042030.1 Ameerega bilinguis voucher QCAZ28835 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial 
#"GTGATAATTACTCGATGATTATTTTCTACCAACCACAAAGACATCGGAACTTTATACCTAGTGTTTGGGGCATGAGCAGGCATAGTCGGCACTGCTCTTAGCCTTTTAATTCGAGCCGAATT..."

4) We can make this vector more interesting by adding more COX1 sequences from three species with a similar procedure.

library(rentrez)
# if you need to install R-package 'rentrez', use install.packages("rentrez")

# Download more COX1 nucleotide sequences from NCBI for 3 more species (i.e., three terms)
several_sepecies_COX1 <- c("Ameerega parvula[Organism] AND COI[Gene]", 
                           "Epipedobates anthonyi[Organism] AND COI[Gene]",
                           "Oophaga sylvatica[Organism] AND COI[Gene]")

We can only retrieve one term a time, we will use a loop to iterate between these three terms and save in an empty list my_collected_sequences. More on loops later.

# we create an empty list to collect processed sequences

my_collected_sequences <- list()

# the loop to retrieve sequences

for(i in 1:length(several_sepecies_COX1)) {
                             one_seq_IDs <- entrez_search(db="nuccore", term= several_sepecies_COX1[i])
                          one_seqs_fasta <- entrez_fetch(db="nuccore", id=one_seq_IDs$ids, rettype="fasta")
             my_collected_sequences[[i]] <- one_seqs_fasta
                                          }

# we have a list 'my_collected_sequences' of 3 elements; one per species.
length(my_collected_sequences)
#[1] 3

5) We need to process these retrieved sequences to the format that only contains the nucleotides and the GenBank accession numbers. We will start with the my_collected_sequences[[2]] that has two concatenated sequences and the create a function to process this automatically.

Notice that my_collected_sequences[[2]] has two fasta sequences named MW042033.1 and DQ502853.1.

my_collected_sequences[[2]]
#[1] ">MW042033.1 Epipedobates anthonyi voucher QCAZ16597 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial\nGTGATAA
#TTACCCGATGATTATTCTCCACAAACCATAAAGATATCGGAACCCTATATTTAGTATTTGGGG\nCATGAGCCGGAATAGTAGGAACAGCCCTAAGCCTCTTAATTCGAGCAGAATTAAGTCAACCTGGGTCC
#TT\nACTGGGCGATGACCAGATTTATAATGTAATCGTAACTGCCCATGCTTTCGTCATAATCTTCTTCATGGTT\nATGCCTATTCTAATTGGTGGATTTGGAAATTGACTGGTACCTTTAATAATTGGAGCC
#CCAGACATAGCCT\nTTCCCCGAATAAATAATATAAGTTTTTGACTTCTTCCGCCCTCATTCCTTCTTCTCTTAGCATCTGCAGG\nAGTAGAAGCAGGAGCTGGGACAGGTTGAACTGTTTATCCTCCTCTT
#GCTGGTAACCTAGCTCATGCAGGC\nCCATCAGTTGATTTAACCATTTTTTCACTCCATCTTGCAGGGGTCTCTTCAATTCTAGGAGCAATTAATT\nTTATTACAACCACATTAAACATAAAACCTCCTTCT
#TTAACTCAATACCAAACTCCTTTATTTGTCTGATC\nAGTTCTAATCACCGCAGTACTTCTTCTTCTTTCTCTTCCTGTCCTAGCTGCAGGAATTACAATGCTTTTA\nACTGATCGAAACCTTAACACCACT
#TTCTTTGACCCAGCAGGTGGAGGTGACCCCATTCTTTATCAACATC\nTCTTCTGATTTTTTGGACACCCAGAAGTTTATATTCTTATTCTCCCAGGGTTTGGAATTATCTCTCATGT\nTGTAACTTTTTAT
#TCTAGCAAAAAAGAACCTTTTGGATATATAGGAATGGTCTGAGCTATAATATCAATT\nGGCCTCTTAGGTTTTATTGTTTGAGCTCACCATATATTTACTACTGATCTTAACGTTGACACTCGAGCTT\nAC
#TTTACCTCAGCTACTATAATTATTGCCATCCCAACAGGCGTAAAAGTCTTTAGCTGACTTGCCACCAT\nGCATGGAGGAATTATTAAATGAGATGCTGCTATACTTTGAGCCTTAGGCTTTATTTTCTTGTT
#TACTGTT\nGGTGGACTAACTGGTATCGTTTTAGCTAATTCCTCTTTGGATATTGTTCTTCATGATACCTACTACGTTG\nTAGCCCACTTCCATTATGTTCTTTCTATAGGAGCAGTATTTGCCATTATAGC
#AGGATTTGTTCATTGATT\nTCCCTTATTTACCGGTTTTACTCTTCATGAAGCCTGAACAAAGATTCATTTTGGCGTCATATTTGCTGGT\nGTAAATTTAACTTTCTTCCCCCAACATTTTTTAGGATTAGC
#AGGAATACCTCGACGCTATTCAGATTATC\nCTGATGCCTACACCTTATGAAACACCGTTTCTTCTGTTGGCTCCCTAATCTCCCTTGTTGCAGTAATCAT\nCATGATATTCATTATTTGAGAAGCTTTTTC
#ATCTAAACGTCTATTTTTGAATGCAGAAATAACCCCTACT\nAATGTTGAATGATTATATGGCTCCCCTCCTCCTTATCACACATTTGAGGAAGCTGTTTATTCTAAAGTA\n\n
#>DQ502853.1 Epipedobates anthonyi isolate 838 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial\nAACCCTATATTTAGTATTT
#GGGGCATGAGCCGGAATAGTAGGAACAGCCCTAAGCCTCTTAATTCGAGCA\nGAATTAAGTCAACCTGGGTCCTTACTGGGCGATGACCAGATTTATAATGTAATCGTAACTGCCCATGCTT\nTCGTCATAA
#TCTTCTTCATGGTTATGCCTATTCTAATTGGTGGATTTGGAAATTGACTGGTACCTTTAAT\nAATTGGAGCCCCAGACATAGCCTTTCCCCGAATAAATAATATAAGTTTTTGACTTCTTCCGCCCTCATTC\
#nCTTCTTCTCTTAGCATCTACAGGAGTAGAAGCAGGAGCTGGGACAGGTTGAACTGTTTATCCTCCTCTTG\nCTGGTAACCTAGCTCATGCAGGCCCATCAGTTGATTTAACCATTTTTTCACTCCATCTTGC
#AGGGGTCTC\nTTCAATTCTAGGAGCAATTAATTTTATTACAACCACATTAAACATAAAACCTCCTTCTTTAACTCAATAC\nCAAACTCCTTTATTTGTCTGATCAGTTCTAATCACCGCAGTACTTCTTCTT
#CTTTCTCTTCCTGTCCTAG\nCTGCAGGAATTACAATGCTTTTAACTGATCGAAACCTTAACACCACTTTCTTTGACCCAGCAGGTGGAGG\nTGACCCCATTCTTTATCAACATCTCTTC\n\n"

We will split this long character vector by the unique character that defines a new fasta entry which is indicated by the >.

one_species_all_fasta <- my_collected_sequences[[2]]
one_species_split_by_fasta <- strsplit(one_species_all_fasta, split = '>')
one_species_split_by_fasta
#[[1]]

#[2] "MW042033.1 Epipedobates anthonyi voucher QCAZ16597 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial\nGTGATAATTA...
#[3] "DQ502853.1 Epipedobates anthonyi isolate 838 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial\nAACCCTATATTTAGTA...

We need to remove these character vectors from the list using the function unlist().

one_species_split_by_fasta_vector <- unlist(one_species_split_by_fasta)
one_species_split_by_fasta_vector

#[2] "MW042033.1 Epipedobates anthonyi voucher QCAZ16597 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial\nGTGATAATTA...
#[3] "DQ502853.1 Epipedobates anthonyi isolate 838 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial\nAACCCTATATTTAGTA...

We can also remove the empty element [1] "" with the application of the logical condition x[x != ""] where x is our vector.

one_species_split_by_fasta_vector <- one_species_split_by_fasta_vector[one_species_split_by_fasta_vector != ""]
one_species_split_by_fasta_vector
#[1] "MW042033.1 Epipedobates anthonyi voucher QCAZ16597 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial\nGTGATAATTA...
#[2] "DQ502853.1 Epipedobates anthonyi isolate 838 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial\nAACCCTATATTTAGTA...

We can now process each of these element as we did on step 3), but using a loop.

# we create an empty list to collect processed sequences

one_species_processed_seq_list <- list()

# the loop to process sequences

for(i in 1:length(one_species_split_by_fasta_vector)) {
    # split and unlist
                 one_species_clean_data <- unlist(strsplit(one_species_split_by_fasta_vector[i], split = '\n'))
    # remove empty element ""
                 one_species_clean_data <- one_species_clean_data[one_species_clean_data != ""]
    # we know the first element will be always the accession number and the rest sequence but the length migh vary so we use length() to determine that.
                   one_species_only_seq <- paste(one_species_clean_data[2:length(one_species_clean_data)], collapse = "") 
    # we name that sequence with the corresponding accession number on the first element of 'one_species_clean_data'.
            names(one_species_only_seq) <- one_species_clean_data[1]
    # finally, collected the processed sequences
    one_species_processed_seq_list[[i]] <- one_species_only_seq
                                        }

# we the unlist this final output for species to a vector
one_species_end <- unlist(one_species_processed_seq_list)
one_species_end
#MW042033.1 Epipedobates anthonyi voucher QCAZ16597 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial 
#"GTGATAATTACCCGATGATTATTCTCCACAAACCATAAAGATATCGGAACCCTATATTTAGTATTTGGGGCATGAGCCGGAATAGTAGGAACAGCCCTAAGCCTCTTAATTCGAGCAGAATT..."
#DQ502853.1 Epipedobates anthonyi isolate 838 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
#"AACCCTATATTTAGTATTTGGGGCATGAGCCGGAATAGTAGGAACAGCCCTAAGCCTCTTAATTCGAGCAGAATTAAGTCAACCTGGGTCCTTACTGGGCGATGACCAGATTTATAATGTAA..."

However, this was just applied to a single species, but we have three. Now, we will generalize this process into single process that show the beauty of automatization with R.

6) A generalization to create this sequence vectors with all previous steps together will use a more complex loop with two nested indices j and i. Yet, you are welcome to explore alternatives.

list_final_fasta <- list()

for(j in 1:length(my_collected_sequences)) {
    # j <- 3
one_species_all_fasta <- my_collected_sequences[[j]]
one_species_split_by_fasta <- strsplit(one_species_all_fasta, split = '>')
one_species_split_by_fasta_vector <- unlist(one_species_split_by_fasta)
one_species_split_by_fasta_vector <- one_species_split_by_fasta_vector[one_species_split_by_fasta_vector != ""]

one_species_processed_seq_list <- list()

for(i in 1:length(one_species_split_by_fasta_vector)) {
                 one_species_clean_data <- unlist(strsplit(one_species_split_by_fasta_vector[i], split = '\n'))
                 one_species_clean_data <- one_species_clean_data[one_species_clean_data != ""]
                   one_species_only_seq <- paste(one_species_clean_data[2:length(one_species_clean_data)], collapse = "") 
            names(one_species_only_seq) <- one_species_clean_data[1]
    one_species_processed_seq_list[[i]] <- one_species_only_seq
                                        }

list_final_fasta[[j]] <- unlist(one_species_processed_seq_list)
                                            }

final_fasta_vector <- unlist(list_final_fasta)
final_fasta_vector
#MW042033.1 Epipedobates anthonyi voucher QCAZ16597 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial 
#"GTGATAATTACCCGATGATTATTCTCCACAAACCATAAAGATATCGGAACCCTATATTTAGTATTTGGGGCATGAGCCGGAATAGTAGGAACAGCCCTAAGCCTCTTAATTCGAGCAGAATT..."
#DQ502853.1 Epipedobates anthonyi isolate 838 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
#"AACCCTATATTTAGTATTTGGGGCATGAGCCGGAATAGTAGGAACAGCCCTAAGCCTCTTAATTCGAGCAGAATTAAGTCAACCTGGGTCCTTACTGGGCGATGACCAGATTTATAATGTAA..."
#...

7) We can append to the top the first processed sequence on step 3) to this long vector.

final_fasta_vector <- c(froggy_one_fasta, final_fasta_vector)
final_fasta_vector
# >MW042030.1 Ameerega bilinguis voucher QCAZ28835 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial 
#"GTGATAATTACTCGATGATTATTTTCTACCAACCACAAAGACATCGGAACTTTATACCTAGTGTTTGGGGCATGAGCAGGCATAGTCGGCACTGCTCTTAGCCTTTTAATTCGAGCCGAATT..." 
#MW042033.1 Epipedobates anthonyi voucher QCAZ16597 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial 
#"GTGATAATTACCCGATGATTATTCTCCACAAACCATAAAGATATCGGAACCCTATATTTAGTATTTGGGGCATGAGCCGGAATAGTAGGAACAGCCCTAAGCCTCTTAATTCGAGCAGAATT..."
#DQ502853.1 Epipedobates anthonyi isolate 838 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
#"AACCCTATATTTAGTATTTGGGGCATGAGCCGGAATAGTAGGAACAGCCCTAAGCCTCTTAATTCGAGCAGAATTAAGTCAACCTGGGTCCTTACTGGGCGATGACCAGATTTATAATGTAA..."
#...

8) We will explore some properties of the resulting character vector.

We can get how many elements (sequences) we have in the vector with the function length(). It has 24 sequences.

length(final_fasta_vector)
[1] 24

We can get the number of characters (i.e., letters or nucleotides) on each sequence the function nchar().

nchar(final_fasta_vector)
#>MW042030.1 Ameerega bilinguis voucher QCAZ28835 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial 
#                                                                                                                  1539 
#     MW042032.1 Ameerega parvula voucher QCAZ16584 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial 
#                                                                                                                  1539 
#MW042033.1 Epipedobates anthonyi voucher QCAZ16597 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial 
#                                                                                                                  1539 
#     DQ502853.1 Epipedobates anthonyi isolate 838 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial 
#                                                                                                                   658 
#         MH589864.1 Oophaga sylvatica isolate CH2 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial 
#                                                                                                                   523 
#...

We can get change the case characters using the function tolower() that converts any upper case characters into lower case. The opposite procedure can be achieved with the function toupper() that converts any lower case characters into upper case.

tolower(final_fasta_vector)[1]
#>MW042030.1 Ameerega bilinguis voucher QCAZ28835 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial 
#"gtgataattactcgatgattattttctaccaaccacaaagacatcggaactttatacctagtgtttggggcatgagcaggcatagtcggcactgctcttagcctttt..."
toupper(final_fasta_vector)[1]
#>MW042030.1 Ameerega bilinguis voucher QCAZ28835 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial 
#"GTGATAATTACTCGATGATTATTTTCTACCAACCACAAAGACATCGGAACTTTATACCTAGTGTTTGGGGCATGAGCAGGCATAGTCGGCACTGCTCTTAGCCTTTTA..."

9) We can split the character vector into chucks of a defined length using following function.

We want to split the vector into individual nucleotides.

one_sequence <- final_fasta_vector[1]
names(one_sequence) <- NULL
one_sequence_nucleotides <- unlist(strsplit(one_sequence, split = ""))
one_sequence_nucleotides
#  [1] "G" "T" "G" "A" "T" "A" "A" "T" "T" "A" "C" "T" "C" "G" "A" "T" "G" "A" "T" "T" "A" "T" "T" "T" "T" "C" "T" "A" "C" "C" "A" "A"
# [33] "C" "C" "A" "C" "A" "A" "A" "G" "A" "C" "A" "T" "C" "G" "G" "A" "A" "C" "T" "T" "T" "A" "T" "A" "C" "C" "T" "A" "G" "T" "G" "T"
# [65] "T" "T" "G" "G" "G" "G" "C" "A" "T" "G" "A" "G" "C" "A" "G" "G" "C" "A" "T" "A" "G" "T" "C" "G" "G" "C" "A" "C" "T" "G" "C" "T"
#...

We can also split one continuous sequence into codons using a the function sapply(). This is a very useful function to process a function into recursively loop applied to a vector.

one_sequence <- final_fasta_vector[1]
names(one_sequence) <- NULL
one_sequence_codons <- sapply(seq(from=1, to=nchar(one_sequence), by=3), 
                                   function(i) substr(one_sequence, i, i+2))
one_sequence_codons
# [1] "GTG" "ATA" "ATT" "ACT" "CGA" "TGA" "TTA" "TTT" "TCT" "ACC" "AAC" "CAC" "AAA" "GAC" "ATC" "GGA" "ACT" "TTA" "TAC" "CTA" "GTG"
#[22] "TTT" "GGG" "GCA" "TGA" "GCA" "GGC" "ATA" "GTC" "GGC" "ACT" "GCT" "CTT" "AGC" "CTT" "TTA" "ATT" "CGA" "GCC" "GAA" "TTA" "AGC"
#[43] "CAG" "CCC" "GGG" "TCC" "TTA" "CTA" "GGC" "GAT" "GAC" "CAG" "ATC" "TAC" "AAC" "GTT" "ATT" "GTT" "ACC" "GCC" "CAT" "GCT" "TTC"
#...

10) We can transform some characters and replace between characters using the function chartr(). We can try to replace thymine (T) with uracil (U) for DNA to RNA sequences.

chartr("T", "U", final_fasta_vector[1])
#>MW042030.1 Ameerega bilinguis voucher QCAZ28835 cytochrome oxidase subunit I (COI) >MW042030.1 Ameerega bilinguis voucher QCAZ28835 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial 
#"GUGAUAAUUACUCGAUGAUUAUUUUCUACCAACCACAAAGACAUCGGAACUUUAUACCUAGUGUUUGGGGCAUGAGCAGGCAUAGUCGGCACUGCUCUUAGCCUUUUAAUUCG..."

4.2 Regular expressions (Regex)

Finding patterns of characters in strings can be done in many different ways, some are extremely useful, and many cases are tedious. In general, they follow a similar procedure that starts with finding a pattern and indexing it, replacing this pattern with other string, testing for the presence/absence of the pattern, and extracting characters nearby or between such known patterns. However, they can be hard to implement, and I will present some few applications that I find useful for bioinformatics while working with sequences.

1) Finding patterns with grep() and grepl(). These functions will search for matches to an argument pattern within each element of a character vector.

We can try to find if any of our sequences have a particular set of characters (i.e., nucleotide sequences) and report which element has them within the final_fasta_vector.

The result below indicate that only sequence with index [1] and [2] have such pattern.

my_pattern <- "ACTGGCTGAACTGTGTACCC"
what_sequences <- grep(pattern = my_pattern, final_fasta_vector, ignore.case = FALSE, fixed = FALSE, invert = FALSE)
what_sequences
#[1] 1 2
final_fasta_vector[what_sequences]
#>MW042030.1 Ameerega bilinguis voucher QCAZ28835 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
"GTGATAATTACTCGATGATTATTTTCTACCAACCACAAAGACATCGGAACTTTATACCTAGTGTTTGGGGCATGAGCAGGCATAGTCGGCACTGCTCTTAGCCTTTTAATTCGAGCCGAA..." 
#MW042032.1 Ameerega parvula voucher QCAZ16584 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
#"GTGATAATTACTCGATGATTATTTTCCACCAACCATAAAGATATTGGAACCTTATACCTAGTGTTTGGGGCGTGAGCGGGCATAGTCGGCACTGCTCTTAGCCTCTTAATTCGAGCCGAG..."

A similar result can be obtained with grepl(), but it will be a logical vector and only those elements that match will have a TRUE value and are returned when what_sequences is used to index the final_fasta_vector.

my_pattern <- "ACTGGCTGAACTGTGTACCC"
what_sequences <- grepl(pattern = my_pattern, final_fasta_vector, ignore.case = FALSE, fixed = FALSE)
what_sequences
#[1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#[23] FALSE FALSE
final_fasta_vector[what_sequences]
#>MW042030.1 Ameerega bilinguis voucher QCAZ28835 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
#"GTGATAATTACTCGATGATTATTTTCTACCAACCACAAAGACATCGGAACTTTATACCTAGTGTTTGGGGCATGAGCAGGCATAGTCGGCACTGCTCTTAGCCTTTTAATTCGAGCCGAA..." 
#MW042032.1 Ameerega parvula voucher QCAZ16584 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
#"GTGATAATTACTCGATGATTATTTTCCACCAACCATAAAGATATTGGAACCTTATACCTAGTGTTTGGGGCGTGAGCGGGCATAGTCGGCACTGCTCTTAGCCTCTTAATTCGAGCCGAG..."

2) Replacing patterns with sub() and gsub(). These functions will search for matches to an argument pattern and replace them with a text that you must provide. The sub() will replace the first occurrence leaving the others as they are. On the other hand, the gsub() will replace all the strings or values with the input pattern.

my_pattern <- "CAAACAAA"
my_replacement <-"!!!found_it!!!"
fasta_vector_replacements <- sub(pattern = my_pattern,
                             replacement = my_replacement,
                            final_fasta_vector)
fasta_vector_replacements
#...
#DQ502853.1Epipedobatesanthonyiisolate838cytochromeoxidasesubunitI(COI)gene,partialcds;mitochondrial
#"AACCCTATATTTAGTATTTGGGGCATGAGCCGGAATAGTAGGAACAGCCCTAAGCCTCTTAATTCGAGCAGAATTAAGTCAACCTGGGTCCTTACT..."
#MH589864.1OophagasylvaticaisolateCH2cytochromeoxidasesubunitI(COI)gene,partialcds;mitochondrial
#"TTACATTCTCATCCTCCCAGGCTTCGGAATCATCTCCCATGTAGTCACGTTTTACT!!!found_it!!!AAAGAGCCATTTGGCTACATAGGAAT..."
#MH589863.1OophagasylvaticaisolateCH1cytochromeoxidasesubunitI(COI)gene,partialcds;mitochondrial
#"TTACATTCTCATCCTCCCAGGCTTCGGAATCATCTCTCATGTAGTCACGTTTTACT!!!found_it!!!AAAGAGCCATTTGGCTACATAGGAAT..."
#MH589862.1OophagasylvaticaisolateMC2cytochromeoxidasesubunitI(COI)gene,partialcds;mitochondrial
#"TTACATTCTCATCCTCCCAGGGTTCGGAATCATCTCCCATGTAGTCACGTTTTACT!!!found_it!!!AAAGAGCCATTTGGTTACATAGGAAT..."
#...

# using gsub

fasta_vector_replacements <- gsub(pattern = my_pattern,
                              replacement = my_replacement,
                            final_fasta_vector)
fasta_vector_replacements
#...
#DQ502853.1Epipedobatesanthonyiisolate838cytochromeoxidasesubunitI(COI)gene,partialcds;mitochondrial
#"AACCCTATATTTAGTATTTGGGGCATGAGCCGGAATAGTAGGAACAGCCCTAAGCCTCTTAATTCGAGCAGAATTAAGTCAACCTGGGTCCTTACT..."
#MH589864.1OophagasylvaticaisolateCH2cytochromeoxidasesubunitI(COI)gene,partialcds;mitochondrial
#"TTACATTCTCATCCTCCCAGGCTTCGGAATCATCTCCCATGTAGTCACGTTTTACT!!!found_it!!!AAAGAGCCATTTGGCTACATAGGAAT..."
#MH589863.1OophagasylvaticaisolateCH1cytochromeoxidasesubunitI(COI)gene,partialcds;mitochondrial
#"TTACATTCTCATCCTCCCAGGCTTCGGAATCATCTCTCATGTAGTCACGTTTTACT!!!found_it!!!AAAGAGCCATTTGGCTACATAGGAAT..."
#MH589862.1OophagasylvaticaisolateMC2cytochromeoxidasesubunitI(COI)gene,partialcds;mitochondrial
#"TTACATTCTCATCCTCCCAGGGTTCGGAATCATCTCCCATGTAGTCACGTTTTACT!!!found_it!!!AAAGAGCCATTTGGTTACATAGGAAT..."
#...

3) We can match multiple patterns using the | that signifies or. We will use grepl() to exemplify its use.

Here we are testing if it matches any of two patterns. These might not need to be adjacent (i.e., there are some chucks of sequence between them).

my_pattern <- c("TCATCTCCCATGTA", "GTGATAATTACTCGATGATTATTTT")
my_pattern <- paste0(my_pattern, collapse = "|")
my_pattern
#[1] "TCATCTCCCATGTA|GTGATAATTACTCGATGATTATTTT"
what_sequences <- grepl(my_pattern, final_fasta_vector, ignore.case = FALSE, fixed = FALSE)
what_sequences
#[[1]  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
#[23]  TRUE  TRUE

Here we are testing if the vector elements match both patterns at the same time. These are not necessarily adjacent (i.e., there could be a chuck sequence between them). We can use wildcard character . (i.e., it represents any character except \n, which is a new line) and * is a greedy character (i.e., the asterisk will always match the longest possible string).

my_pattern <- c("GTGATAATTACTCGATGATTATTTTCTA", "ACTTTATACCTAGTGTTTGGGGCA")
my_pattern <- paste0(my_pattern, collapse = ".*")
my_pattern
#[1] "GTGATAATTACTCGATGATTATTTTCTA.*ACTTTATACCTAGTGTTTGGGGCA"
what_sequences <- grepl(my_pattern, final_fasta_vector, ignore.case = FALSE, fixed = FALSE)
what_sequences
# [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#[23] FALSE FALSE

Notice that we can overrun the wildcard nature of these characters by changing the argument fixed = TRUE. This means a literal match to .* in the sequence.

my_pattern <- c("GTGATAATTACTCGATGATTATTTTCTA", "ACTTTATACCTAGTGTTTGGGGCA")
my_pattern <- paste0(my_pattern, collapse = ".*")
my_pattern
#[1] "GTGATAATTACTCGATGATTATTTTCTA.*ACTTTATACCTAGTGTTTGGGGCA"
what_sequences <- grepl(my_pattern, final_fasta_vector, ignore.case = FALSE, fixed = TRUE)

# This result in all cases as FALSE because we fixed to match EXACTLY the text pattern "GTGATAATTACTCGATGATTATTTTCTA.*ACTTTATACCTAGTGTTTGGGGCA"

what_sequences
# [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#[23] FALSE FALSE

4) We can use some characters to match only numbers, only lower case letters or special symbols. These can be used to find or replace text after or before these special characters.

To illustrate some of their use we will construct a vector with random text. We will use the function cat() to concatenate and print this vector.

my_random_vector <- c("text_file.txt", "text_file.csv", "text_file2.txt", "multiple\ndots\nin\nthis.line", "multiple\tdots\tin\tthis.line", "multiple.dots.in.this.line", "CaPiTaL_123_NYC")
my_random_vector
#[1] "text_file.txt"                 "text_file.csv"                 "text_file2.txt"                "multiple\ndots\nin\nthis.line"
#[5] "multiple\tdots\tin\tthis.line" "multiple.dots.in.this.line"    "CaPiTaL_123_NYC"              
cat(my_random_vector)
#text_file.txt text_file.csv text_file2.txt multiple
#dots
#in
#this.line multiple dots    in  this.line multiple.dots.in.this.line CaPiTaL_123_NYC

We can use gsub() to replace both \n (new line) and \t (tabs) using | that means an or condition with a - as a replacement. Compare output before and after the replacement.

my_vector_processed <- gsub(pattern = "\n|\t", replacement = "-", my_random_vector)
my_random_vector[4:5]
#[1] "multiple\ndots\nin\nthis.line" "multiple\tdots\tin\tthis.line"
my_vector_processed[4:5]
#[1] "multiple-dots-in-this.line" "multiple-dots-in-this.line"

We can use gsub() to replace all digits (i.e., 0 to 9) using [[:digit:]] with - replacement. Compare output before and after the replacement.

my_vector_processed <- gsub(pattern = "[[:digit:]]", replacement = "-", my_random_vector)
my_random_vector[c(3,7)]
#[1] "text_file2.txt"  "CaPiTaL_123_NYC"
my_vector_processed[c(3,7)]
#[1] "text_file-.txt"  "CaPiTaL_---_NYC"

We can use gsub() to replace all non-digits using \\D or [^[:digit:]] with - replacement. Notice the ^ that is used to negate (i.e., the opposite of digits, will be non-digits). Compare output before and after the replacement.

my_vector_processed <- gsub(pattern = "\\D", replacement = "-", my_random_vector)
my_random_vector
#[1] "text_file.txt"                 "text_file.csv"                 "text_file2.txt"                "multiple\ndots\nin\nthis.line"
#[5] "multiple\tdots\tin\tthis.line" "multiple.dots.in.this.line"    "CaPiTaL_123_NYC"              
my_vector_processed
#[1] "-------------"              "-------------"              "---------2----"             "--------------------------"
#[5] "--------------------------" "--------------------------" "--------123----"
my_vector_processed <- gsub(pattern = "[^[:digit:]]", replacement = "-", my_random_vector)
my_vector_processed
#[1] "-------------"              "-------------"              "---------2----"             "--------------------------"
#[5] "--------------------------" "--------------------------" "--------123----"

We can use gsub() to replace all periods using[.] with spaces using or with nothing as "". Compare output before and after the replacement.

my_vector_processed <- gsub(pattern = "[.]", replacement = " ", my_random_vector)
my_random_vector[6]
#[1] "multiple.dots.in.this.line"
my_vector_processed[6]
#[1] "multiple dots in this line"
my_vector_processed <- gsub(pattern = "[.]", replacement = "", my_random_vector)
my_vector_processed[6]
#[1] "multipledotsinthisline"

5) We can use some anchor characters ^ (for START of the string) and $ (for END of the string). These are especially characters if you want to replace or find a pattern like the extension of a file.

These strings have the text pattern at the beginning.

these_strings <- grepl(pattern = "^text", my_random_vector, ignore.case = FALSE, fixed = FALSE)
my_random_vector[these_strings]
#[1] "text_file.txt"  "text_file.csv"  "text_file2.txt"

We can get the file names that end with .txt pattern.

these_are_txt_files <- grepl(pattern = "*.txt$", my_random_vector, ignore.case = FALSE, fixed = FALSE)
my_random_vector[these_are_txt_files]
#[1] "text_file.txt"  "text_file2.txt"

6) An interesting application of regular expressions is to extract text (e.g., nucleotide sequence) between two patterns. This is a powerful demonstration of grepl() and gsub().

my_pattern <- c("GTGATAATTACTCGATGATTATTTTCTA", "ACTTTATACCTAGTGTTTGGGGCA")
my_pattern <- paste0("^.*", my_pattern[1], "\\s*|\\s*", my_pattern[2], ".*$")
my_pattern
#[1] "^.*GTGATAATTACTCGATGATTATTTTCTA\\s*|\\s*ACTTTATACCTAGTGTTTGGGGCA.*$"
is_the_pattern_present <- grepl(pattern = my_pattern, final_fasta_vector)
is_the_pattern_present
# [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#[23] FALSE FALSE
sequences_that_have_pattern <- final_fasta_vector[is_the_pattern_present]
my_extracted_sequences <- gsub(pattern = my_pattern, replacement = '', sequences_that_have_pattern)
my_extracted_sequences
#>MW042030.1 Ameerega bilinguis voucher QCAZ28835 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial 
#"CCAACCACAAAGACATCGGA"

4.3 Comparing, ordering and simplifying vectors

1) There are several operations that can be applied to vectors that can help you manage the information contained in them. They are applied to one or two vectors. We will illustrate some examples.

We will build two relatively similar vectors.

my_vector_1 <- c("Allobates", "Epipedobates", "Dendrobates", "Hyloxalus", "Mannophryne", "Hyloxalus", "Silverstoneia")
my_vector_2 <- c("Epipedobates", "Epipedobates","Dendrobates", "Hyloxalus", "Mannophryne", "Hyloxalus", "Rheobates")

We can remove duplicated entries using the function unique().

unique(my_vector_1)
#[1] "Allobates"     "Epipedobates"  "Dendrobates"   "Hyloxalus"     "Mannophryne"   "Silverstoneia"
unique(my_vector_2)
#[1] "Epipedobates" "Dendrobates"  "Hyloxalus"    "Mannophryne"  "Rheobates"

2) The union of elements of two vectors is performed with union(). This function will result in a new vector that contain elements from both vectors even if they are not in common but discards any duplicated values.

union(my_vector_1, my_vector_2)
#[1] "Allobates"     "Epipedobates"  "Dendrobates"   "Hyloxalus"     "Mannophryne"   "Silverstoneia" "Rheobates"

3) The intersection of elements of two vectors is determined with intersect(). This function will result in a new vector that contain elements from both vectors, and these are elements in common, but discards any duplicated values.

intersect(my_vector_1, my_vector_2)
#[1] "Epipedobates" "Dendrobates"  "Hyloxalus"    "Mannophryne"

4) The function setdiff() results in a vector of elements that has those that are not in common between the two vectors. However, it depends on vector that corresponds to x on the function as setdiff(x, y).

This will return elements unique to my_vector_1 and not in common with my_vector_2.

setdiff(my_vector_1, my_vector_2)
#[1] "Allobates"     "Silverstoneia"

This will return elements unique to my_vector_2 and not in common with my_vector_1.

setdiff(my_vector_2, my_vector_1)
#[1] "Rheobates"

5) The function setequal() results in a logical test (i.e., TRUE or FALSE) that determines if these two vector contain the same elements.

setequal(my_vector_2, my_vector_1)
#[1] FALSE

6) The function is.element() results in logical test (i.e., TRUE or FALSE) that determines if an element is present in a vector.

is.element(c("Allobates", "Rheobates"), my_vector_1)
#[[1]  TRUE FALSE
is.element(c("Allobates", "Rheobates"), my_vector_2)
#[1] FALSE  TRUE

A similar, but more compact, form is the function %in%.

c("Allobates", "Rheobates") %in% my_vector_1
#[[1]  TRUE FALSE
c("Allobates", "Rheobates") %in% my_vector_2
#[1] FALSE  TRUE

7) We can organize the elements in a vector alphabetically or numerically using the function sort(). This ordering is by default increasing, but you can order in decreasing fashion by changing the argument decreasing = TRUE.

sort(my_vector_1)
#[1] "Allobates"     "Dendrobates"   "Epipedobates"  "Hyloxalus"     "Hyloxalus"     "Mannophryne"   "Silverstoneia"
sort(my_vector_1, decreasing = TRUE)
#[1] "Silverstoneia" "Mannophryne"   "Hyloxalus"     "Hyloxalus"     "Epipedobates"  "Dendrobates"   "Allobates"

8) We can concatenate two vectors with the function c(). We can also remove duplicated entries and order the final vector with the functions unique() and sort().

my_vector_1_2 <- sort(unique(c(my_vector_1,my_vector_2)))
my_vector_1_2
#[1] "Allobates"     "Dendrobates"   "Epipedobates"  "Hyloxalus"     "Mannophryne"   "Rheobates"     "Silverstoneia"

4.4 Useful functions of `stringr`

9) There are several specialized R-packages that deal with strings and more or less simplify what we have demonstrated above. However, it forces you to be dependent on loading these packages in the R environment before you can use them.

One of the most used is stringr that is part of the ‘tidyverse’ family. This package has an extensive manual and some functions can make some of the processing of string or character vectors easier.

# to install R-package stringr
install.packages("stringr")
library(stringr)

One useful function is str_count() that can count the number of matches of a pattern in a string.

We can construct a data frame with the total counts of each nucleotide bases.

adenine   <- str_count(final_fasta_vector, pattern = "A")
cytocine  <- str_count(final_fasta_vector, pattern = "C")
guanine   <- str_count(final_fasta_vector, pattern = "G")
thymidine <- str_count(final_fasta_vector, pattern = "T")
nucleotide_numbers <- data.frame(n_adenine = adenine,
                                 n_cytocine = cytocine,
                                 n_guanine = guanine,
                                 n_thymidine = thymidine,
                                 genbank_accession = names(final_fasta_vector),
                                 stringsAsFactors = FALSE)
head(nucleotide_numbers)
#  n_adenine n_cytocine n_guanine n_thymidine
#1       384        410       256         489
#2       369        399       273         498
#3       395        340       250         554
#4       169        161       108         220
#5       136        116        94         177
#6       137        117        92         177
#                                                                                                       genbank_accession
#1   >MW042030.1 Ameerega bilinguis voucher QCAZ28835 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
#2      MW042032.1 Ameerega parvula voucher QCAZ16584 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
#3 MW042033.1 Epipedobates anthonyi voucher QCAZ16597 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
#4       DQ502853.1 Epipedobates anthonyi isolate 838 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
#5           MH589864.1 Oophaga sylvatica isolate CH2 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
#6           MH589863.1 Oophaga sylvatica isolate CH1 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial

Another useful function is str_locate_all() that can find where a pattern is located in the vector elements (i.e., the position of the characters that match the pattern).

We can construct a data frame with the count number of each nucleotide bases.

restriction_sites_to_cut_with_BFAL <- str_locate_all(final_fasta_vector, pattern = "CTAG")
names(restriction_sites_to_cut_with_BFAL) <- names(final_fasta_vector)
restriction_sites_to_cut_with_BFAL[1:2]
#`>MW042030.1 Ameerega bilinguis voucher QCAZ28835 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial`
#     start  end
#[1,]    58   61
#[2,]   142  145
#[3,]   337  340
#[4,]   406  409
#[5,]   475  478
#[6,]   785  788
#[7,]   847  850
#[8,]  1384 1387
#$`MW042032.1 Ameerega parvula voucher QCAZ16584 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial`
#     start  end
#[1,]    58   61
#[2,]   475  478
#[3,]   785  788
#[4,]   847  850
#[5,]  1024 1027
#[6,]  1072 1075
#[7,]  1384 1387

4.5 Data transformations

An important application of R in data analyses is transformations (e.g., one data structure into another or merging data frames).

1) Transforming vectors into other data structures (e.g., list or data frame) are usually straightforward.

From vector to list using the function as.list().

my_vector <- 1:10
my_vector
#[1]  1  2  3  4  5  6  7  8  9 10
my_list <- as.list(my_vector)
my_list

From list to vector using the function unlist().

unlist(my_list)
#[1]  1  2  3  4  5  6  7  8  9 10

From vector to a data frame using the function as.data.frame().

my_data_frame <- as.data.frame(my_vector, stringsAsFactors = FALSE)
head(my_data_frame)
#  my_vector
#1         1
#2         2
#3         3
#4         4
#5         5
#6         6

2) Apply a function as a loop to each element of a list using the function lapply().

# create a list of two vectors of numbers
my_list2 <- list(odd_numbers = c(1,3,5,7,9),
               prime_numbers = c(1,2,3,5,7,11,13,17))
my_list2
#$odd_numbers
#[1] 1 3 5 7 9
#$prime_numbers
#[1]  1  2  3  5  7 11 13 17

Then, we can apply the function mean() to both elements (the two numbers vectors) in my_list2 using the loop function lapply() that it is used for lists.

lapply(my_list2, mean)
#$odd_numbers
#[1] 5
#$prime_numbers
#[1] 7.375

3) Apply a function to rows or columns of a data frame or matrix using the function apply().

We can create a data frame of numbers in columns.

my_data_frame <- data.frame(A = c(1,2,NA,3,4), 
                            B = c(NA,4,1,67,-2), 
                            C = c(-2,3,1,99,NA), 
                            D = c(78,56,9,3,NA), stringsAsFactors = FALSE)

Then, we can add the name of the with the function rownames() to each row of the data frame.

rownames(my_data_frame) <- c("row_1", "row_2", "row_3", "row_4", "row_5")
my_data_frame
#         A  B  C  D
#  row_1  1 NA -2 78
#  row_2  2  4  3 56
#  row_3 NA  1  1  9
#  row_4  3 67 99  3
#  row_5  4 -2 NA NA

Now we can apply a function max() to every row. Notice the 1 in apply() that denotes that it will get the maximum value per each individual row.

names_columns <- names(my_data_frame)
max_values <- apply(my_data_frame,1,max)
names(max_values) <- rownames(my_data_frame)
max_values
#row_1 row_2 row_3 row_4 row_5 
#   NA    56    NA    99    NA

Notice that some values were NA (no data), to remove those NA while applying the function max(), we need the argument na.rm = TRUE.

max_values_no_NA <- apply(my_data_frame,1,max, na.rm = TRUE)
names(max_values_no_NA) <- rownames(my_data_frame)
max_values_no_NA
#row_1 row_2 row_3 row_4 row_5 
#   78    56     9    99     4

To apply a function max() to every column, we will use 2 in apply().

max_values_columns_no_NA <- apply(my_data_frame,2,max, na.rm = TRUE)
names(max_values_columns_no_NA) <- names(my_data_frame)
max_values_columns_no_NA
# A  B  C  D 
# 4 67 99 78

4) Here are some build-in and useful functions that can be applied to rows and columns.

colSums (my_data_frame, na.rm = TRUE)
#  A   B   C   D 
# 10  70 101 146 
rowSums (my_data_frame, na.rm = TRUE)
#row_1 row_2 row_3 row_4 row_5 
#   77    65    11   172     2 
colMeans(my_data_frame, na.rm = TRUE)
#    A     B     C     D 
# 2.50 17.50 25.25 36.50 
rowMeans(my_data_frame, na.rm = TRUE)
#    row_1     row_2     row_3     row_4     row_5 
#25.666667 16.250000  3.666667 43.000000  1.000000

5) We can apply a function to every cell in a data frame using the R-package purrr that is part of the ‘tidyverse’ family and its function modify().

# install R-package 'purrr' if not present
install.packages("purrr")
library(purrr)

We will apply a function that assigns 1 to zero and positive numbers in the data frame and 0 to negative numbers. Notice we typed purrr::modify to indicate that we what the function modify() only from the purrr R-package. I also created a function that for every x (a cell value) will determine if this x is more or equal to 0. This function will return 1 if TRUE or 0 if FALSE by using the function ifelse(). This function has three parts: a conditional test, output_1 if TRUE and output_2 if FALSE.

my_data_frame_1_0 <- purrr::modify(my_data_frame,function(x)ifelse(x>=0,1,0))
rownames(my_data_frame_1_0) <- rownames(my_data_frame)
my_data_frame_1_0
#       A  B  C  D
#row_1  1 NA  0  1
#row_2  1  1  1  1
#row_3 NA  1  1  1
#row_4  1  1  1  1
#row_5  1  0 NA NA

6) You can add a new column in a data frame to store the result of a calculation.

As an example, we will create a sum of all values per row names into a column named sum_of_my_rows.

my_data_frame$sum_of_my_rows <- rowSums (my_data_frame, na.rm = TRUE)
my_data_frame
#       A  B  C  D sum_of_my_rows
#row_1  1 NA -2 78             77
#row_2  2  4  3 56             65
#row_3 NA  1  1  9             11
#row_4  3 67 99  3            172
#row_5  4 -2 NA NA              2

7) You can also add a new row to the data frame and store the result of a calculation using the function rbind().

my_column_sums <- colSums (my_data_frame, na.rm = TRUE)
# we add this row using rbind() and modify my_data_frame
my_data_frame <- rbind(my_data_frame, my_column_sums)
my_data_frame
#       A  B   C   D sum_of_my_rows
#row_1  1 NA  -2  78             77
#row_2  2  4   3  56             65
#row_3 NA  1   1   9             11
#row_4  3 67  99   3            172
#row_5  4 -2  NA  NA              2
#6     10 70 101 146            327

Next, we can update the names of the added rows following the format row_x where x is the number of the new row.

rownames(my_data_frame)[6] <- "row_6"
my_data_frame
#       A  B   C   D sum_of_my_rows
#row_1  1 NA  -2  78             77
#row_2  2  4   3  56             65
#row_3 NA  1   1   9             11
#row_4  3 67  99   3            172
#row_5  4 -2  NA  NA              2
#row_6 10 70 101 146            327

8) Another very useful application that can modify data frames is the result of applying the function merge() that can concatenate two data frames using a common column that serves as an index.

For example, we could merge my_data_frame and my_data_frame_1_0. For this purpose, we need to prepare to my_data_frame_1_0 to have the same number of rows and different column names.

First, add an extra row to my_data_frame_1_0, so both data frames have the same number of rows with the same row names.

my_data_frame_1_0[6,] <- c(NA,NA,NA,NA)
rownames(my_data_frame_1_0)[6] <- "row_6"
my_data_frame_1_0
#       A  B  C  D
#row_1  1 NA  0  1
#row_2  1  1  1  1
#row_3 NA  1  1  1
#row_4  1  1  1  1
#row_5  1  0 NA NA
#row_6 NA NA NA NA

Second, we will change the names of columns of my_data_frame_1_0 so these do not overlap with those of my_data_frame.

names_my_data_frame_1_0 <- names(my_data_frame_1_0)
new_names_my_data_frame_1_0 <- paste0(names_my_data_frame_1_0, "_as_1_0")
names(my_data_frame_1_0) <- new_names_my_data_frame_1_0
my_data_frame_1_0
#      A_as_1_0 B_as_1_0 C_as_1_0 D_as_1_0
#row_1        1       NA        0        1
#row_2        1        1        1        1
#row_3       NA        1        1        1
#row_4        1        1        1        1
#row_5        1        0       NA       NA
#row_6       NA       NA       NA       NA

Third, we will create a common column with same name, i.e., index, in both data frames that we want to merge. In this case using the names of the rows as cell values for the common column named index.

my_data_frame$index <- rownames(my_data_frame)
my_data_frame_1_0$index <- rownames(my_data_frame_1_0)

Finally, we can now merge both data frames using the column named index using merge() with the argument by = 'index'.

my_combined_data_frame <- merge(my_data_frame, my_data_frame_1_0, by = 'index')
my_combined_data_frame
#  index  A  B   C   D sum_of_my_rows A_as_1_0 B_as_1_0 C_as_1_0 D_as_1_0
#1 row_1  1 NA  -2  78             77        1       NA        0        1
#2 row_2  2  4   3  56             65        1        1        1        1
#3 row_3 NA  1   1   9             11       NA        1        1        1
#4 row_4  3 67  99   3            172        1        1        1        1
#5 row_5  4 -2  NA  NA              2        1        0       NA       NA
#6 row_6 10 70 101 146            327       NA       NA       NA       NA