Session 4 – Text Editing and Data Transformations
4.1 Working with character strings
1) A character vector contains text (e.g., nucleotide or amino acid sequence) that is represented by a sequence of characters (e.g., A, T, C, G, U, Met). A character vector can also contain numbers that will be considered as text (i.e., not quantities). For example, we will download sequence of a COX1 mitochondrial gene of a charismatic poison frog Ameerega bilinguis.
We will follow the example from previous sessions using the R-package rentrez.
library(rentrez)
# if you need to install R-package 'rentrez', use install.packages("rentrez")
# Download some nucleotide sequences from NCBI for Ameerega bilinguis[Organism] that correspond to the gene COI
<- "Ameerega bilinguis[Organism] AND COI[Gene]"
froggy_name_COX1 <- entrez_search(db="nuccore", term= froggy_name_COX1)
froggy_seq_IDs #Entrez search result with 1 hits (object contains 1 IDs and no web_history object)
# Search term (as translated): "Ameerega bilinguis"[Organism] AND COI[Gene]
# revising the structure of 'froggy_seq_IDs' that there are 1 sequence in NCBI nuccore database
str(froggy_seq_IDs)
#List of 5
# $ ids : chr "1952638358"
# $ count : int 1
# $ retmax : int 1
# $ QueryTranslation: chr "\"Ameerega bilinguis\"[Organism] AND COI[Gene]"
# $ file :Classes 'XMLInternalDocument', 'XMLAbstractDocument' <externalptr>
# - attr(*, "class")= chr [1:2] "esearch" "list"
<- entrez_fetch(db="nuccore", id=froggy_seq_IDs$ids, rettype="fasta")
froggy_seqs_fasta
froggy_seqs_fasta#[1] ">MW042030.1 Ameerega bilinguis voucher QCAZ28835 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial\nGTGATAATTACTCGATGATTATTTTCTACCAACCACAAAGACATCGGAACTTTATACCTAGTGTTTGGGG\nCATGAGCAGGCATAGTCGGCACTGCTCTTAGCCTTTTAATTCGAGCCGAATTAAGCCAGCCCGGGTCCTT\nACTAGGCGATGACCAGATCTACAACGTTATTGTTACCGCCCATGCTTTCGTTATAATCTTTTTTATAGTA\nATGCCAATTCTAATCGGTGGCTTTGGGAATTGATTAGTGCCCCTAATAATTGGAGCCCCAGACATAGCTT\nTTCCCCGAATAAACAATATGAGCTTTTGGCTTCTTCCCCCCTCTTTCCTACTACTCCTAGCATCCGCAGG\nCGTTGAAGCAGGCGCCGGTACTGGCTGAACTGTGTACCCTCCCCTTGCAGGCAACCTAGCTCATGCTGGC\nCCATCAGTTGATTTAACTATTTTTTCACTTCATCTCGCCGGTGTTTCTTCTATTCTAGGGGCAATTAACT\nTTATTACAACAACCTTAAACATAAAACCCCCTTCATTAACACAATATCAAACCCCATTATTTGTCTGATC\nTGTATTAATTACTGCAGTCCTTCTTCTTCTCTCCCTCCCAGTTCTGGCTGCCGGAATCACTATACTCTTG\nACTGACCGAAACCTAAACACCACCTTCTTTGACCCAGCAGGTGGAGGCGACCCTGTCCTGTACCAACACC\nTGTTCTGATTCTTTGGTCACCCCGAAGTCTACATCCTTATCCTGCCTGGATTTGGTATCATCTCCCATGT\nTGTCACATTCTACTCTAGCAAAAAAGAACCCTTCGGCTATATAGGAATAGTCTGAGCTATAATATCGATT\nGGTCTCCTAGGTTTCATTGTTTGAGCTCACCACATATTCACAACAGACCTTAATGTAGACACTCGAGCCT\nACTTTACCTCAGCTACTATAATCATCGCTATCCCAACAGGTGTCAAAGTCTTTAGCTGACTTGCCACCAT\nGCACGGAGGAATTATTAAATGAGACGCCGCCATATTATGGGCTCTCGGATTCATCTTTTTATTTACAGTT\nGGAGGACTAACTGGAATCGTTTTAGCCAACTCCTCTTTAGACATTGTTTTGCATGATACATATTATGTAG\nTAGCCCACTTTCACTACGTTCTTTCTATGGGGGCAGTATTTGCCATTATAGCCGGCTTCGTACACTGATT\nTCCTCTCTTTTCCGGATTTACCCTTCATGAAGCCTGAACAAAAATTCAATTTGGCGTCATATTTACCGGC\nGTAAATTTAACATTCTTCCCCCAGCATTTCTTAGGTCTCGCAGGCATGCCTCGACGTTATTCAGACTACC\nCTGACGCCTACACATTATGAAACACCGTTTCATCAATCGGCTCTTTAATCTCTCTAGTTGCAGTAATCAT\nTATGATGTTTATCATTTGAGAAGCTTTCTCTTCCAAACGCCTACCTCTACCTGCAGAAATAACCCCAACT\nAATGTAGAATGATTATACGGATCCCCCCCACCTTACCACACTTTTGAGGAAGCCGTTTACTCCAAAATT\n\n"
The froggy_seqs_fasta
object is a character vector as indicated by the sequence of characters (i.e., letters and numbers) contained between quotes " "
. Likewise, we can identify that this vector contains a sequence in a fasta format. We will dissect this character vector.
# 1) A fasta header starting with '>', the NCBI accession number 'MW042030.1'
# >MW042030.1
# 2) A defition as is present in the GenBank format for this sequence 'Ameerega bilinguis voucher QCAZ28835 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial'
# Ameerega bilinguis voucher QCAZ28835 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
# 3) A new line coding indicated by '\n'
# \n
# 4) The actual nucleotide sequence in capital letters with some new line characters '\n' interspersed in the sequence
# GTGATAATTACTCGATGATTATTTTCTACCAACCACAAAGACATCGGAACTTTATACCTAGTGTTTGGGG\nCATGAGCAGGCATAGTCGGCACTGCTCTTAGCCTTTTAATTCGAGCCGAATTAAGCCAGCCCGGGTCCTT\nACTAGGCGATGACCAGATCTACAACGTTATTGTTACCGCCCATGCTTTCGTTATAATCTTTTTTATAGTA\nATGCCAATTCTAATCGGTGGCTTTGGGAATTGATTAGTGCCCCTAATAATTGGAGCCCCAGACATAGCTT\nTTCCCCGAATAAACAATATGAGCTTTTGGCTTCTTCCCCCCTCTTTCCTACTACTCCTAGCATCCGCAGG\nCGTTGAAGCAGGCGCCGGTACTGGCTGAACTGTGTACCCTCCCCTTGCAGGCAACCTAGCTCATGCTGGC\nCCATCAGTTGATTTAACTATTTTTTCACTTCATCTCGCCGGTGTTTCTTCTATTCTAGGGGCAATTAACT\nTTATTACAACAACCTTAAACATAAAACCCCCTTCATTAACACAATATCAAACCCCATTATTTGTCTGATC\nTGTATTAATTACTGCAGTCCTTCTTCTTCTCTCCCTCCCAGTTCTGGCTGCCGGAATCACTATACTCTTG\nACTGACCGAAACCTAAACACCACCTTCTTTGACCCAGCAGGTGGAGGCGACCCTGTCCTGTACCAACACC\nTGTTCTGATTCTTTGGTCACCCCGAAGTCTACATCCTTATCCTGCCTGGATTTGGTATCATCTCCCATGT\nTGTCACATTCTACTCTAGCAAAAAAGAACCCTTCGGCTATATAGGAATAGTCTGAGCTATAATATCGATT\nGGTCTCCTAGGTTTCATTGTTTGAGCTCACCACATATTCACAACAGACCTTAATGTAGACACTCGAGCCT\nACTTTACCTCAGCTACTATAATCATCGCTATCCCAACAGGTGTCAAAGTCTTTAGCTGACTTGCCACCAT\nGCACGGAGGAATTATTAAATGAGACGCCGCCATATTATGGGCTCTCGGATTCATCTTTTTATTTACAGTT\nGGAGGACTAACTGGAATCGTTTTAGCCAACTCCTCTTTAGACATTGTTTTGCATGATACATATTATGTAG\nTAGCCCACTTTCACTACGTTCTTTCTATGGGGGCAGTATTTGCCATTATAGCCGGCTTCGTACACTGATT\nTCCTCTCTTTTCCGGATTTACCCTTCATGAAGCCTGAACAAAAATTCAATTTGGCGTCATATTTACCGGC\nGTAAATTTAACATTCTTCCCCCAGCATTTCTTAGGTCTCGCAGGCATGCCTCGACGTTATTCAGACTACC\nCTGACGCCTACACATTATGAAACACCGTTTCATCAATCGGCTCTTTAATCTCTCTAGTTGCAGTAATCAT\nTATGATGTTTATCATTTGAGAAGCTTTCTCTTCCAAACGCCTACCTCTACCTGCAGAAATAACCCCAACT\nAATGTAGAATGATTATACGGATCCCCCCCACCTTACCACACTTTTGAGGAAGCCGTTTACTCCAAAATT
# 5) Final a couple of new line '\n' at the end of the vector
# \n\n
2) We can explore the character vector froggy_seqs_fasta
with several functions.
We can check the length of this vector (i.e., how many elements this vector contains) with the function length()
. It should just 1 element, the COX1 of Ameerega bilinguis.
length(froggy_seqs_fasta)
#[1] 1
We can explore other properties of this vector using the functions class()
and str()
.
class(froggy_seqs_fasta)
#[1] "character"
str(froggy_seqs_fasta)
#chr ">MW042030.1 Ameerega bilinguis voucher QCAZ28835 cytochrome oxidase subunit I (COI) gene, partial cds; mitochon"| __truncated__
We can also test if this vector is character vector using the function is.character()
. The result is a logical TRUE or FALSE. We expect to be TRUE.
is.character(froggy_seqs_fasta)
#[1] TRUE
We can also determine the actual number characters in this vector (i.e., the total number of letters, symbols, spaces, numbers) using the function nchar()
.
nchar(froggy_seqs_fasta)
#[1] 1679
3) We see that the vector froggy_seqs_fasta
has other text than the actual nucleotide sequence. We will extract this sequence using a combination of functions that work with character vectors.
First, we can split this vector by the new line ‘’ using the function strsplit()
and test what type of data structure we obtain using class()
.
<- strsplit(froggy_seqs_fasta, split = '\n')
froggy_seq_split
froggy_seq_split#[[1]]
# [1] ">MW042030.1 Ameerega bilinguis voucher QCAZ28835 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial"
# [2] "GTGATAATTACTCGATGATTATTTTCTACCAACCACAAAGACATCGGAACTTTATACCTAGTGTTTGGGG"
# [3] "CATGAGCAGGCATAGTCGGCACTGCTCTTAGCCTTTTAATTCGAGCCGAATTAAGCCAGCCCGGGTCCTT"
# [4] "ACTAGGCGATGACCAGATCTACAACGTTATTGTTACCGCCCATGCTTTCGTTATAATCTTTTTTATAGTA"
# [5] "ATGCCAATTCTAATCGGTGGCTTTGGGAATTGATTAGTGCCCCTAATAATTGGAGCCCCAGACATAGCTT"
# [6] "TTCCCCGAATAAACAATATGAGCTTTTGGCTTCTTCCCCCCTCTTTCCTACTACTCCTAGCATCCGCAGG"
# [7] "CGTTGAAGCAGGCGCCGGTACTGGCTGAACTGTGTACCCTCCCCTTGCAGGCAACCTAGCTCATGCTGGC"
# [8] "CCATCAGTTGATTTAACTATTTTTTCACTTCATCTCGCCGGTGTTTCTTCTATTCTAGGGGCAATTAACT"
# [9] "TTATTACAACAACCTTAAACATAAAACCCCCTTCATTAACACAATATCAAACCCCATTATTTGTCTGATC"
#[10] "TGTATTAATTACTGCAGTCCTTCTTCTTCTCTCCCTCCCAGTTCTGGCTGCCGGAATCACTATACTCTTG"
#[11] "ACTGACCGAAACCTAAACACCACCTTCTTTGACCCAGCAGGTGGAGGCGACCCTGTCCTGTACCAACACC"
#[12] "TGTTCTGATTCTTTGGTCACCCCGAAGTCTACATCCTTATCCTGCCTGGATTTGGTATCATCTCCCATGT"
#[13] "TGTCACATTCTACTCTAGCAAAAAAGAACCCTTCGGCTATATAGGAATAGTCTGAGCTATAATATCGATT"
#[14] "GGTCTCCTAGGTTTCATTGTTTGAGCTCACCACATATTCACAACAGACCTTAATGTAGACACTCGAGCCT"
#[15] "ACTTTACCTCAGCTACTATAATCATCGCTATCCCAACAGGTGTCAAAGTCTTTAGCTGACTTGCCACCAT"
#[16] "GCACGGAGGAATTATTAAATGAGACGCCGCCATATTATGGGCTCTCGGATTCATCTTTTTATTTACAGTT"
#[17] "GGAGGACTAACTGGAATCGTTTTAGCCAACTCCTCTTTAGACATTGTTTTGCATGATACATATTATGTAG"
#[18] "TAGCCCACTTTCACTACGTTCTTTCTATGGGGGCAGTATTTGCCATTATAGCCGGCTTCGTACACTGATT"
#[19] "TCCTCTCTTTTCCGGATTTACCCTTCATGAAGCCTGAACAAAAATTCAATTTGGCGTCATATTTACCGGC"
#[20] "GTAAATTTAACATTCTTCCCCCAGCATTTCTTAGGTCTCGCAGGCATGCCTCGACGTTATTCAGACTACC"
#[21] "CTGACGCCTACACATTATGAAACACCGTTTCATCAATCGGCTCTTTAATCTCTCTAGTTGCAGTAATCAT"
#[22] "TATGATGTTTATCATTTGAGAAGCTTTCTCTTCCAAACGCCTACCTCTACCTGCAGAAATAACCCCAACT"
#[23] "AATGTAGAATGATTATACGGATCCCCCCCACCTTACCACACTTTTGAGGAAGCCGTTTACTCCAAAATT"
#[24] ""
class(froggy_seq_split)
#[1] "list"
As you can see, the object is a list and its first element is a vector with 24 elements. The sequence of COX1 is contained in 22 elements with indices between [2]
and [23]
. We can paste all of these in a single character vector.
For this purpose, we will use the function paste()
or its variant paste0()
. Notice how I indexed the elements that I want it to be collapsed in a single character vector using the argument collapse = ""
(i.e., ""
indicates nothing). The [[1]]
indicates that we want the first element of a list and the [2:23]
indicates that we want the vector elements starting from [2]
to [23]
(i.e., 2, 3, 4, 5, …, 23).
<- paste(froggy_seq_split[[1]][2:23], collapse = "")
froggy_one_fasta
froggy_one_fasta#[1] "GTGATAATTACTCGATGATTATTTTCTACCAACCACAAAGACATCGGAACTTTATACCTAGTGTTTGGGGCATGAGCAGGCATAGTCGGCACTGCTCTTAGCCTTTTAATTCGAG..."
We can also name this vector with the corresponding accession number using the function names()
.
names(froggy_one_fasta) <- froggy_seq_split[[1]][1]
froggy_one_fasta# >MW042030.1 Ameerega bilinguis voucher QCAZ28835 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
#"GTGATAATTACTCGATGATTATTTTCTACCAACCACAAAGACATCGGAACTTTATACCTAGTGTTTGGGGCATGAGCAGGCATAGTCGGCACTGCTCTTAGCCTTTTAATTCGAGCCGAATT..."
4) We can make this vector more interesting by adding more COX1 sequences from three species with a similar procedure.
library(rentrez)
# if you need to install R-package 'rentrez', use install.packages("rentrez")
# Download more COX1 nucleotide sequences from NCBI for 3 more species (i.e., three terms)
<- c("Ameerega parvula[Organism] AND COI[Gene]",
several_sepecies_COX1 "Epipedobates anthonyi[Organism] AND COI[Gene]",
"Oophaga sylvatica[Organism] AND COI[Gene]")
We can only retrieve one term a time, we will use a loop to iterate between these three terms and save in an empty list my_collected_sequences
. More on loops later.
# we create an empty list to collect processed sequences
<- list()
my_collected_sequences
# the loop to retrieve sequences
for(i in 1:length(several_sepecies_COX1)) {
<- entrez_search(db="nuccore", term= several_sepecies_COX1[i])
one_seq_IDs <- entrez_fetch(db="nuccore", id=one_seq_IDs$ids, rettype="fasta")
one_seqs_fasta <- one_seqs_fasta
my_collected_sequences[[i]]
}
# we have a list 'my_collected_sequences' of 3 elements; one per species.
length(my_collected_sequences)
#[1] 3
5) We need to process these retrieved sequences to the format that only contains the nucleotides and the GenBank accession numbers. We will start with the my_collected_sequences[[2]]
that has two concatenated sequences and the create a function to process this automatically.
Notice that my_collected_sequences[[2]]
has two fasta sequences named MW042033.1
and DQ502853.1
.
2]]
my_collected_sequences[[#[1] ">MW042033.1 Epipedobates anthonyi voucher QCAZ16597 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial\nGTGATAA
#TTACCCGATGATTATTCTCCACAAACCATAAAGATATCGGAACCCTATATTTAGTATTTGGGG\nCATGAGCCGGAATAGTAGGAACAGCCCTAAGCCTCTTAATTCGAGCAGAATTAAGTCAACCTGGGTCC
#TT\nACTGGGCGATGACCAGATTTATAATGTAATCGTAACTGCCCATGCTTTCGTCATAATCTTCTTCATGGTT\nATGCCTATTCTAATTGGTGGATTTGGAAATTGACTGGTACCTTTAATAATTGGAGCC
#CCAGACATAGCCT\nTTCCCCGAATAAATAATATAAGTTTTTGACTTCTTCCGCCCTCATTCCTTCTTCTCTTAGCATCTGCAGG\nAGTAGAAGCAGGAGCTGGGACAGGTTGAACTGTTTATCCTCCTCTT
#GCTGGTAACCTAGCTCATGCAGGC\nCCATCAGTTGATTTAACCATTTTTTCACTCCATCTTGCAGGGGTCTCTTCAATTCTAGGAGCAATTAATT\nTTATTACAACCACATTAAACATAAAACCTCCTTCT
#TTAACTCAATACCAAACTCCTTTATTTGTCTGATC\nAGTTCTAATCACCGCAGTACTTCTTCTTCTTTCTCTTCCTGTCCTAGCTGCAGGAATTACAATGCTTTTA\nACTGATCGAAACCTTAACACCACT
#TTCTTTGACCCAGCAGGTGGAGGTGACCCCATTCTTTATCAACATC\nTCTTCTGATTTTTTGGACACCCAGAAGTTTATATTCTTATTCTCCCAGGGTTTGGAATTATCTCTCATGT\nTGTAACTTTTTAT
#TCTAGCAAAAAAGAACCTTTTGGATATATAGGAATGGTCTGAGCTATAATATCAATT\nGGCCTCTTAGGTTTTATTGTTTGAGCTCACCATATATTTACTACTGATCTTAACGTTGACACTCGAGCTT\nAC
#TTTACCTCAGCTACTATAATTATTGCCATCCCAACAGGCGTAAAAGTCTTTAGCTGACTTGCCACCAT\nGCATGGAGGAATTATTAAATGAGATGCTGCTATACTTTGAGCCTTAGGCTTTATTTTCTTGTT
#TACTGTT\nGGTGGACTAACTGGTATCGTTTTAGCTAATTCCTCTTTGGATATTGTTCTTCATGATACCTACTACGTTG\nTAGCCCACTTCCATTATGTTCTTTCTATAGGAGCAGTATTTGCCATTATAGC
#AGGATTTGTTCATTGATT\nTCCCTTATTTACCGGTTTTACTCTTCATGAAGCCTGAACAAAGATTCATTTTGGCGTCATATTTGCTGGT\nGTAAATTTAACTTTCTTCCCCCAACATTTTTTAGGATTAGC
#AGGAATACCTCGACGCTATTCAGATTATC\nCTGATGCCTACACCTTATGAAACACCGTTTCTTCTGTTGGCTCCCTAATCTCCCTTGTTGCAGTAATCAT\nCATGATATTCATTATTTGAGAAGCTTTTTC
#ATCTAAACGTCTATTTTTGAATGCAGAAATAACCCCTACT\nAATGTTGAATGATTATATGGCTCCCCTCCTCCTTATCACACATTTGAGGAAGCTGTTTATTCTAAAGTA\n\n
#>DQ502853.1 Epipedobates anthonyi isolate 838 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial\nAACCCTATATTTAGTATTT
#GGGGCATGAGCCGGAATAGTAGGAACAGCCCTAAGCCTCTTAATTCGAGCA\nGAATTAAGTCAACCTGGGTCCTTACTGGGCGATGACCAGATTTATAATGTAATCGTAACTGCCCATGCTT\nTCGTCATAA
#TCTTCTTCATGGTTATGCCTATTCTAATTGGTGGATTTGGAAATTGACTGGTACCTTTAAT\nAATTGGAGCCCCAGACATAGCCTTTCCCCGAATAAATAATATAAGTTTTTGACTTCTTCCGCCCTCATTC\
#nCTTCTTCTCTTAGCATCTACAGGAGTAGAAGCAGGAGCTGGGACAGGTTGAACTGTTTATCCTCCTCTTG\nCTGGTAACCTAGCTCATGCAGGCCCATCAGTTGATTTAACCATTTTTTCACTCCATCTTGC
#AGGGGTCTC\nTTCAATTCTAGGAGCAATTAATTTTATTACAACCACATTAAACATAAAACCTCCTTCTTTAACTCAATAC\nCAAACTCCTTTATTTGTCTGATCAGTTCTAATCACCGCAGTACTTCTTCTT
#CTTTCTCTTCCTGTCCTAG\nCTGCAGGAATTACAATGCTTTTAACTGATCGAAACCTTAACACCACTTTCTTTGACCCAGCAGGTGGAGG\nTGACCCCATTCTTTATCAACATCTCTTC\n\n"
We will split this long character vector by the unique character that defines a new fasta entry which is indicated by the >
.
<- my_collected_sequences[[2]]
one_species_all_fasta <- strsplit(one_species_all_fasta, split = '>')
one_species_split_by_fasta
one_species_split_by_fasta#[[1]]
#[2] "MW042033.1 Epipedobates anthonyi voucher QCAZ16597 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial\nGTGATAATTA...
#[3] "DQ502853.1 Epipedobates anthonyi isolate 838 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial\nAACCCTATATTTAGTA...
We need to remove these character vectors from the list using the function unlist()
.
<- unlist(one_species_split_by_fasta)
one_species_split_by_fasta_vector
one_species_split_by_fasta_vector
#[2] "MW042033.1 Epipedobates anthonyi voucher QCAZ16597 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial\nGTGATAATTA...
#[3] "DQ502853.1 Epipedobates anthonyi isolate 838 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial\nAACCCTATATTTAGTA...
We can also remove the empty element [1] ""
with the application of the logical condition x[x != ""]
where x
is our vector.
<- one_species_split_by_fasta_vector[one_species_split_by_fasta_vector != ""]
one_species_split_by_fasta_vector
one_species_split_by_fasta_vector#[1] "MW042033.1 Epipedobates anthonyi voucher QCAZ16597 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial\nGTGATAATTA...
#[2] "DQ502853.1 Epipedobates anthonyi isolate 838 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial\nAACCCTATATTTAGTA...
We can now process each of these element as we did on step 3), but using a loop.
# we create an empty list to collect processed sequences
<- list()
one_species_processed_seq_list
# the loop to process sequences
for(i in 1:length(one_species_split_by_fasta_vector)) {
# split and unlist
<- unlist(strsplit(one_species_split_by_fasta_vector[i], split = '\n'))
one_species_clean_data # remove empty element ""
<- one_species_clean_data[one_species_clean_data != ""]
one_species_clean_data # we know the first element will be always the accession number and the rest sequence but the length migh vary so we use length() to determine that.
<- paste(one_species_clean_data[2:length(one_species_clean_data)], collapse = "")
one_species_only_seq # we name that sequence with the corresponding accession number on the first element of 'one_species_clean_data'.
names(one_species_only_seq) <- one_species_clean_data[1]
# finally, collected the processed sequences
<- one_species_only_seq
one_species_processed_seq_list[[i]]
}
# we the unlist this final output for species to a vector
<- unlist(one_species_processed_seq_list)
one_species_end
one_species_end#MW042033.1 Epipedobates anthonyi voucher QCAZ16597 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
#"GTGATAATTACCCGATGATTATTCTCCACAAACCATAAAGATATCGGAACCCTATATTTAGTATTTGGGGCATGAGCCGGAATAGTAGGAACAGCCCTAAGCCTCTTAATTCGAGCAGAATT..."
#DQ502853.1 Epipedobates anthonyi isolate 838 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
#"AACCCTATATTTAGTATTTGGGGCATGAGCCGGAATAGTAGGAACAGCCCTAAGCCTCTTAATTCGAGCAGAATTAAGTCAACCTGGGTCCTTACTGGGCGATGACCAGATTTATAATGTAA..."
However, this was just applied to a single species, but we have three. Now, we will generalize this process into single process that show the beauty of automatization with R.
6) A generalization to create this sequence vectors with all previous steps together will use a more complex loop with two nested indices j
and i
. Yet, you are welcome to explore alternatives.
<- list()
list_final_fasta
for(j in 1:length(my_collected_sequences)) {
# j <- 3
<- my_collected_sequences[[j]]
one_species_all_fasta <- strsplit(one_species_all_fasta, split = '>')
one_species_split_by_fasta <- unlist(one_species_split_by_fasta)
one_species_split_by_fasta_vector <- one_species_split_by_fasta_vector[one_species_split_by_fasta_vector != ""]
one_species_split_by_fasta_vector
<- list()
one_species_processed_seq_list
for(i in 1:length(one_species_split_by_fasta_vector)) {
<- unlist(strsplit(one_species_split_by_fasta_vector[i], split = '\n'))
one_species_clean_data <- one_species_clean_data[one_species_clean_data != ""]
one_species_clean_data <- paste(one_species_clean_data[2:length(one_species_clean_data)], collapse = "")
one_species_only_seq names(one_species_only_seq) <- one_species_clean_data[1]
<- one_species_only_seq
one_species_processed_seq_list[[i]]
}
<- unlist(one_species_processed_seq_list)
list_final_fasta[[j]]
}
<- unlist(list_final_fasta)
final_fasta_vector
final_fasta_vector#MW042033.1 Epipedobates anthonyi voucher QCAZ16597 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
#"GTGATAATTACCCGATGATTATTCTCCACAAACCATAAAGATATCGGAACCCTATATTTAGTATTTGGGGCATGAGCCGGAATAGTAGGAACAGCCCTAAGCCTCTTAATTCGAGCAGAATT..."
#DQ502853.1 Epipedobates anthonyi isolate 838 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
#"AACCCTATATTTAGTATTTGGGGCATGAGCCGGAATAGTAGGAACAGCCCTAAGCCTCTTAATTCGAGCAGAATTAAGTCAACCTGGGTCCTTACTGGGCGATGACCAGATTTATAATGTAA..."
#...
7) We can append to the top the first processed sequence on step 3) to this long vector.
<- c(froggy_one_fasta, final_fasta_vector)
final_fasta_vector
final_fasta_vector# >MW042030.1 Ameerega bilinguis voucher QCAZ28835 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
#"GTGATAATTACTCGATGATTATTTTCTACCAACCACAAAGACATCGGAACTTTATACCTAGTGTTTGGGGCATGAGCAGGCATAGTCGGCACTGCTCTTAGCCTTTTAATTCGAGCCGAATT..."
#MW042033.1 Epipedobates anthonyi voucher QCAZ16597 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
#"GTGATAATTACCCGATGATTATTCTCCACAAACCATAAAGATATCGGAACCCTATATTTAGTATTTGGGGCATGAGCCGGAATAGTAGGAACAGCCCTAAGCCTCTTAATTCGAGCAGAATT..."
#DQ502853.1 Epipedobates anthonyi isolate 838 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
#"AACCCTATATTTAGTATTTGGGGCATGAGCCGGAATAGTAGGAACAGCCCTAAGCCTCTTAATTCGAGCAGAATTAAGTCAACCTGGGTCCTTACTGGGCGATGACCAGATTTATAATGTAA..."
#...
8) We will explore some properties of the resulting character vector.
We can get how many elements (sequences) we have in the vector with the function length()
. It has 24 sequences.
length(final_fasta_vector)
1] 24 [
We can get the number of characters (i.e., letters or nucleotides) on each sequence the function nchar()
.
nchar(final_fasta_vector)
#>MW042030.1 Ameerega bilinguis voucher QCAZ28835 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
# 1539
# MW042032.1 Ameerega parvula voucher QCAZ16584 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
# 1539
#MW042033.1 Epipedobates anthonyi voucher QCAZ16597 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
# 1539
# DQ502853.1 Epipedobates anthonyi isolate 838 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
# 658
# MH589864.1 Oophaga sylvatica isolate CH2 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
# 523
#...
We can get change the case characters using the function tolower()
that converts any upper case characters into lower case. The opposite procedure can be achieved with the function toupper()
that converts any lower case characters into upper case.
tolower(final_fasta_vector)[1]
#>MW042030.1 Ameerega bilinguis voucher QCAZ28835 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
#"gtgataattactcgatgattattttctaccaaccacaaagacatcggaactttatacctagtgtttggggcatgagcaggcatagtcggcactgctcttagcctttt..."
toupper(final_fasta_vector)[1]
#>MW042030.1 Ameerega bilinguis voucher QCAZ28835 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
#"GTGATAATTACTCGATGATTATTTTCTACCAACCACAAAGACATCGGAACTTTATACCTAGTGTTTGGGGCATGAGCAGGCATAGTCGGCACTGCTCTTAGCCTTTTA..."
9) We can split the character vector into chucks of a defined length using following function.
We want to split the vector into individual nucleotides.
<- final_fasta_vector[1]
one_sequence names(one_sequence) <- NULL
<- unlist(strsplit(one_sequence, split = ""))
one_sequence_nucleotides
one_sequence_nucleotides# [1] "G" "T" "G" "A" "T" "A" "A" "T" "T" "A" "C" "T" "C" "G" "A" "T" "G" "A" "T" "T" "A" "T" "T" "T" "T" "C" "T" "A" "C" "C" "A" "A"
# [33] "C" "C" "A" "C" "A" "A" "A" "G" "A" "C" "A" "T" "C" "G" "G" "A" "A" "C" "T" "T" "T" "A" "T" "A" "C" "C" "T" "A" "G" "T" "G" "T"
# [65] "T" "T" "G" "G" "G" "G" "C" "A" "T" "G" "A" "G" "C" "A" "G" "G" "C" "A" "T" "A" "G" "T" "C" "G" "G" "C" "A" "C" "T" "G" "C" "T"
#...
We can also split one continuous sequence into codons using a the function sapply()
. This is a very useful function to process a function into recursively loop applied to a vector.
<- final_fasta_vector[1]
one_sequence names(one_sequence) <- NULL
<- sapply(seq(from=1, to=nchar(one_sequence), by=3),
one_sequence_codons function(i) substr(one_sequence, i, i+2))
one_sequence_codons# [1] "GTG" "ATA" "ATT" "ACT" "CGA" "TGA" "TTA" "TTT" "TCT" "ACC" "AAC" "CAC" "AAA" "GAC" "ATC" "GGA" "ACT" "TTA" "TAC" "CTA" "GTG"
#[22] "TTT" "GGG" "GCA" "TGA" "GCA" "GGC" "ATA" "GTC" "GGC" "ACT" "GCT" "CTT" "AGC" "CTT" "TTA" "ATT" "CGA" "GCC" "GAA" "TTA" "AGC"
#[43] "CAG" "CCC" "GGG" "TCC" "TTA" "CTA" "GGC" "GAT" "GAC" "CAG" "ATC" "TAC" "AAC" "GTT" "ATT" "GTT" "ACC" "GCC" "CAT" "GCT" "TTC"
#...
10) We can transform some characters and replace between characters using the function chartr()
. We can try to replace thymine (T) with uracil (U) for DNA to RNA sequences.
chartr("T", "U", final_fasta_vector[1])
#>MW042030.1 Ameerega bilinguis voucher QCAZ28835 cytochrome oxidase subunit I (COI) >MW042030.1 Ameerega bilinguis voucher QCAZ28835 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
#"GUGAUAAUUACUCGAUGAUUAUUUUCUACCAACCACAAAGACAUCGGAACUUUAUACCUAGUGUUUGGGGCAUGAGCAGGCAUAGUCGGCACUGCUCUUAGCCUUUUAAUUCG..."
4.2 Regular expressions (Regex)
Finding patterns of characters in strings can be done in many different ways, some are extremely useful, and many cases are tedious. In general, they follow a similar procedure that starts with finding a pattern and indexing it, replacing this pattern with other string, testing for the presence/absence of the pattern, and extracting characters nearby or between such known patterns. However, they can be hard to implement, and I will present some few applications that I find useful for bioinformatics while working with sequences.
1) Finding patterns with grep()
and grepl()
. These functions will search for matches to an argument pattern
within each element of a character vector.
We can try to find if any of our sequences have a particular set of characters (i.e., nucleotide sequences) and report which element has them within the final_fasta_vector
.
The result below indicate that only sequence with index [1]
and [2]
have such pattern.
<- "ACTGGCTGAACTGTGTACCC"
my_pattern <- grep(pattern = my_pattern, final_fasta_vector, ignore.case = FALSE, fixed = FALSE, invert = FALSE)
what_sequences
what_sequences#[1] 1 2
final_fasta_vector[what_sequences]#>MW042030.1 Ameerega bilinguis voucher QCAZ28835 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
"GTGATAATTACTCGATGATTATTTTCTACCAACCACAAAGACATCGGAACTTTATACCTAGTGTTTGGGGCATGAGCAGGCATAGTCGGCACTGCTCTTAGCCTTTTAATTCGAGCCGAA..."
#MW042032.1 Ameerega parvula voucher QCAZ16584 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
#"GTGATAATTACTCGATGATTATTTTCCACCAACCATAAAGATATTGGAACCTTATACCTAGTGTTTGGGGCGTGAGCGGGCATAGTCGGCACTGCTCTTAGCCTCTTAATTCGAGCCGAG..."
A similar result can be obtained with grepl()
, but it will be a logical vector and only those elements that match will have a TRUE value and are returned when what_sequences
is used to index the final_fasta_vector
.
<- "ACTGGCTGAACTGTGTACCC"
my_pattern <- grepl(pattern = my_pattern, final_fasta_vector, ignore.case = FALSE, fixed = FALSE)
what_sequences
what_sequences#[1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#[23] FALSE FALSE
final_fasta_vector[what_sequences]#>MW042030.1 Ameerega bilinguis voucher QCAZ28835 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
#"GTGATAATTACTCGATGATTATTTTCTACCAACCACAAAGACATCGGAACTTTATACCTAGTGTTTGGGGCATGAGCAGGCATAGTCGGCACTGCTCTTAGCCTTTTAATTCGAGCCGAA..."
#MW042032.1 Ameerega parvula voucher QCAZ16584 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
#"GTGATAATTACTCGATGATTATTTTCCACCAACCATAAAGATATTGGAACCTTATACCTAGTGTTTGGGGCGTGAGCGGGCATAGTCGGCACTGCTCTTAGCCTCTTAATTCGAGCCGAG..."
2) Replacing patterns with sub()
and gsub()
. These functions will search for matches to an argument pattern
and replace
them with a text that you must provide. The sub()
will replace the first occurrence leaving the others as they are. On the other hand, the gsub()
will replace all the strings or values with the input pattern.
<- "CAAACAAA"
my_pattern <-"!!!found_it!!!"
my_replacement <- sub(pattern = my_pattern,
fasta_vector_replacements replacement = my_replacement,
final_fasta_vector)
fasta_vector_replacements#...
#DQ502853.1Epipedobatesanthonyiisolate838cytochromeoxidasesubunitI(COI)gene,partialcds;mitochondrial
#"AACCCTATATTTAGTATTTGGGGCATGAGCCGGAATAGTAGGAACAGCCCTAAGCCTCTTAATTCGAGCAGAATTAAGTCAACCTGGGTCCTTACT..."
#MH589864.1OophagasylvaticaisolateCH2cytochromeoxidasesubunitI(COI)gene,partialcds;mitochondrial
#"TTACATTCTCATCCTCCCAGGCTTCGGAATCATCTCCCATGTAGTCACGTTTTACT!!!found_it!!!AAAGAGCCATTTGGCTACATAGGAAT..."
#MH589863.1OophagasylvaticaisolateCH1cytochromeoxidasesubunitI(COI)gene,partialcds;mitochondrial
#"TTACATTCTCATCCTCCCAGGCTTCGGAATCATCTCTCATGTAGTCACGTTTTACT!!!found_it!!!AAAGAGCCATTTGGCTACATAGGAAT..."
#MH589862.1OophagasylvaticaisolateMC2cytochromeoxidasesubunitI(COI)gene,partialcds;mitochondrial
#"TTACATTCTCATCCTCCCAGGGTTCGGAATCATCTCCCATGTAGTCACGTTTTACT!!!found_it!!!AAAGAGCCATTTGGTTACATAGGAAT..."
#...
# using gsub
<- gsub(pattern = my_pattern,
fasta_vector_replacements replacement = my_replacement,
final_fasta_vector)
fasta_vector_replacements#...
#DQ502853.1Epipedobatesanthonyiisolate838cytochromeoxidasesubunitI(COI)gene,partialcds;mitochondrial
#"AACCCTATATTTAGTATTTGGGGCATGAGCCGGAATAGTAGGAACAGCCCTAAGCCTCTTAATTCGAGCAGAATTAAGTCAACCTGGGTCCTTACT..."
#MH589864.1OophagasylvaticaisolateCH2cytochromeoxidasesubunitI(COI)gene,partialcds;mitochondrial
#"TTACATTCTCATCCTCCCAGGCTTCGGAATCATCTCCCATGTAGTCACGTTTTACT!!!found_it!!!AAAGAGCCATTTGGCTACATAGGAAT..."
#MH589863.1OophagasylvaticaisolateCH1cytochromeoxidasesubunitI(COI)gene,partialcds;mitochondrial
#"TTACATTCTCATCCTCCCAGGCTTCGGAATCATCTCTCATGTAGTCACGTTTTACT!!!found_it!!!AAAGAGCCATTTGGCTACATAGGAAT..."
#MH589862.1OophagasylvaticaisolateMC2cytochromeoxidasesubunitI(COI)gene,partialcds;mitochondrial
#"TTACATTCTCATCCTCCCAGGGTTCGGAATCATCTCCCATGTAGTCACGTTTTACT!!!found_it!!!AAAGAGCCATTTGGTTACATAGGAAT..."
#...
3) We can match multiple patterns using the |
that signifies or
. We will use grepl()
to exemplify its use.
Here we are testing if it matches any of two patterns. These might not need to be adjacent (i.e., there are some chucks of sequence between them).
<- c("TCATCTCCCATGTA", "GTGATAATTACTCGATGATTATTTT")
my_pattern <- paste0(my_pattern, collapse = "|")
my_pattern
my_pattern#[1] "TCATCTCCCATGTA|GTGATAATTACTCGATGATTATTTT"
<- grepl(my_pattern, final_fasta_vector, ignore.case = FALSE, fixed = FALSE)
what_sequences
what_sequences#[[1] TRUE TRUE FALSE FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#[23] TRUE TRUE
Here we are testing if the vector elements match both patterns at the same time. These are not necessarily adjacent (i.e., there could be a chuck sequence between them). We can use wildcard
character .
(i.e., it represents any character except \n
, which is a new line) and *
is a greedy character (i.e., the asterisk will always match the longest possible string).
<- c("GTGATAATTACTCGATGATTATTTTCTA", "ACTTTATACCTAGTGTTTGGGGCA")
my_pattern <- paste0(my_pattern, collapse = ".*")
my_pattern
my_pattern#[1] "GTGATAATTACTCGATGATTATTTTCTA.*ACTTTATACCTAGTGTTTGGGGCA"
<- grepl(my_pattern, final_fasta_vector, ignore.case = FALSE, fixed = FALSE)
what_sequences
what_sequences# [1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#[23] FALSE FALSE
Notice that we can overrun the wildcard
nature of these characters by changing the argument fixed = TRUE
. This means a literal match to .*
in the sequence.
<- c("GTGATAATTACTCGATGATTATTTTCTA", "ACTTTATACCTAGTGTTTGGGGCA")
my_pattern <- paste0(my_pattern, collapse = ".*")
my_pattern
my_pattern#[1] "GTGATAATTACTCGATGATTATTTTCTA.*ACTTTATACCTAGTGTTTGGGGCA"
<- grepl(my_pattern, final_fasta_vector, ignore.case = FALSE, fixed = TRUE)
what_sequences
# This result in all cases as FALSE because we fixed to match EXACTLY the text pattern "GTGATAATTACTCGATGATTATTTTCTA.*ACTTTATACCTAGTGTTTGGGGCA"
what_sequences# [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#[23] FALSE FALSE
4) We can use some characters to match only numbers, only lower case letters or special symbols. These can be used to find or replace text after or before these special characters.
To illustrate some of their use we will construct a vector with random text. We will use the function cat()
to concatenate and print this vector.
<- c("text_file.txt", "text_file.csv", "text_file2.txt", "multiple\ndots\nin\nthis.line", "multiple\tdots\tin\tthis.line", "multiple.dots.in.this.line", "CaPiTaL_123_NYC")
my_random_vector
my_random_vector#[1] "text_file.txt" "text_file.csv" "text_file2.txt" "multiple\ndots\nin\nthis.line"
#[5] "multiple\tdots\tin\tthis.line" "multiple.dots.in.this.line" "CaPiTaL_123_NYC"
cat(my_random_vector)
#text_file.txt text_file.csv text_file2.txt multiple
#dots
#in
#this.line multiple dots in this.line multiple.dots.in.this.line CaPiTaL_123_NYC
We can use gsub()
to replace both \n
(new line) and \t
(tabs) using |
that means an or
condition with a -
as a replacement. Compare output before and after the replacement.
<- gsub(pattern = "\n|\t", replacement = "-", my_random_vector)
my_vector_processed 4:5]
my_random_vector[#[1] "multiple\ndots\nin\nthis.line" "multiple\tdots\tin\tthis.line"
4:5]
my_vector_processed[#[1] "multiple-dots-in-this.line" "multiple-dots-in-this.line"
We can use gsub()
to replace all digits (i.e., 0 to 9) using [[:digit:]]
with -
replacement. Compare output before and after the replacement.
<- gsub(pattern = "[[:digit:]]", replacement = "-", my_random_vector)
my_vector_processed c(3,7)]
my_random_vector[#[1] "text_file2.txt" "CaPiTaL_123_NYC"
c(3,7)]
my_vector_processed[#[1] "text_file-.txt" "CaPiTaL_---_NYC"
We can use gsub()
to replace all non-digits using \\D
or [^[:digit:]]
with -
replacement. Notice the ^
that is used to negate (i.e., the opposite of digits, will be non-digits). Compare output before and after the replacement.
<- gsub(pattern = "\\D", replacement = "-", my_random_vector)
my_vector_processed
my_random_vector#[1] "text_file.txt" "text_file.csv" "text_file2.txt" "multiple\ndots\nin\nthis.line"
#[5] "multiple\tdots\tin\tthis.line" "multiple.dots.in.this.line" "CaPiTaL_123_NYC"
my_vector_processed#[1] "-------------" "-------------" "---------2----" "--------------------------"
#[5] "--------------------------" "--------------------------" "--------123----"
<- gsub(pattern = "[^[:digit:]]", replacement = "-", my_random_vector)
my_vector_processed
my_vector_processed#[1] "-------------" "-------------" "---------2----" "--------------------------"
#[5] "--------------------------" "--------------------------" "--------123----"
We can use gsub()
to replace all periods using[.]
with spaces using or with nothing as
""
. Compare output before and after the replacement.
<- gsub(pattern = "[.]", replacement = " ", my_random_vector)
my_vector_processed 6]
my_random_vector[#[1] "multiple.dots.in.this.line"
6]
my_vector_processed[#[1] "multiple dots in this line"
<- gsub(pattern = "[.]", replacement = "", my_random_vector)
my_vector_processed 6]
my_vector_processed[#[1] "multipledotsinthisline"
5) We can use some anchor characters ^
(for START of the string) and $
(for END of the string). These are especially characters if you want to replace or find a pattern like the extension of a file.
These strings have the text
pattern at the beginning.
<- grepl(pattern = "^text", my_random_vector, ignore.case = FALSE, fixed = FALSE)
these_strings
my_random_vector[these_strings]#[1] "text_file.txt" "text_file.csv" "text_file2.txt"
We can get the file names that end with .txt
pattern.
<- grepl(pattern = "*.txt$", my_random_vector, ignore.case = FALSE, fixed = FALSE)
these_are_txt_files
my_random_vector[these_are_txt_files]#[1] "text_file.txt" "text_file2.txt"
6) An interesting application of regular expressions is to extract text (e.g., nucleotide sequence) between two patterns. This is a powerful demonstration of grepl()
and gsub()
.
<- c("GTGATAATTACTCGATGATTATTTTCTA", "ACTTTATACCTAGTGTTTGGGGCA")
my_pattern <- paste0("^.*", my_pattern[1], "\\s*|\\s*", my_pattern[2], ".*$")
my_pattern
my_pattern#[1] "^.*GTGATAATTACTCGATGATTATTTTCTA\\s*|\\s*ACTTTATACCTAGTGTTTGGGGCA.*$"
<- grepl(pattern = my_pattern, final_fasta_vector)
is_the_pattern_present
is_the_pattern_present# [1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#[23] FALSE FALSE
<- final_fasta_vector[is_the_pattern_present]
sequences_that_have_pattern <- gsub(pattern = my_pattern, replacement = '', sequences_that_have_pattern)
my_extracted_sequences
my_extracted_sequences#>MW042030.1 Ameerega bilinguis voucher QCAZ28835 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
#"CCAACCACAAAGACATCGGA"
4.3 Comparing, ordering and simplifying vectors
1) There are several operations that can be applied to vectors that can help you manage the information contained in them. They are applied to one or two vectors. We will illustrate some examples.
We will build two relatively similar vectors.
<- c("Allobates", "Epipedobates", "Dendrobates", "Hyloxalus", "Mannophryne", "Hyloxalus", "Silverstoneia")
my_vector_1 <- c("Epipedobates", "Epipedobates","Dendrobates", "Hyloxalus", "Mannophryne", "Hyloxalus", "Rheobates") my_vector_2
We can remove duplicated entries using the function unique()
.
unique(my_vector_1)
#[1] "Allobates" "Epipedobates" "Dendrobates" "Hyloxalus" "Mannophryne" "Silverstoneia"
unique(my_vector_2)
#[1] "Epipedobates" "Dendrobates" "Hyloxalus" "Mannophryne" "Rheobates"
2) The union of elements of two vectors is performed with union()
. This function will result in a new vector that contain elements from both vectors even if they are not in common but discards any duplicated values.
union(my_vector_1, my_vector_2)
#[1] "Allobates" "Epipedobates" "Dendrobates" "Hyloxalus" "Mannophryne" "Silverstoneia" "Rheobates"
3) The intersection of elements of two vectors is determined with intersect()
. This function will result in a new vector that contain elements from both vectors, and these are elements in common, but discards any duplicated values.
intersect(my_vector_1, my_vector_2)
#[1] "Epipedobates" "Dendrobates" "Hyloxalus" "Mannophryne"
4) The function setdiff()
results in a vector of elements that has those that are not in common between the two vectors. However, it depends on vector that corresponds to x
on the function as setdiff(x, y)
.
This will return elements unique to my_vector_1
and not in common with my_vector_2
.
setdiff(my_vector_1, my_vector_2)
#[1] "Allobates" "Silverstoneia"
This will return elements unique to my_vector_2
and not in common with my_vector_1
.
setdiff(my_vector_2, my_vector_1)
#[1] "Rheobates"
5) The function setequal()
results in a logical test (i.e., TRUE or FALSE) that determines if these two vector contain the same elements.
setequal(my_vector_2, my_vector_1)
#[1] FALSE
6) The function is.element()
results in logical test (i.e., TRUE or FALSE) that determines if an element is present in a vector.
is.element(c("Allobates", "Rheobates"), my_vector_1)
#[[1] TRUE FALSE
is.element(c("Allobates", "Rheobates"), my_vector_2)
#[1] FALSE TRUE
A similar, but more compact, form is the function %in%
.
c("Allobates", "Rheobates") %in% my_vector_1
#[[1] TRUE FALSE
c("Allobates", "Rheobates") %in% my_vector_2
#[1] FALSE TRUE
7) We can organize the elements in a vector alphabetically or numerically using the function sort()
. This ordering is by default increasing, but you can order in decreasing fashion by changing the argument decreasing = TRUE
.
sort(my_vector_1)
#[1] "Allobates" "Dendrobates" "Epipedobates" "Hyloxalus" "Hyloxalus" "Mannophryne" "Silverstoneia"
sort(my_vector_1, decreasing = TRUE)
#[1] "Silverstoneia" "Mannophryne" "Hyloxalus" "Hyloxalus" "Epipedobates" "Dendrobates" "Allobates"
8) We can concatenate two vectors with the function c()
. We can also remove duplicated entries and order the final vector with the functions unique()
and sort()
.
<- sort(unique(c(my_vector_1,my_vector_2)))
my_vector_1_2
my_vector_1_2#[1] "Allobates" "Dendrobates" "Epipedobates" "Hyloxalus" "Mannophryne" "Rheobates" "Silverstoneia"
4.4 Useful functions of stringr
9) There are several specialized R-packages that deal with strings and more or less simplify what we have demonstrated above. However, it forces you to be dependent on loading these packages in the R environment before you can use them.
One of the most used is stringr that is part of the ‘tidyverse’ family. This package has an extensive manual and some functions can make some of the processing of string or character vectors easier.
# to install R-package stringr
install.packages("stringr")
library(stringr)
One useful function is str_count()
that can count the number of matches of a pattern in a string.
We can construct a data frame with the total counts of each nucleotide bases.
<- str_count(final_fasta_vector, pattern = "A")
adenine <- str_count(final_fasta_vector, pattern = "C")
cytocine <- str_count(final_fasta_vector, pattern = "G")
guanine <- str_count(final_fasta_vector, pattern = "T")
thymidine <- data.frame(n_adenine = adenine,
nucleotide_numbers n_cytocine = cytocine,
n_guanine = guanine,
n_thymidine = thymidine,
genbank_accession = names(final_fasta_vector),
stringsAsFactors = FALSE)
head(nucleotide_numbers)
# n_adenine n_cytocine n_guanine n_thymidine
#1 384 410 256 489
#2 369 399 273 498
#3 395 340 250 554
#4 169 161 108 220
#5 136 116 94 177
#6 137 117 92 177
# genbank_accession
#1 >MW042030.1 Ameerega bilinguis voucher QCAZ28835 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
#2 MW042032.1 Ameerega parvula voucher QCAZ16584 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
#3 MW042033.1 Epipedobates anthonyi voucher QCAZ16597 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
#4 DQ502853.1 Epipedobates anthonyi isolate 838 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
#5 MH589864.1 Oophaga sylvatica isolate CH2 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
#6 MH589863.1 Oophaga sylvatica isolate CH1 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
Another useful function is str_locate_all()
that can find where a pattern is located in the vector elements (i.e., the position of the characters that match the pattern).
We can construct a data frame with the count number of each nucleotide bases.
<- str_locate_all(final_fasta_vector, pattern = "CTAG")
restriction_sites_to_cut_with_BFAL names(restriction_sites_to_cut_with_BFAL) <- names(final_fasta_vector)
1:2]
restriction_sites_to_cut_with_BFAL[#`>MW042030.1 Ameerega bilinguis voucher QCAZ28835 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial`
# start end
#[1,] 58 61
#[2,] 142 145
#[3,] 337 340
#[4,] 406 409
#[5,] 475 478
#[6,] 785 788
#[7,] 847 850
#[8,] 1384 1387
#$`MW042032.1 Ameerega parvula voucher QCAZ16584 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial`
# start end
#[1,] 58 61
#[2,] 475 478
#[3,] 785 788
#[4,] 847 850
#[5,] 1024 1027
#[6,] 1072 1075
#[7,] 1384 1387
4.5 Data transformations
An important application of R in data analyses is transformations (e.g., one data structure into another or merging data frames).
1) Transforming vectors into other data structures (e.g., list or data frame) are usually straightforward.
From vector to list using the function as.list()
.
<- 1:10
my_vector
my_vector#[1] 1 2 3 4 5 6 7 8 9 10
<- as.list(my_vector)
my_list my_list
From list to vector using the function unlist()
.
unlist(my_list)
#[1] 1 2 3 4 5 6 7 8 9 10
From vector to a data frame using the function as.data.frame()
.
<- as.data.frame(my_vector, stringsAsFactors = FALSE)
my_data_frame head(my_data_frame)
# my_vector
#1 1
#2 2
#3 3
#4 4
#5 5
#6 6
2) Apply a function as a loop to each element of a list using the function lapply()
.
# create a list of two vectors of numbers
<- list(odd_numbers = c(1,3,5,7,9),
my_list2 prime_numbers = c(1,2,3,5,7,11,13,17))
my_list2#$odd_numbers
#[1] 1 3 5 7 9
#$prime_numbers
#[1] 1 2 3 5 7 11 13 17
Then, we can apply the function mean()
to both elements (the two numbers vectors) in my_list2
using the loop function lapply()
that it is used for lists.
lapply(my_list2, mean)
#$odd_numbers
#[1] 5
#$prime_numbers
#[1] 7.375
3) Apply a function to rows or columns of a data frame or matrix using the function apply()
.
We can create a data frame of numbers in columns.
<- data.frame(A = c(1,2,NA,3,4),
my_data_frame B = c(NA,4,1,67,-2),
C = c(-2,3,1,99,NA),
D = c(78,56,9,3,NA), stringsAsFactors = FALSE)
Then, we can add the name of the with the function rownames()
to each row of the data frame.
rownames(my_data_frame) <- c("row_1", "row_2", "row_3", "row_4", "row_5")
my_data_frame# A B C D
# row_1 1 NA -2 78
# row_2 2 4 3 56
# row_3 NA 1 1 9
# row_4 3 67 99 3
# row_5 4 -2 NA NA
Now we can apply a function max()
to every row. Notice the 1
in apply()
that denotes that it will get the maximum value per each individual row.
<- names(my_data_frame)
names_columns <- apply(my_data_frame,1,max)
max_values names(max_values) <- rownames(my_data_frame)
max_values#row_1 row_2 row_3 row_4 row_5
# NA 56 NA 99 NA
Notice that some values were NA
(no data), to remove those NA
while applying the function max()
, we need the argument na.rm = TRUE
.
<- apply(my_data_frame,1,max, na.rm = TRUE)
max_values_no_NA names(max_values_no_NA) <- rownames(my_data_frame)
max_values_no_NA#row_1 row_2 row_3 row_4 row_5
# 78 56 9 99 4
To apply a function max()
to every column, we will use 2
in apply()
.
<- apply(my_data_frame,2,max, na.rm = TRUE)
max_values_columns_no_NA names(max_values_columns_no_NA) <- names(my_data_frame)
max_values_columns_no_NA# A B C D
# 4 67 99 78
4) Here are some build-in and useful functions that can be applied to rows and columns.
colSums (my_data_frame, na.rm = TRUE)
# A B C D
# 10 70 101 146
rowSums (my_data_frame, na.rm = TRUE)
#row_1 row_2 row_3 row_4 row_5
# 77 65 11 172 2
colMeans(my_data_frame, na.rm = TRUE)
# A B C D
# 2.50 17.50 25.25 36.50
rowMeans(my_data_frame, na.rm = TRUE)
# row_1 row_2 row_3 row_4 row_5
#25.666667 16.250000 3.666667 43.000000 1.000000
5) We can apply a function to every cell in a data frame using the R-package purrr that is part of the ‘tidyverse’ family and its function modify()
.
# install R-package 'purrr' if not present
install.packages("purrr")
library(purrr)
We will apply a function that assigns 1
to zero and positive numbers in the data frame and 0
to negative numbers. Notice we typed purrr::modify
to indicate that we what the function modify()
only from the purrr
R-package. I also created a function that for every x
(a cell value) will determine if this x
is more or equal to 0
. This function will return 1
if TRUE or 0
if FALSE by using the function ifelse()
. This function has three parts: a conditional test, output_1 if TRUE and output_2 if FALSE.
<- purrr::modify(my_data_frame,function(x)ifelse(x>=0,1,0))
my_data_frame_1_0 rownames(my_data_frame_1_0) <- rownames(my_data_frame)
my_data_frame_1_0# A B C D
#row_1 1 NA 0 1
#row_2 1 1 1 1
#row_3 NA 1 1 1
#row_4 1 1 1 1
#row_5 1 0 NA NA
6) You can add a new column in a data frame to store the result of a calculation.
As an example, we will create a sum of all values per row names into a column named sum_of_my_rows
.
$sum_of_my_rows <- rowSums (my_data_frame, na.rm = TRUE)
my_data_frame
my_data_frame# A B C D sum_of_my_rows
#row_1 1 NA -2 78 77
#row_2 2 4 3 56 65
#row_3 NA 1 1 9 11
#row_4 3 67 99 3 172
#row_5 4 -2 NA NA 2
7) You can also add a new row to the data frame and store the result of a calculation using the function rbind()
.
<- colSums (my_data_frame, na.rm = TRUE)
my_column_sums # we add this row using rbind() and modify my_data_frame
<- rbind(my_data_frame, my_column_sums)
my_data_frame
my_data_frame# A B C D sum_of_my_rows
#row_1 1 NA -2 78 77
#row_2 2 4 3 56 65
#row_3 NA 1 1 9 11
#row_4 3 67 99 3 172
#row_5 4 -2 NA NA 2
#6 10 70 101 146 327
Next, we can update the names of the added rows following the format row_x
where x
is the number of the new row.
rownames(my_data_frame)[6] <- "row_6"
my_data_frame# A B C D sum_of_my_rows
#row_1 1 NA -2 78 77
#row_2 2 4 3 56 65
#row_3 NA 1 1 9 11
#row_4 3 67 99 3 172
#row_5 4 -2 NA NA 2
#row_6 10 70 101 146 327
8) Another very useful application that can modify data frames is the result of applying the function merge()
that can concatenate two data frames using a common column that serves as an index.
For example, we could merge my_data_frame
and my_data_frame_1_0
. For this purpose, we need to prepare to my_data_frame_1_0
to have the same number of rows and different column names.
First, add an extra row to my_data_frame_1_0
, so both data frames have the same number of rows with the same row names.
6,] <- c(NA,NA,NA,NA)
my_data_frame_1_0[rownames(my_data_frame_1_0)[6] <- "row_6"
my_data_frame_1_0# A B C D
#row_1 1 NA 0 1
#row_2 1 1 1 1
#row_3 NA 1 1 1
#row_4 1 1 1 1
#row_5 1 0 NA NA
#row_6 NA NA NA NA
Second, we will change the names of columns of my_data_frame_1_0
so these do not overlap with those of my_data_frame
.
<- names(my_data_frame_1_0)
names_my_data_frame_1_0 <- paste0(names_my_data_frame_1_0, "_as_1_0")
new_names_my_data_frame_1_0 names(my_data_frame_1_0) <- new_names_my_data_frame_1_0
my_data_frame_1_0# A_as_1_0 B_as_1_0 C_as_1_0 D_as_1_0
#row_1 1 NA 0 1
#row_2 1 1 1 1
#row_3 NA 1 1 1
#row_4 1 1 1 1
#row_5 1 0 NA NA
#row_6 NA NA NA NA
Third, we will create a common column with same name, i.e., index
, in both data frames that we want to merge. In this case using the names of the rows as cell values for the common column named index
.
$index <- rownames(my_data_frame)
my_data_frame$index <- rownames(my_data_frame_1_0) my_data_frame_1_0
Finally, we can now merge both data frames using the column named index
using merge()
with the argument by = 'index'
.
<- merge(my_data_frame, my_data_frame_1_0, by = 'index')
my_combined_data_frame
my_combined_data_frame# index A B C D sum_of_my_rows A_as_1_0 B_as_1_0 C_as_1_0 D_as_1_0
#1 row_1 1 NA -2 78 77 1 NA 0 1
#2 row_2 2 4 3 56 65 1 1 1 1
#3 row_3 NA 1 1 9 11 NA 1 1 1
#4 row_4 3 67 99 3 172 1 1 1 1
#5 row_5 4 -2 NA NA 2 1 0 NA NA
#6 row_6 10 70 101 146 327 NA NA NA NA