Session 3 – Basic Data Structures in R

Managing data is at the core of using R in bioinformatics analyses. You can enter data by typing it directly into the R console, but for most practical uses you will import a dataset from a file on your computer. Please refer to section 2.6 Importing and exporting data for specific examples. In this chapter we will work with both typed data and a dataset imported from the LINK TO FILES of BIO/BIT 209, downloaded to your computer as indicated in section 2.5 Downloading data files (GitHub).

# Here is an example dataset used throughout this chapter:

setwd("~/Desktop/Teach_R/class_datasets")

my_imported_dataset <- read.table (file = "~/Desktop/Teach_R/class_datasets/mtcars2_file_tab.txt", 
                                    header = TRUE, 
                                       sep = "\t",
                          stringsAsFactors = FALSE)

## This dataset will be imported as a data.frame
str(my_imported_dataset)
#'data.frame':  32 obs. of  12 variables:
# $ cars: chr  "Mazda_RX4" "Mazda_RX4_Wag" "Datsun_710" "Hornet_4_Drive" ...
# $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
# $ cyl : int  6 6 4 6 8 6 8 4 4 6 ...
# $ disp: num  160 160 108 258 360 ...
# $ hp  : int  110 110 93 110 175 105 245 62 95 123 ...
# $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
# $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
# $ qsec: num  16.5 17 18.6 19.4 17 ...
# $ vs  : int  0 0 1 1 0 1 0 1 1 1 ...
# $ am  : int  1 1 1 0 0 0 0 0 0 0 ...
# $ gear: int  4 4 4 3 3 3 3 4 4 4 ...
# $ carb: int  4 4 1 1 2 1 4 2 2 4 ...

head(my_imported_dataset)
#               cars  mpg cyl disp  hp drat    wt  qsec vs am gear carb
#1         Mazda_RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#2     Mazda_RX4_Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#3        Datsun_710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#4    Hornet_4_Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#5 Hornet_Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#6           Valiant 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Note: You are strongly encouraged to import your project dataset(s) into R to explore and transform them into the different types of data structures below.

3.1 Scalars and vectors

We already introduced vectors and scalars in the section “Your First R Session”. Here we describe vectors in more detail and continue exploring these structures and the functions that apply to them.

1) A numeric vector contains numbers. Notice the function c(), which combines such values into a vector.

Here is a typed example:

my_numeric_vector <- c(1,3,45,56,1)
my_numeric_vector
#[1]  1  3 45 56  1

Here is a vector derived from my_imported_dataset:

# Notice that to get the data of one column you add '$' followed by the name of the column
my_numeric_car_displacement <- my_imported_dataset$disp
my_numeric_car_displacement
#[1] 160.0 160.0 108.0 258.0 360.0 225.0 360.0 146.7 140.8 167.6 167.6 275.8 275.8 275.8 472.0 460.0 440.0  78.7  75.7  71.1 120.1
#[22] 318.0 304.0 350.0 400.0  79.0 120.3  95.1 351.0 145.0 301.0 121.0

The function str() can help to identify its structure. Notice the num text in the output, which indicates that this is a vector of numeric values.

str(my_numeric_vector)
#num [1:5] 1 3 45 56 1
str(my_numeric_car_displacement)
#num [1:32] 160 160 108 258 360 ...

Likewise, the function class() is also useful to characterize a vector.

class(my_numeric_vector)
#[1] "numeric"
class(my_numeric_car_displacement)
#[1] "numeric"

2) A character vector contains text strings. Note the quotation marks that enclose each string.

Here is a typed example:

my_character_vector <- c("my", "bioinformatics", "class")
my_character_vector
#[1] "my"             "bioinformatics" "class"         
str(my_character_vector)
#chr [1:3] "my" "bioinformatics" "class"
class(my_character_vector)
#[1] "character"

Here is a vector derived from my_imported_dataset:

my_character_car_names <- my_imported_dataset$cars
my_character_car_names
#[1] "Mazda_RX4"           "Mazda_RX4_Wag"       "Datsun_710"          "Hornet_4_Drive"      "Hornet_Sportabout"  
# [6] "Valiant"             "Duster_360"          "Merc_240D"           "Merc_230"            "Merc_280"           
#[11] "Merc_280C"           "Merc_450SE"          "Merc_450SL"          "Merc_450SLC"         "Cadillac_Fleetwood" 
#[16] "Lincoln_Continental" "Chrysler_Imperial"   "Fiat_128"            "Honda_Civic"         "Toyota_Corolla"     
#[21] "Toyota_Corona"       "Dodge_Challenger"    "AMC_Javelin"         "Camaro_Z28"          "Pontiac_Firebird"   
#[26] "Fiat_X1_9"           "Porsche_914_2"       "Lotus_Europa"        "Ford_Pantera_L"      "Ferrari_Dino"       
#[31] "Maserati_Bora"       "Volvo_142E" 
str(my_character_car_names)
#chr [1:32] "Mazda_RX4" "Mazda_RX4_Wag" "Datsun_710" "Hornet_4_Drive" "Hornet_Sportabout" "Valiant" "Duster_360" "Merc_240D" ...
class(my_character_car_names)
#[1] "character"

3) Notice that if you combine numbers with character elements, R coerces the whole vector to character, so the numbers are stored as text.

my_mix_vector <- c("my", "bioinformatics", "class", 3.141593)
my_mix_vector
#[1] "my"             "bioinformatics" "class"          "3.141593"      
str(my_mix_vector)
#chr [1:4] "my" "bioinformatics" "class" "3.141593"
class(my_mix_vector)
#[1] "character"
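
As a minimal sketch of this coercion (the object names below are illustrative and not part of the course dataset), R converts mixed elements to the most flexible type present, and as.numeric() can recover a number that was stored as text:

```r
# Mixing types: R coerces every element to the most flexible type present
mixed_logical_numeric <- c(TRUE, FALSE, 2.5)   # logical becomes numeric
mixed_numeric_text    <- c(1, 2, "three")      # numeric becomes character
class(mixed_logical_numeric)
#[1] "numeric"
class(mixed_numeric_text)
#[1] "character"

# A number stored as text can be converted back with as.numeric()
pi_as_text   <- "3.141593"
pi_as_number <- as.numeric(pi_as_text)
pi_as_number + 1
#[1] 4.141593
```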

4) A factor vector is similar to a character vector, but each unique element of the vector is assigned a level. To create one, we use the function factor(). Factor vectors are used in statistical analyses where discrete groups are defined by levels.

Here is a typed example:

my_factor_vector <- c("white", "black", "white", "white", "black")
my_factor_vector <- factor(my_factor_vector)
my_factor_vector
#[1] white black white white black
#Levels: black white
str(my_factor_vector)
#Factor w/ 2 levels "black","white": 2 1 2 2 1
class(my_factor_vector)
#[1] "factor"

You can also convert any numeric vector to a factor vector.

my_factor_vector <- c(1,0,1,1,1,0)
my_factor_vector <- factor(my_factor_vector)
my_factor_vector
#[1] 1 0 1 1 1 0
#Levels: 0 1
str(my_factor_vector)
#Factor w/ 2 levels "0","1": 2 1 2 2 2 1
class(my_factor_vector)
#[1] "factor"

Here is a factor vector derived from my_imported_dataset and then appended to this same dataset:

my_vs_vector <- my_imported_dataset$vs
my_vs_vector
#[1] 0 0 1 1 0 1 0 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 1 0 0 0 1
str(my_vs_vector)
#int [1:32] 0 0 1 1 0 1 0 1 1 1 ...
## This is an integer vector (like a numeric vector); we can transform it into a factor vector
my_vs_vector_as_factor <- factor(my_vs_vector)
my_vs_vector_as_factor
#[1] 0 0 1 1 0 1 0 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 1 0 0 0 1
#Levels: 0 1
str(my_vs_vector_as_factor)
#Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...

## We can add this factor vector to our data.frame 
my_imported_dataset$vs_factor <- my_vs_vector_as_factor
head(my_imported_dataset)
#               cars  mpg cyl disp  hp drat    wt  qsec vs am gear carb vs_factor
#1         Mazda_RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4         0
#2     Mazda_RX4_Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4         0
#3        Datsun_710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1         1
#4    Hornet_4_Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1         1
#5 Hornet_Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2         0
#6           Valiant 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1         1
str(my_imported_dataset)
#'data.frame':  32 obs. of  13 variables:
# $ cars     : chr  "Mazda_RX4" "Mazda_RX4_Wag" "Datsun_710" "Hornet_4_Drive" ...
# $ mpg      : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
# $ cyl      : int  6 6 4 6 8 6 8 4 4 6 ...
# $ disp     : num  160 160 108 258 360 ...
# $ hp       : int  110 110 93 110 175 105 245 62 95 123 ...
# $ drat     : num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
# $ wt       : num  2.62 2.88 2.32 3.21 3.44 ...
# $ qsec     : num  16.5 17 18.6 19.4 17 ...
# $ vs       : int  0 0 1 1 0 1 0 1 1 1 ...
# $ am       : int  1 1 1 0 0 0 0 0 0 0 ...
# $ gear     : int  4 4 4 3 3 3 3 4 4 4 ...
# $ carb     : int  4 4 1 1 2 1 4 2 2 4 ...
# $ vs_factor: Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
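
One detail of factors that often surprises newcomers is how they convert back to numbers. The sketch below uses hypothetical group codes (not the course dataset) to show levels() and the difference between the internal level codes and the original values:

```r
# Hypothetical example: a numeric vector turned into a factor
group_codes  <- c(10, 20, 10, 30)
group_factor <- factor(group_codes)

levels(group_factor)     # the unique values, stored as text
#[1] "10" "20" "30"

# as.numeric() returns the internal level codes, not the original values
as.numeric(group_factor)
#[1] 1 2 1 3

# Convert through character first to recover the original numbers
as.numeric(as.character(group_factor))
#[1] 10 20 10 30
```

The function table(group_factor) would additionally count how many elements fall in each level.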

5) A logical (Boolean) vector contains TRUE or FALSE values. These vectors and scalars are hugely important in any process that requires control flow, that is, a set of functions and calculations that must choose between alternative processes. In other words, if the evaluation of a logical test is TRUE then do one calculation, but if it is FALSE do another.

# a numeric vector with number from 1 to 10
my_numeric_vector <- 1:10
my_numeric_vector
#[1]  1  2  3  4  5  6  7  8  9 10

Then, we test whether each element of this vector is greater than 5 using the function ifelse(). This function is very useful and has three components: the first is a logical test x > 5 (i.e., TRUE if x is greater than 5, otherwise FALSE), the second is the output when the test is TRUE (in this case the logical value TRUE), and the third is the output when the test is FALSE (in this case the logical value FALSE).

my_logical_vector <- ifelse(my_numeric_vector > 5, TRUE, FALSE)
my_logical_vector
# [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
str(my_logical_vector)
#logi [1:10] FALSE FALSE FALSE FALSE FALSE TRUE ...
class(my_logical_vector)
#[1] "logical"

Here is an alternative way to get the same logical vector. The comparison itself already returns logical values, so ifelse() is not needed:

my_logical_vector <- my_numeric_vector > 5
my_logical_vector
#[1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

Here is a logical vector derived from my_imported_dataset and then appended to this same dataset:

my_am_vector <- my_imported_dataset$am
my_am_vector
#[1] 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1
str(my_am_vector)
#int [1:32] 1 1 1 0 0 0 0 0 0 0 ...

## This is an integer vector (like a numeric vector); we can transform it into a logical vector with an ifelse() test
my_logical_am_vector <- ifelse(my_am_vector == 1, TRUE, FALSE)
my_logical_am_vector
#[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE
#[22] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
str(my_logical_am_vector)
#logi [1:32] TRUE TRUE TRUE FALSE FALSE FALSE ...

## We can add this logical vector to our data.frame 
my_imported_dataset$am_logical <- my_logical_am_vector
head(my_imported_dataset)
#                cars  mpg cyl disp  hp drat    wt  qsec vs am gear carb vs_factor am_logical
#1         Mazda_RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4         0       TRUE
#2     Mazda_RX4_Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4         0       TRUE
#3        Datsun_710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1         1       TRUE
#4    Hornet_4_Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1         1      FALSE
#5 Hornet_Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2         0      FALSE
#6           Valiant 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1         1      FALSE

str(my_imported_dataset)
#'data.frame':  32 obs. of  14 variables:
#$ cars      : chr  "Mazda_RX4" "Mazda_RX4_Wag" "Datsun_710" "Hornet_4_Drive" ...
#$ mpg       : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#$ cyl       : int  6 6 4 6 8 6 8 4 4 6 ...
#$ disp      : num  160 160 108 258 360 ...
#$ hp        : int  110 110 93 110 175 105 245 62 95 123 ...
#$ drat      : num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
#$ wt        : num  2.62 2.88 2.32 3.21 3.44 ...
#$ qsec      : num  16.5 17 18.6 19.4 17 ...
#$ vs        : int  0 0 1 1 0 1 0 1 1 1 ...
#$ am        : int  1 1 1 0 0 0 0 0 0 0 ...
#$ gear      : int  4 4 4 3 3 3 3 4 4 4 ...
#$ carb      : int  4 4 1 1 2 1 4 2 2 4 ...
#$ vs_factor : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
#$ am_logical: logi  TRUE TRUE TRUE FALSE FALSE FALSE ...
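
Logical vectors can also be summarized and used to drive control flow. The following is a small sketch with made-up scores (an assumption for illustration only); it relies on the fact that TRUE counts as 1 and FALSE as 0 in arithmetic:

```r
scores  <- c(2, 8, 5, 9, 1)   # illustrative values
is_high <- scores > 5

sum(is_high)     # number of TRUE values
#[1] 2
mean(is_high)    # proportion of TRUE values
#[1] 0.4
which(is_high)   # positions of the TRUE values
#[1] 2 4
any(is_high)     # is at least one value TRUE?
#[1] TRUE

# A single logical value can drive control flow with if / else
if (any(is_high)) {
  message("At least one score is greater than 5")
} else {
  message("No score is greater than 5")
}
```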

6) There are also special cases of vector elements that are useful, but they can also be confusing. NULL represents the null (empty) object in R. It can exist on its own, but it cannot be an element of a vector alongside other elements: c() simply drops it.

my_numeric_vector <- c(1,2,3,4, NULL)
my_numeric_vector
#[1] 1 2 3 4

An NA element represents a missing value in R. Unlike NULL, it can be an element of a vector and can be stored in other R objects.

my_numeric_vector <- c(1,2,3,4, NA)
my_numeric_vector
#[1]  1  2  3  4 NA
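
The difference between NULL and NA shows up when you measure or summarize a vector. Here is a short sketch (object names are illustrative) using the base functions length(), is.na(), is.null() and the na.rm argument of mean():

```r
values_with_na <- c(1, 2, 3, 4, NA)

length(c(1, 2, 3, 4, NULL))   # NULL is dropped, so only 4 elements remain
#[1] 4
length(values_with_na)        # NA keeps its position in the vector
#[1] 5
is.na(values_with_na)         # locate the missing values
#[1] FALSE FALSE FALSE FALSE  TRUE

mean(values_with_na)                 # the missing value propagates
#[1] NA
mean(values_with_na, na.rm = TRUE)   # ignore missing values
#[1] 2.5
is.null(NULL)
#[1] TRUE
```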

7) We can compare vectors, or a vector against a scalar (i.e., an atomic quantity or object that can hold only one value at a time), using different logical operators. The result is a logical vector containing TRUE or FALSE values (also known as Boolean values).

Note: Boolean values can serve as switches (ON/OFF) in conditional statements.

my_numbers <- 1:10
my_numbers
#[1]  1  2  3  4  5  6  7  8  9 10
## Here is an example using our imported dataset
my_imported_dataset$gear
#[1] 4 4 4 3 3 3 3 4 4 4 4 3 3 3 3 3 3 4 4 4 3 3 3 3 3 4 5 5 5 5 5 4

The operator to test for equality == will determine if the values in the vector are equal to some value.

my_numbers == 2
#[1] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## Here is an example using our imported dataset
my_imported_dataset$gear == 3
#[1] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE
#[22]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

The operator to test for inequality is !=. Notice the !, which can be used with many logical tests to negate them (i.e., to return TRUE where the test would return FALSE, and vice versa).

my_numbers != 2
#[1]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## Here is an example using our imported dataset
my_imported_dataset$gear != 3
# [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE
#[22] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

The operator to test for less than is <.

my_numbers < 2
#[1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## Here is an example using our imported dataset
my_imported_dataset$gear < 4
#[1] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE
#[22]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

The operator to test for less than or equal to is <=.

my_numbers <= 2
#[1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## Here is an example using our imported dataset
my_imported_dataset$gear <= 4
#[1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
#[22]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE

The operator to test for greater than is >.

my_numbers > 2
#[1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## Here is an example using our imported dataset
my_imported_dataset$gear > 4
#[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#[22] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE

The operator to test for greater than or equal to is >=.

my_numbers >= 2
#[1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## Here is an example using our imported dataset
my_imported_dataset$gear >= 4
#[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE
#[22] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

You can also compare two vectors element by element. They should have the same number of elements; otherwise R recycles the shorter one.

a <- c(2,10,4) # three elements
b <- 2:4 # three elements
a == b
# [1]  TRUE FALSE  TRUE

Here is a comparison from vectors derived from our imported dataset:

## we can test whether the vs and am values are the same; the comparison is done one pair of elements at a time
my_imported_dataset$vs == my_imported_dataset$am
#[1] FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
#[22]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE
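
Comparisons can be combined with the element-wise operators & (and) and | (or). The sketch below uses simple typed vectors (illustrative names) and also shows what happens when the two vectors being compared differ in length:

```r
x <- 1:10

x > 3 & x < 8      # both conditions must be TRUE
# [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
x < 3 | x > 8      # at least one condition must be TRUE
# [1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE

# Vectors of different length: the shorter one is recycled (here as 1 2 1 2)
long_vector  <- c(1, 2, 3, 4)
short_vector <- c(1, 2)
long_vector == short_vector
#[1]  TRUE  TRUE FALSE FALSE
```

Recycling is silent when the longer length is a multiple of the shorter one, so it is worth checking lengths with length() before comparing.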

8) You can select specific elements of a vector by their index, that is, their position in the vector. Indices in R start at 1.

# sequence of numbers between 1 and 10 every 2 numbers
my_vector <- seq(1,10,2) 
my_vector
#[1] 1 3 5 7 9

If you want the third element of ‘my_vector’ then you use [3].

element_3 <- my_vector[3]
element_3
#[1] 5

9) If you want to drop an element using its index, put a minus sign (-) before the index of the element to remove. This returns a new vector; my_vector itself is unchanged unless you reassign it.

my_vector_without_element_3 <- my_vector[-3]
my_vector_without_element_3
#[1] 1 3 7 9

10) We can also use a vector of indices to select multiple elements within a given vector (e.g., my_vector).

my_vector[c(1,2,5)] # select elements 1, 2 and 5
#[1] 1 3 9

11) We can also use a logical test to select the elements that meet a condition.

# select elements that are more than 3
my_vector[my_vector > 3] 
#[1] 5 7 9

12) Some other examples: select the even elements of a vector, or those divisible by 3, using the modulo operator %%.

my_numbers
#[1]  1  2  3  4  5  6  7  8  9 10
# Select even numbers
my_numbers[my_numbers %% 2 == 0] 
#[1]  2  4  6  8 10
# Select numbers divisible by 3
my_numbers[my_numbers %% 3 == 0] 
#[1] 3 6 9
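
The same index notation can be used on the left side of an assignment to replace elements. This is a brief sketch with an illustrative vector v, not an object used elsewhere in this chapter:

```r
v <- seq(1, 10, 2)    # 1 3 5 7 9

v[2] <- 100           # replace the second element
v
#[1]   1 100   5   7   9

v[v > 6] <- 0         # replace every element greater than 6
v
#[1] 1 0 5 0 0

v[length(v)]          # the last element, whatever the vector length
#[1] 0
```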

13) We can test whether elements match a set of terms using the operator %in%, which returns TRUE for each element of the left operand that occurs in the right operand.

my_names <- c("juan", "c", "santos")
name_key <- c("peter", "juan", "randy", "david", "leeann")
name_key %in% my_names
#[1] FALSE  TRUE FALSE FALSE FALSE
name_key[name_key %in% my_names]
#[1] "juan"

If you want the opposite (i.e., return those that do not match the key terms), then add a ! (as we did above).

name_key[!name_key %in% my_names]
#[1] "peter"  "randy"  "david"  "leeann"
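
Base R also provides set functions that give the same answers more directly. A short sketch, reusing the two vectors defined above:

```r
my_names <- c("juan", "c", "santos")
name_key <- c("peter", "juan", "randy", "david", "leeann")

intersect(name_key, my_names)   # elements present in both vectors
#[1] "juan"
setdiff(name_key, my_names)     # elements of name_key not in my_names
#[1] "peter"  "randy"  "david"  "leeann"
match("juan", name_key)         # position of the first match
#[1] 2
```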

14) Standard arithmetic calculations can be done with numeric vectors. Each operation is applied element by element:

my_numbers <- 1:10
my_numbers
#[1]  1  2  3  4  5  6  7  8  9 10
#Addition
my_numbers + 1 
#[1]  2  3  4  5  6  7  8  9 10 11
#Subtraction
my_numbers - 1 
#[1] 0 1 2 3 4 5 6 7 8 9
#Multiplication
my_numbers * 2 
#[1]  2  4  6  8 10 12 14 16 18 20
#Division
my_numbers / 3 
#[1] 0.3333333 0.6666667 1.0000000 1.3333333 1.6666667 2.0000000 2.3333333 2.6666667 3.0000000 3.3333333
#Exponentiation
my_numbers ^ 2 
#[1]   1   4   9  16  25  36  49  64  81 100
#Other functions (results not shown)
log(my_numbers)
sqrt(my_numbers)
sin(my_numbers)
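
Besides element-wise operations, numeric vectors can be summarized with a single call. The following sketch uses base R functions only:

```r
my_numbers <- 1:10

sum(my_numbers)
#[1] 55
mean(my_numbers)
#[1] 5.5
range(my_numbers)      # minimum and maximum
#[1]  1 10
cumsum(my_numbers)     # running total
# [1]  1  3  6 10 15 21 28 36 45 55

# Arithmetic between two vectors of equal length is also element by element
c(1, 2, 3) * c(10, 20, 30)
#[1] 10 40 90
sqrt(c(1, 4, 9))
#[1] 1 2 3
```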

15) You can append elements to your vector.

# append 11 to vector
my_numbers <- c(my_numbers,11)
my_numbers
#[1]  1  2  3  4  5  6  7  8  9 10 11

16) You can repeat vectors or elements of vectors using the function rep(). To repeat the complete vector, specify the argument times. For example, to repeat the vector c(0, 0, 7) three times, use the following code.

rep(c(0, 0, 7), times = 3)
#[1] 0 0 7 0 0 7 0 0 7

You also can repeat every value by specifying the argument each, like this:

rep(c(2, 4, 2), each = 3)
#[1] 2 2 2 4 4 4 2 2 2

You can also tell R how many times to repeat each value by passing a vector to the argument times.

rep(c(0, 7), times = c(4,2))
#[1] 0 0 0 0 7 7

And you can, like in seq(), use the argument length.out to tell R how long you want it to be. R will repeat the vector until it reaches that length even if the last repetition is incomplete, like so:

rep(1:3,length.out=7)
#[1] 1 2 3 1 2 3 1
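
Since seq() is mentioned alongside rep(), here is a short sketch of its main arguments (by and length.out), plus one way rep() might be used to label hypothetical experimental groups:

```r
seq(0, 1, by = 0.25)          # fixed step size
#[1] 0.00 0.25 0.50 0.75 1.00
seq(1, 10, length.out = 4)    # fixed number of values
#[1]  1  4  7 10
seq_len(5)                    # integers from 1 to 5
#[1] 1 2 3 4 5

# Hypothetical group labels built with rep()
group_labels <- rep(c("control", "treated"), each = 2)
group_labels
#[1] "control" "control" "treated" "treated"
```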

3.2 Matrices

17) A matrix object is defined by elements arranged in rows and columns (i.e., n rows x m columns). Matrices behave like vectors with dimensions, so all elements must be of the same type; operations between matrices may require matrix algebra.

my_matrix <- matrix(data = c(1,2,3,11,12,13), 
                    nrow = 2, 
                    ncol = 3, 
                   byrow = TRUE)
my_matrix 
#     [,1] [,2] [,3]
#[1,]    1    2    3
#[2,]   11   12   13
str(my_matrix)
#num [1:2, 1:3] 1 11 2 12 3 13
class(my_matrix)
#[1] "matrix" "array" 
## Note: R versions before 4.0.0 print only "matrix"
my_matrix + 1
#     [,1] [,2] [,3]
#[1,]    2    3    4
#[2,]   12   13   14
my_matrix * 2
#     [,1] [,2] [,3]
#[1,]    2    4    6
#[2,]   22   24   26

The operators above work element by element. Arithmetic between two matrices must follow the rules of matrix algebra (e.g., matrix multiplication uses the operator %*%, not *). We will not use these data structures that much.
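
For completeness, here is a minimal sketch of indexing a matrix and of matrix multiplication with %*%, rebuilding the same my_matrix as above. The transpose function t() is used so that the dimensions are compatible (2x3 times 3x2):

```r
my_matrix <- matrix(data = c(1, 2, 3, 11, 12, 13),
                    nrow = 2,
                    ncol = 3,
                   byrow = TRUE)

dim(my_matrix)       # rows, columns
#[1] 2 3
my_matrix[2, 3]      # element in row 2, column 3
#[1] 13
my_matrix[, 2]       # the whole second column
#[1]  2 12

# Matrix product of a 2x3 matrix and its 3x2 transpose gives a 2x2 matrix
my_matrix %*% t(my_matrix)
#     [,1] [,2]
#[1,]   14   74
#[2,]   74  434
```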

3.3 Arrays

18) Arrays are similar to matrices, but they can store data in more than 2 dimensions (e.g., x * y * z). In an array object, you can store several matrices in a cube-like organization. For example, an array with dimensions (2, 3, 5) holds 5 rectangular matrices, each with 2 rows and 3 columns. Arrays are created with the function array(), which takes vectors as input and uses the values in the argument dim to create the 3D structure of the array.

We will create an array of five 2x3 matrices. In other words, each of these matrices has 2 rows and 3 columns, and 5 of them are stacked in the cube-like structure of the array object.

# Five input vectors.
vector1 <- c(1,2,3)
vector2 <- c(4,5,6,7)
vector3 <- c(8,9,10,11,12)
vector4 <- c(13,14,15,16,17)
vector5 <- c(18,19,20,21,22)

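# Take these vectors as input to the array. Together they hold 22 values, but the array has
# 2 x 3 x 5 = 30 cells, so R recycles the values from the start to fill the last 8 cells.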
my_array <- array(c(vector1,vector2,vector3,vector4,vector5),dim = c(2,3,5))
my_array
#, , 1
#
#     [,1] [,2] [,3]
#[1,]    1    3    5
#[2,]    2    4    6
#
#, , 2
#
#     [,1] [,2] [,3]
#[1,]    7    9   11
#[2,]    8   10   12
#
#, , 3
#
#     [,1] [,2] [,3]
#[1,]   13   15   17
#[2,]   14   16   18
#
#, , 4
#
#     [,1] [,2] [,3]
#[1,]   19   21    1
#[2,]   20   22    2
#
#, , 5
#
#     [,1] [,2] [,3]
#[1,]    3    5    7
#[2,]    4    6    8

str(my_array)
#num [1:2, 1:3, 1:5] 1 2 3 4 5 6 7 8 9 10 ...
class(my_array)
#[1] "array"

## Print the second row (2) of the fifth matrix (5) of the array
my_array[2,,5]
#[1] 4 6 8

You can do calculations with these array elements.

## add the matrices 2 and 5
my_array[,,2] + my_array[,,5]
#     [,1] [,2] [,3]
#[1,]   10   14   18
#[2,]   12   16   20

## multiply matrices 2 and 5
my_array[,,2] * my_array[,,5]
#     [,1] [,2] [,3]
#[1,]   21   45   77
#[2,]   32   60   96

## Use apply() to calculate the sum of each row, indicated by the margin c(1), across all the matrices. The result is a vector with two elements (i.e., the sums of each of the two rows over all matrices in the array).
apply(my_array, c(1), sum)
#[1] 137 152

## Use apply() to calculate the sum of each column, indicated by the margin c(2), across all the matrices. The result is a vector with three elements because each matrix has three columns.
apply(my_array, c(2), sum)
#[1]  91 111  87
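
The margin argument of apply() can also point to the third dimension, or to two dimensions at once. The sketch below rebuilds an array with the same values as my_array; rep_len() is used here only to make the recycling of the 22 input values explicit (it stores integers rather than numeric values, which does not change the sums):

```r
# Same values as my_array above: 1 to 22, recycled to fill 30 cells
my_array <- array(rep_len(1:22, 30), dim = c(2, 3, 5))

dim(my_array)
#[1] 2 3 5

# Margin 3: one sum per matrix (five values)
apply(my_array, 3, sum)
#[1] 21 57 93 85 33

# Margins 1 and 2: sum each cell position across the five matrices
apply(my_array, c(1, 2), sum)
#     [,1] [,2] [,3]
#[1,]   43   53   41
#[2,]   48   58   46
```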

3.4 Data Frames

19) A data frame is an extremely flexible object in the R environment. Most packages, functions and applications use data frames to store, search, transform and filter data. A data frame is similar to a matrix (or an Excel worksheet), but it also behaves like a list and a table.

Long and Teetor (2019) provide some useful characteristics of a data frame. To an R programmer, a data frame is a hybrid data structure, part matrix and part list. A column can contain numbers, character strings, or factors, but not a mix of them. You can index the data frame just like you index a matrix. The data frame is also a list, where the list elements are the columns, so you can access columns by using list operators.

Therefore, the flexibility of data frames derives from some of their properties:

  1. The elements (cells) of a data frame are usually numeric, character, logical, or factor
  2. The columns and rows can have names
  3. You can extract columns as vectors and rows as one-row data frames
  4. You can filter and subset the data frame based on rules, functions and conditions
  5. You can store results of calculations in elements (cells) of a data frame
  6. You can append new columns and rows to a data frame
  7. You can coerce other data structures into data frames (e.g., collect vectors into a data frame, transform from a matrix). More complex data structures (e.g., tibbles, data.tables, biostrings objects) are usually modifications of data frames and can also be coerced to data frames
  8. Most data (e.g., CSV or tab-delimited datasets) are imported into R as data frames
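
The hybrid matrix/list behavior described above can be sketched with a small data frame. The object fruit_table and its columns are made up for illustration:

```r
fruit_table <- data.frame(id = c(1, 2, 3),
                       fruit = c("apple", "pear", "grape"),
            stringsAsFactors = FALSE)

# List-like behavior: the columns are the list elements
is.list(fruit_table)
#[1] TRUE
length(fruit_table)          # number of columns
#[1] 2
fruit_table[["fruit"]]       # list operator returns the column as a vector
#[1] "apple" "pear"  "grape"

# Matrix-like behavior: index by [row, column]
fruit_table[2, "fruit"]
#[1] "pear"
nrow(fruit_table)
#[1] 3

# Storing the result of a calculation as a new column
fruit_table$id_doubled <- fruit_table$id * 2
fruit_table$id_doubled
#[1] 2 4 6
```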

20) You can build a data frame as follows by typing its contents in the R console (or by copying and pasting from a text editor).

my_dataframe <- data.frame(my_numbers = c(1,2,3,4,5),
                              my_text = c("apple", "pear", "grape", "pomegranate", "banana"),
                          my_boroughs = c("queens", "the_bronx", "brooklyn", "manhattan", "staten_island"),
                     stringsAsFactors = FALSE)
my_dataframe
#  my_numbers     my_text   my_boroughs
#1          1       apple        queens
#2          2        pear     the_bronx
#3          3       grape      brooklyn
#4          4 pomegranate     manhattan
#5          5      banana staten_island
str(my_dataframe)
#'data.frame':  5 obs. of  3 variables:
#$ my_numbers : num  1 2 3 4 5
#$ my_text    : chr  "apple" "pear" "grape" "pomegranate" ...
#$ my_boroughs: chr  "queens" "the_bronx" "brooklyn" "manhattan" ...
class(my_dataframe)
#[1] "data.frame"

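Notice the stringsAsFactors = FALSE argument while constructing (typing) the data.frame. This prevents R from converting character columns into factors. Since R 4.0.0 this is the default behavior, but writing it explicitly keeps the result the same on older versions.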

21) You can obtain the dimensions of your data frame (i.e., number of rows and columns) with the function dim().

# the object 'my_dataframe' has 5 rows and 3 columns
dim(my_dataframe)
#[1] 5 3

22) You can also import text files (comma-delimited *.csv files or tab-delimited *.txt files) as data frames. More on this in the section on how to import data.

## NOTE: remember to update the path to file with your dataset in your computer -- THIS IS EXCLUSIVE TO YOUR COMPUTER AND IT IS NOT THE PATH SHOWN BELOW

my_dataframe <- read.table(file = "~/Desktop/Teach_R/my_dataframe_csv.csv", 
                         header = TRUE, 
                            sep = ",",
               stringsAsFactors = FALSE)

my_dataframe <- read.table(file = "~/Desktop/Teach_R/my_dataframe_tab.txt", 
                         header = TRUE, 
                            sep = "\t",
               stringsAsFactors = FALSE)

Notice that the argument sep = is either "," or "\t"; this tells R to separate columns at commas or tabs, respectively.
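
If you do not have the class files at hand, you can try the import step with a temporary file. The sketch below is an assumption-laden illustration: example_df and its columns are made up, and tempfile() is used so that no path on your computer needs to exist. It writes a small data frame to disk and reads it back:

```r
# A small made-up data frame to write and re-import
example_df <- data.frame(sample_id = c("s1", "s2", "s3"),
                             reads = c(120, 98, 143),
                  stringsAsFactors = FALSE)

# Tab-delimited round trip
tab_file <- tempfile(fileext = ".txt")
write.table(example_df, file = tab_file, sep = "\t",
            row.names = FALSE, quote = FALSE)
reimported_tab <- read.table(file = tab_file,
                           header = TRUE,
                              sep = "\t",
                 stringsAsFactors = FALSE)

# Comma-delimited round trip; read.csv() is read.table() with sep = "," and header = TRUE preset
csv_file <- tempfile(fileext = ".csv")
write.csv(example_df, file = csv_file, row.names = FALSE)
reimported_csv <- read.csv(file = csv_file, stringsAsFactors = FALSE)

# Whole numbers are read back as integers, so check the types after importing
str(reimported_tab)
```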

23) As mentioned, data frames are extremely flexible and you can call columns, rows or specific elements.

This returns a vector of the column named my_boroughs.

my_dataframe$my_boroughs
#[1] "queens"        "the_bronx"     "brooklyn"      "manhattan"     "staten_island"

The same result is obtained by giving the position of the column my_boroughs (the third column).

my_dataframe[,3]
#[1] "queens"        "the_bronx"     "brooklyn"      "manhattan"     "staten_island"

Same result.

my_dataframe[,"my_boroughs"]
#[1] "queens"        "the_bronx"     "brooklyn"      "manhattan"     "staten_island"

This returns a subset data frame with the fourth row.

my_dataframe[4,]
#  my_numbers     my_text my_boroughs
#4          4 pomegranate   manhattan

This returns an element on the first row and third column.

my_dataframe[1,3]
#[1] "queens"
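
Two further indexing patterns are worth knowing: selecting several rows or columns at once, and keeping a single column as a data frame with drop = FALSE. This is a sketch on a shortened, illustrative copy of the data frame (fruit_df is not an object defined elsewhere in the chapter):

```r
fruit_df <- data.frame(my_numbers = c(1, 2, 3),
                          my_text = c("apple", "pear", "grape"),
                      my_boroughs = c("queens", "the_bronx", "brooklyn"),
                 stringsAsFactors = FALSE)

# Several columns by name: the result is still a data frame
two_columns <- fruit_df[, c("my_numbers", "my_text")]
class(two_columns)
#[1] "data.frame"

# One column normally becomes a vector; drop = FALSE keeps it as a data frame
one_column_df <- fruit_df[, "my_text", drop = FALSE]
class(one_column_df)
#[1] "data.frame"

# Several rows by position
rows_1_and_3 <- fruit_df[c(1, 3), ]
rows_1_and_3$my_text
#[1] "apple" "grape"
```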

24) Subsetting a data frame by a condition is one of the most common operations applied to data frames. These conditions are used to filter or extract data from the data frame.

This returns a data frame for the column named my_boroughs.

subset(my_dataframe, select = my_boroughs)
#    my_boroughs
#1        queens
#2     the_bronx
#3      brooklyn
#4     manhattan
#5 staten_island

This returns a subset data frame with the rows where the values of column my_numbers are greater than or equal to 3.

subset(my_dataframe, my_numbers >= 3)
#  my_numbers     my_text   my_boroughs
#3          3       grape      brooklyn
#4          4 pomegranate     manhattan
#5          5      banana staten_island

This returns a subset data frame with the rows where the column my_text is exactly equal to banana.

subset(my_dataframe, my_text %in% "banana")
#  my_numbers my_text   my_boroughs
#5          5  banana staten_island

This returns a subset data frame with the rows where the column my_boroughs contains the text _island. Notice that this is an application of the function grepl(), which searches for the text pattern _island within each element of the column my_boroughs and returns TRUE even if the pattern is only part of the word.

subset(my_dataframe, grepl(pattern = "_island", my_dataframe$my_boroughs))
#  my_numbers my_text   my_boroughs
#5          5  banana staten_island
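
Conditions can be combined, and every subset() call has an equivalent written with square brackets. The following sketch rebuilds the typed data frame from above under the illustrative name fruit_df and filters it both ways; the pattern "an" is just an example:

```r
fruit_df <- data.frame(my_numbers = c(1, 2, 3, 4, 5),
                          my_text = c("apple", "pear", "grape", "pomegranate", "banana"),
                      my_boroughs = c("queens", "the_bronx", "brooklyn", "manhattan", "staten_island"),
                 stringsAsFactors = FALSE)

# Rows with my_numbers >= 3 AND my_text containing "an"; keep two columns
with_subset <- subset(fruit_df,
                      my_numbers >= 3 & grepl("an", my_text),
                      select = c(my_numbers, my_text))

# The same filter with square brackets: [row condition, columns]
with_brackets <- fruit_df[fruit_df$my_numbers >= 3 & grepl("an", fruit_df$my_text),
                          c("my_numbers", "my_text")]

with_subset
#  my_numbers     my_text
#4          4 pomegranate
#5          5      banana
```

Inside subset() the column names can be written directly, whereas the bracket form needs the fruit_df$ prefix.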

25) You can append vectors as columns (e.g., variables) to a data frame as long as each vector has as many elements as the data frame has rows.

my_dataframe$my_colors <- c("red","orange","blue","black","green")
my_dataframe
#  my_numbers     my_text   my_boroughs my_colors
#1          1       apple        queens       red
#2          2        pear     the_bronx    orange
#3          3       grape      brooklyn      blue
#4          4 pomegranate     manhattan     black
#5          5      banana staten_island     green

26) You can append rows to a data frame with rbind(), as long as the rows come from another data frame with the same column names.

not_NYC <- data.frame(my_numbers = 6,
                         my_text = "potato",
                     my_boroughs = "not_NYC",
                       my_colors = "purple",
                stringsAsFactors = FALSE)
my_dataframe_2 <- rbind(my_dataframe,not_NYC)
my_dataframe_2
#  my_numbers     my_text   my_boroughs my_colors
#1          1       apple        queens       red
#2          2        pear     the_bronx    orange
#3          3       grape      brooklyn      blue
#4          4 pomegranate     manhattan     black
#5          5      banana staten_island     green
#6          6      potato       not_NYC    purple

27) You can remove columns and rows from a data frame in a similar fashion as we did for vectors. The result is a new data frame; the original is unchanged unless you reassign it.

This will remove the last row (i.e., row 6) of my_dataframe_2.

my_dataframe_2[-6,]
#  my_numbers     my_text   my_boroughs my_colors
#1          1       apple        queens       red
#2          2        pear     the_bronx    orange
#3          3       grape      brooklyn      blue
#4          4 pomegranate     manhattan     black
#5          5      banana staten_island     green

This will remove the second column (i.e., column 2) of my_dataframe_2.

my_dataframe_2[,-2]
#  my_numbers   my_boroughs my_colors
#1          1        queens       red
#2          2     the_bronx    orange
#3          3      brooklyn      blue
#4          4     manhattan     black
#5          5 staten_island     green
#6          6       not_NYC    purple
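
Removing columns by position can break if the column order changes, so it is often safer to remove them by name. A minimal sketch with an illustrative data frame (city_df is a made-up name):

```r
city_df <- data.frame(my_numbers = c(1, 2, 3),
                         my_text = c("apple", "pear", "grape"),
                       my_colors = c("red", "orange", "blue"),
                stringsAsFactors = FALSE)

# Keep every column except my_text
without_text <- city_df[, setdiff(names(city_df), "my_text")]
names(without_text)
#[1] "my_numbers" "my_colors"

# Alternative: assign NULL to the column (this modifies the object, so work on a copy)
city_copy <- city_df
city_copy$my_text <- NULL
names(city_copy)
#[1] "my_numbers" "my_colors"

# Rows can be dropped with a negated condition instead of a position
without_pear <- city_df[city_df$my_text != "pear", ]
without_pear$my_text
#[1] "apple" "grape"
```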

28) You can also change the names of the columns on a data frame easily using the function names().

my_dataframe_for_names <- my_dataframe
#get a vector of names of columns
names(my_dataframe_for_names)
#[1] "my_numbers"  "my_text"     "my_boroughs" "my_colors" 
# to change names, you provide a vector with new names
names(my_dataframe_for_names) <- c("numbers", "fruits", "boroughs", "colors")
my_dataframe_for_names
#  numbers      fruits      boroughs colors
#1       1       apple        queens    red
#2       2        pear     the_bronx orange
#3       3       grape      brooklyn   blue
#4       4 pomegranate     manhattan  black
#5       5      banana staten_island  green

You can also change a single column name, but you have to indicate its position in the vector of column names.

names(my_dataframe_for_names)[3] <- "the_boroughs"
my_dataframe_for_names
#  numbers      fruits  the_boroughs colors
#1       1       apple        queens    red
#2       2        pear     the_bronx orange
#3       3       grape      brooklyn   blue
#4       4 pomegranate     manhattan  black
#5       5      banana staten_island  green
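
If you do not want to count column positions, a logical test on names() can locate the column to rename. A sketch with an illustrative data frame (renamed_df and the new name fruit_names are made up):

```r
renamed_df <- data.frame(numbers = c(1, 2),
                          fruits = c("apple", "pear"),
                stringsAsFactors = FALSE)

# Rename the column called "fruits", wherever it is located
names(renamed_df)[names(renamed_df) == "fruits"] <- "fruit_names"
names(renamed_df)
#[1] "numbers"     "fruit_names"
```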

29) You can remove NA elements from a data frame, but doing so removes every row that contains such values.

Let’s create a data frame that has NA elements.

my_dataframe_with_NAs <- data.frame (numbers_A = c(1,2,3,4),
                                     numbers_B = c(5,NA,7,NA),
                                     numbers_C = c(9, NA, 11,12),
                                     stringsAsFactors = FALSE)
my_dataframe_with_NAs
#  numbers_A numbers_B numbers_C
#1         1         5         9
#2         2        NA        NA
#3         3         7        11
#4         4        NA        12

We can remove all rows that contain at least one NA element using the function na.omit().

na.omit(my_dataframe_with_NAs)
#  numbers_A numbers_B numbers_C
#1         1         5         9
#3         3         7        11

We can also remove only the rows with NA values in a given column (e.g., numbers_C) using the function complete.cases().

my_dataframe_with_NAs[complete.cases(my_dataframe_with_NAs[,"numbers_C"]),]
#  numbers_A numbers_B numbers_C
#1         1         5         9
#3         3         7        11
#4         4        NA        12
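
Before removing anything, it is often useful to count where the NA values are. The sketch below rebuilds the same small data frame under a different, hypothetical name (toy_na) so it can run on its own.

```r
# Copy of the NA example above under a hypothetical name
toy_na <- data.frame(numbers_A = c(1, 2, 3, 4),
                     numbers_B = c(5, NA, 7, NA),
                     numbers_C = c(9, NA, 11, 12))

# Number of NA values in each column
na_per_column <- colSums(is.na(toy_na))
na_per_column

# Logical vector: TRUE for rows without any NA
rows_complete <- complete.cases(toy_na)
rows_complete

# Number of rows that na.omit() would remove
sum(!rows_complete)
```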

30) You can get an overview of your data frame with the following functions.

You can get its structure with str(). The object mtcars is a built-in dataset (‘Motor Trend Car Road Tests’) that is often used to illustrate examples in R.

my_long_dataframe <- mtcars
str(my_long_dataframe)
#'data.frame':  32 obs. of  11 variables:
#$ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#$ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
#$ disp: num  160 160 108 258 360 ...
#$ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
#$ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
#$ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
#$ qsec: num  16.5 17 18.6 19.4 17 ...
#$ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
#$ am  : num  1 1 1 0 0 0 0 0 0 0 ...
#$ gear: num  4 4 4 3 3 3 3 4 4 4 ...
#$ carb: num  4 4 1 1 2 1 4 2 2 4 ...

See the first rows with head().

head(my_long_dataframe)
#                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
#Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

See the last rows with tail().

tail(my_long_dataframe)
#                mpg cyl  disp  hp drat    wt qsec vs am gear carb
#Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
#Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
#Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
#Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
#Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
#Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2

You can see the dimensions of a data frame with dim().

#returns a 2-element vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)
dim(my_long_dataframe) 
#[1] 32 11

You can see the number of rows with nrow().

nrow(my_long_dataframe)
#[1] 32

You can see the number of columns with ncol().

ncol(my_long_dataframe)
#[1] 11

31) You can get a summary of a data frame that includes summary statistics (min, max, median, mean, etc.) for each numeric column.

# names of columns
names(my_long_dataframe)
#[1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb"
summary(my_long_dataframe)
#      mpg             cyl             disp             hp             drat             wt             qsec             vs        
#Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0   Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
#1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
#Median :19.20   Median :6.000   Median :196.3   Median :123.0   Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
#Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7   Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
#3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
#Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0   Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
#      am              gear            carb      
#Min.   :0.0000   Min.   :3.000   Min.   :1.000  
#1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
#Median :0.0000   Median :4.000   Median :2.000  
#Mean   :0.4062   Mean   :3.688   Mean   :2.812  
#3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
#Max.   :1.0000   Max.   :5.000   Max.   :8.000  
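
The same function can be applied to a single column, and individual statistics can be computed for a few chosen columns. This is a minimal sketch that uses the built-in mtcars directly; the choice of columns is arbitrary.

```r
# Summary statistics for one column only
mpg_summary <- summary(mtcars$mpg)
mpg_summary

# Mean of a few selected columns; sapply() applies mean() to each column
column_means <- sapply(mtcars[, c("mpg", "hp", "wt")], mean)
column_means
```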

3.5 Tables

32) A table object resembles a data frame, but it is used to summarize or tabulate your data, especially categorical variables. Usually, you use a table object to summarize a data frame.

A contingency table is created with the function table(), which tabulates counts for one or more variables. Missing values (NA) are excluded from the counts unless you specify the argument useNA = "ifany" or useNA = "always".

To exemplify a table, we append my_dataframe, my_dataframe_2, and a copy of my_dataframe_2 without its first row into a single data frame.

my_dataframe_3 <- rbind(my_dataframe,my_dataframe_2, my_dataframe_2[-1,])

Now, we can tabulate values in my_dataframe_3. For example, a table of the number of times each borough is repeated.

table(my_dataframe_3$my_boroughs)
# brooklyn     manhattan       not_NYC        queens staten_island     the_bronx 
#        3             3             2             2             3             3 

A table of the number of times each my_text element is repeated.

table(my_dataframe_3$my_text)
#      apple      banana       grape        pear pomegranate      potato 
#          2           3           3           3           3           2 

A two-way table for my_boroughs and my_text. In this table, we can see how often each combination of terms occurs at the same position in both vectors (e.g., queens and apple occur together 2 times).

table(my_dataframe_3$my_boroughs, my_dataframe_3$my_text) 
#               apple banana grape pear pomegranate potato
# brooklyn          0      0     3    0           0      0
# manhattan         0      0     0    0           3      0
# not_NYC           0      0     0    0           0      2
# queens            2      0     0    0           0      0
# staten_island     0      3     0    0           0      0
# the_bronx         0      0     0    3           0      0
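
Row and column totals can be added to a two-way table with addmargins(). The following is a small sketch; the two vectors (toy_boroughs and toy_fruits) are hypothetical and much shorter than the data frame used above.

```r
# Hypothetical paired vectors used only for this illustration
toy_boroughs <- c("queens", "queens", "brooklyn", "brooklyn", "brooklyn")
toy_fruits   <- c("apple",  "pear",   "grape",    "grape",    "apple")

# Two-way table of counts
toy_table <- table(toy_boroughs, toy_fruits)

# Same table with an extra "Sum" row and "Sum" column
toy_table_totals <- addmargins(toy_table)
toy_table_totals
```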

33) When dealing with NA values, tables can include or exclude them from the tabulation.

We can create a data frame with NA values to append.

my_dataframe_4 <- data.frame(my_numbers = c(6,NA),
                         my_text = c(NA,"potato"),
                     my_boroughs = c("not_NYC",NA),
                       my_colors = c(NA,"purple"),
                stringsAsFactors = FALSE)
my_dataframe_5 <- rbind(my_dataframe_3,my_dataframe_4)

We now tabulate my_dataframe_5. By default, NA values are excluded.

table(my_dataframe_5$my_boroughs, my_dataframe_5$my_text)
#               apple banana grape pear pomegranate potato
# brooklyn          0      0     3    0           0      0
# manhattan         0      0     0    0           3      0
# not_NYC           0      0     0    0           0      2
# queens            2      0     0    0           0      0
# staten_island     0      3     0    0           0      0
# the_bronx         0      0     0    3           0      0

However, we can also tabulate the NA values with the argument useNA = "always".

table(my_dataframe_5$my_boroughs, my_dataframe_5$my_text, useNA = "always") 
#               apple banana grape pear pomegranate potato <NA>
# brooklyn          0      0     3    0           0      0    0
# manhattan         0      0     0    0           3      0    0
# not_NYC           0      0     0    0           0      2    1
# queens            2      0     0    0           0      0    0
# staten_island     0      3     0    0           0      0    0
# the_bronx         0      0     0    3           0      0    0
# <NA>              0      0     0    0           0      1    0
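
The two useNA options differ only when no NA is present: "ifany" adds the NA category only if at least one NA exists, while "always" adds it regardless. This is a minimal sketch with two hypothetical vectors.

```r
# Hypothetical vectors used only for this illustration
with_na    <- c("queens", "brooklyn", NA, "queens")
without_na <- c("queens", "brooklyn", "queens")

# "ifany": the <NA> category appears because with_na contains an NA
table_ifany_na <- table(with_na, useNA = "ifany")
table_ifany_na

# "ifany": no <NA> category because without_na has no NA
table_ifany_clean <- table(without_na, useNA = "ifany")
table_ifany_clean

# "always": the <NA> category appears with a count of 0
table_always_clean <- table(without_na, useNA = "always")
table_always_clean
```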

34) We can inspect and label the dimensions (rows and columns) of a table using the function dimnames().

table_subset <- table(my_dataframe_5$my_boroughs, my_dataframe_5$my_text) 
dimnames(table_subset)
#[[1]]
#[1] "brooklyn"      "manhattan"     "not_NYC"       "queens"        "staten_island" "the_bronx"    
#[[2]]
#[1] "apple"       "banana"      "grape"       "pear"        "pomegranate" "potato"     

The dimensions themselves are unnamed (i.e., no name is associated with [[1]] or [[2]]). However, we can name them accordingly.

names(dimnames(table_subset)) <- c("my_boroughs", "my_text")
table_subset
#               my_text
#my_boroughs     apple banana grape pear pomegranate potato
# brooklyn          0      0     3    0           0      0
# manhattan         0      0     0    0           3      0
# not_NYC           0      0     0    0           0      2
# queens            2      0     0    0           0      0
# staten_island     0      3     0    0           0      0
# the_bronx         0      0     0    3           0      0
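
The dimension names can also be given at the moment the table is created, by passing named arguments to table(). This is a short sketch with hypothetical vectors (toy_boroughs and toy_fruits).

```r
# Hypothetical paired vectors used only for this illustration
toy_boroughs <- c("queens", "queens", "brooklyn", "brooklyn", "brooklyn")
toy_fruits   <- c("apple",  "pear",   "grape",    "grape",    "apple")

# The argument names become the names of the table dimensions
named_table <- table(my_boroughs = toy_boroughs, my_text = toy_fruits)
names(dimnames(named_table))
named_table
```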

35) You can also get proportions or percentages for the tabulations in your table object with the function prop.table().

prop.table(table_subset)
#               my_text
#my_boroughs     apple banana  grape   pear pomegranate potato
# brooklyn      0.0000 0.0000 0.1875 0.0000      0.0000 0.0000
# manhattan     0.0000 0.0000 0.0000 0.0000      0.1875 0.0000
# not_NYC       0.0000 0.0000 0.0000 0.0000      0.0000 0.1250
# queens        0.1250 0.0000 0.0000 0.0000      0.0000 0.0000
# staten_island 0.0000 0.1875 0.0000 0.0000      0.0000 0.0000
# the_bronx     0.0000 0.0000 0.0000 0.1875      0.0000 0.0000
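
By default the proportions are relative to the grand total of the table. With the argument margin you can instead get proportions within each row (margin = 1) or within each column (margin = 2). The following is a minimal sketch with hypothetical vectors.

```r
# Hypothetical paired vectors used only for this illustration
toy_boroughs <- c("queens", "queens", "brooklyn", "brooklyn", "brooklyn")
toy_fruits   <- c("apple",  "pear",   "grape",    "grape",    "apple")
toy_table <- table(toy_boroughs, toy_fruits)

# Proportions within each row: every row adds up to 1
row_props <- prop.table(toy_table, margin = 1)
row_props

# Proportions within each column: every column adds up to 1
col_props <- prop.table(toy_table, margin = 2)

# Multiply by 100 and round to express row proportions as percentages
round(100 * row_props, 1)
```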

The function prop.table() also works with a one-way table built from a single vector.

## Let's find the proportion of names within a vector of 'my_boroughs'
prop.table(table(my_dataframe_5$my_boroughs))
#     brooklyn     manhattan       not_NYC        queens staten_island     the_bronx 
#    0.1764706     0.1764706     0.1764706     0.1176471     0.1764706     0.1764706

proportions_my_boroughs <- as.numeric(prop.table(table(my_dataframe_5$my_boroughs)))
proportions_my_boroughs
#[1] 0.1764706 0.1764706 0.1764706 0.1176471 0.1764706 0.1764706

str(proportions_my_boroughs)
#num [1:6] 0.176 0.176 0.176 0.118 0.176 ...

## Let's assign the same names as in the table
names(proportions_my_boroughs) <- names(table(my_dataframe_5$my_boroughs))
proportions_my_boroughs
#     brooklyn     manhattan       not_NYC        queens staten_island     the_bronx 
#    0.1764706     0.1764706     0.1764706     0.1176471     0.1764706     0.1764706

## This vector should add up to 1 (i.e., 100%)
sum(proportions_my_boroughs)
#[1] 1

3.6 Lists

36) A list object is also a very flexible data structure in the R environment. You can use it to collect pretty much any of the previous objects (e.g., scalars, vectors, matrices, data frames, functions). To build a list we use the function list().

# We can put a collection of diverse objects in a list
my_list <- list(c(1,2,3), my_dataframe, table_subset) 
my_list
#[[1]]
#[1] 1 2 3
#[[2]]
#  my_numbers     my_text   my_boroughs my_colors
#1          1       apple        queens       red
#2          2        pear     the_bronx    orange
#3          3       grape      brooklyn      blue
#4          4 pomegranate     manhattan     black
#5          5      banana staten_island     green
#[[3]]
#               my_text
#my_boroughs     apple banana grape pear pomegranate potato
# brooklyn          0      0     3    0           0      0
# manhattan         0      0     0    0           3      0
# not_NYC           0      0     0    0           0      2
# queens            2      0     0    0           0      0
# staten_island     0      3     0    0           0      0
# the_bronx         0      0     0    3           0      0

37) You can also create an empty list, which is useful to fill with elements as needed.

my_list_to_fill <- list()
my_list_to_fill
#list()
my_list_to_fill[[1]] <- "one"
my_list_to_fill[[2]] <- 1:10
my_list_to_fill[[3]] <- mean(my_list_to_fill[[2]]) # mean of an element already in the list
my_list_to_fill
#[[1]]
#[1] "one"
#[[2]]
#[1]  1  2  3  4  5  6  7  8  9 10
#[[3]]
#[1] 5.5
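
Lists are often filled inside a loop. When the final number of elements is known, one option is to create a list of that length first and then fill each position. This is only a sketch; n_items and squares_list are hypothetical names.

```r
# Number of elements we plan to store (hypothetical value)
n_items <- 3

# Create a list with n_items empty (NULL) positions
squares_list <- vector("list", n_items)

# Fill each position with the squares of 1 to i
for (i in seq_len(n_items)) {
  squares_list[[i]] <- (1:i)^2
}
squares_list

# Number of values stored in each list element
lengths(squares_list)
```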

38) Names can be assigned to list elements, and elements can then be selected by the corresponding name.

my_list <- list(some_numbers = c(1,2,3), a_data_frame = my_dataframe)
my_list
#$some_numbers
#[1] 1 2 3
#$a_data_frame
#  my_numbers     my_text   my_boroughs my_colors
#1          1       apple        queens       red
#2          2        pear     the_bronx    orange
#3          3       grape      brooklyn      blue
#4          4 pomegranate     manhattan     black
#5          5      banana staten_island     green
my_list[["a_data_frame"]]
#  my_numbers     my_text   my_boroughs my_colors
#1          1       apple        queens       red
#2          2        pear     the_bronx    orange
#3          3       grape      brooklyn      blue
#4          4 pomegranate     manhattan     black
#5          5      banana staten_island     green

39) You can pull elements from a list by using their indices, and even a specific element within a list element.

#first element of list
my_list[[1]]
#[1] 1 2 3
#get the third element of the first list element
my_list[[1]][3]
#[1] 3
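
It helps to remember that single brackets return a (shorter) list, while double brackets and the $ operator return the element itself. The following is a minimal sketch with a hypothetical list (toy_list).

```r
# Hypothetical named list used only for this illustration
toy_list <- list(some_numbers = c(1, 2, 3), some_text = c("a", "b"))

# Single brackets: still a list, now of length 1
single_bracket <- toy_list["some_numbers"]
class(single_bracket)

# Double brackets: the numeric vector stored in that element
double_bracket <- toy_list[["some_numbers"]]
class(double_bracket)

# The $ operator gives the same result as double brackets with a name
dollar_access <- toy_list$some_numbers
identical(double_bracket, dollar_access)
```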

40) Elements can be removed from a list by assigning NULL to the selected element.

my_list
#$some_numbers
#[1] 1 2 3
#$a_data_frame
#  my_numbers     my_text   my_boroughs my_colors
#1          1       apple        queens       red
#2          2        pear     the_bronx    orange
#3          3       grape      brooklyn      blue
#4          4 pomegranate     manhattan     black
#5          5      banana staten_island     green
my_list[[2]] <- NULL
my_list
#$some_numbers
#[1] 1 2 3
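
Elements can also be removed by name. Keep in mind that the remaining elements shift position after a removal. This is a small sketch with a hypothetical list (toy_list).

```r
# Hypothetical named list used only for this illustration
toy_list <- list(a = 1, b = "two", c = 3:5)

# Remove the element named "b"
toy_list[["b"]] <- NULL
names(toy_list)

# The element "c" has moved from position 3 to position 2
toy_list[[2]]
```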

41) You can also flatten a list object (i.e., unlist or transform it) into a vector by using the function unlist().

my_list <- list(some_numbers = c(1,2,3), a_data_frame = my_dataframe)
my_list
#$some_numbers
#[1] 1 2 3
#$a_data_frame
#  my_numbers     my_text   my_boroughs my_colors
#1          1       apple        queens       red
#2          2        pear     the_bronx    orange
#3          3       grape      brooklyn      blue
#4          4 pomegranate     manhattan     black
#5          5      banana staten_island     green
my_flatten_list <- unlist(my_list)
my_flatten_list
#            some_numbers1             some_numbers2             some_numbers3  a_data_frame.my_numbers1  a_data_frame.my_numbers2 
#                      "1"                       "2"                       "3"                       "1"                       "2" 
# a_data_frame.my_numbers3  a_data_frame.my_numbers4  a_data_frame.my_numbers5     a_data_frame.my_text1     a_data_frame.my_text2 
#                     "3"                       "4"                       "5"                   "apple"                    "pear" 
#   a_data_frame.my_text3     a_data_frame.my_text4     a_data_frame.my_text5 a_data_frame.my_boroughs1 a_data_frame.my_boroughs2 
#                 "grape"             "pomegranate"                  "banana"                  "queens"               "the_bronx" 
#a_data_frame.my_boroughs3 a_data_frame.my_boroughs4 a_data_frame.my_boroughs5   a_data_frame.my_colors1   a_data_frame.my_colors2 
#              "brooklyn"               "manhattan"           "staten_island"                     "red"                  "orange" 
# a_data_frame.my_colors3   a_data_frame.my_colors4   a_data_frame.my_colors5 
#                  "blue"                   "black"                   "green" 
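
Notice that the numbers in the output above are quoted: a vector can hold only one type, so unlist() converts everything to character when numbers and text are mixed. The sketch below illustrates this with two hypothetical lists.

```r
# A list with numbers only stays numeric after flattening
numeric_list <- list(first = c(1, 2), second = c(3, 4, 5))
flat_numeric <- unlist(numeric_list)
flat_numeric
class(flat_numeric)

# A list that mixes numbers and text becomes a character vector
mixed_list <- list(first = c(1, 2), second = "three")
flat_mixed <- unlist(mixed_list)
class(flat_mixed)

# use.names = FALSE drops the automatically generated names
flat_unnamed <- unlist(numeric_list, use.names = FALSE)
flat_unnamed
```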

Next, we will overview special data structures relevant for extremely long tables and bioinformatics (nucleotide/amino acid sequences). These data structures are usually developed and managed within specific R-packages.

3.7 Data Tables

42) A data.table is an improved version of a data frame that is usually faster for analyzing and filtering extremely large datasets. To use this data structure, you need to install and load the R-package data.table. For more details, you can see this vignette.

# if you have not installed this R-package yet
install.packages("data.table") 
library(data.table)
input_csv_file <- "https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv"
flights <- fread(input_csv_file)
flights
#        year month day dep_delay arr_delay carrier origin dest air_time distance hour
#     1: 2014     1   1        14        13      AA    JFK  LAX      359     2475    9
#     2: 2014     1   1        -3        13      AA    JFK  LAX      363     2475   11
#     3: 2014     1   1         2         9      AA    JFK  LAX      351     2475   19
#     4: 2014     1   1        -8       -26      AA    LGA  PBI      157     1035    7
#     5: 2014     1   1         2         1      AA    JFK  LAX      350     2475   13
#    ---                                                                              
#253312: 2014    10  31         1       -30      UA    LGA  IAH      201     1416   14
#253313: 2014    10  31        -5       -14      UA    EWR  IAH      189     1400    8
#253314: 2014    10  31        -8        16      MQ    LGA  RDU       83      431   11
#253315: 2014    10  31        -4        15      MQ    LGA  DTW       75      502   11
#253316: 2014    10  31        -5         1      MQ    LGA  SDF      110      659    8
str(flights)
#Classes ‘data.table’ and 'data.frame': 253316 obs. of  11 variables:
#$ year     : int  2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
#$ month    : int  1 1 1 1 1 1 1 1 1 1 ...
#$ day      : int  1 1 1 1 1 1 1 1 1 1 ...
#$ dep_delay: int  14 -3 2 -8 2 4 -2 -3 -1 -2 ...
#$ arr_delay: int  13 13 9 -26 1 0 -18 -14 -17 -14 ...
#$ carrier  : chr  "AA" "AA" "AA" "AA" ...
#$ origin   : chr  "JFK" "JFK" "JFK" "LGA" ...
#$ dest     : chr  "LAX" "LAX" "LAX" "PBI" ...
#$ air_time : int  359 363 351 157 350 339 338 356 161 349 ...
#$ distance : int  2475 2475 2475 1035 2475 2454 2475 2475 1089 2422 ...
#$ hour     : int  9 11 19 7 13 18 21 15 15 18 ...
#- attr(*, ".internal.selfref")=<externalptr> 
class(flights)
#[1] "data.table" "data.frame"
nrow(flights) # get number of rows of data.table
#[1] 253316
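
A data.table is queried with the general form DT[i, j, by]: i selects rows, j selects or computes columns, and by defines groups. The following is a minimal sketch on a tiny, hypothetical table (toy_flights) so that it runs without downloading the full flights file; it assumes the R-package data.table is already installed.

```r
library(data.table)

# Hypothetical toy table with a few of the columns found in 'flights'
toy_flights <- data.table(carrier   = c("AA", "AA", "UA", "UA", "MQ"),
                          origin    = c("JFK", "LGA", "JFK", "EWR", "JFK"),
                          arr_delay = c(13, -26, 9, -14, 16))

# i: keep only the rows for flights leaving from JFK
jfk_flights <- toy_flights[origin == "JFK"]
jfk_flights

# j and by: mean arrival delay for each carrier
mean_delay_by_carrier <- toy_flights[, .(mean_arr_delay = mean(arr_delay)), by = carrier]
mean_delay_by_carrier
```

The same expressions should work on the larger flights object, since it has columns with these names.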

3.8 Tibbles

43) A tibble is also a modified form of a data frame, used in the tidyverse family of R-packages. Tibbles are a simplified form of data frame and are returned by some data management R-packages, e.g., readr, when they parse a file (i.e., import or read its data) into the R environment. I am not a fan of tibbles, as they force the user to orbit around the ‘tidyverse’ for most things that a data frame can handle.

## install or load R-package 'readr' which is a powerful data parser for large text (tab or csv) files
install.packages('readr')
library(readr)
reader_file_path <- readr_example("mtcars.csv") # this is a commonly used dataset of car specifications
my_tibble <- read_delim(file = reader_file_path,
                       delim = ",")
#Rows: 32 Columns: 11                                                                                                                                                     
#── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────
#Delimiter: ","
#dbl (11): mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#
#ℹ Use `spec()` to retrieve the full column specification for this data.
#ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## use function spec() to extract the full column specification from a tibble created by readr.

spec(my_tibble)
#cols(
#  mpg = col_double(),
#  cyl = col_double(),
#  disp = col_double(),
#  hp = col_double(),
#  drat = col_double(),
#  wt = col_double(),
#  qsec = col_double(),
#  vs = col_double(),
#  am = col_double(),
#  gear = col_double(),
#  carb = col_double()
#)

my_tibble
# A tibble: 32 x 11
#     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
# 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
# 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
# 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
# 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
# 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
# 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
# 8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
# 9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
# … with 22 more rows
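
If you prefer to keep working with a regular data frame, a tibble can be converted with as.data.frame(). This is a small sketch; it assumes the R-package tibble is available (it is installed together with readr), and toy_tibble is a hypothetical object built from the built-in mtcars rather than from the file above.

```r
library(tibble)

# Build a tibble from the built-in mtcars (row names are dropped by default)
toy_tibble <- as_tibble(mtcars)
class(toy_tibble)

# Convert the tibble back to a plain data frame
toy_dataframe <- as.data.frame(toy_tibble)
class(toy_dataframe)
```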

3.9 DNAStringSets, RNAStringSets and AAStringSets

44) An XStringSet object is a special and flexible container, similar to a data frame, that holds sequences of different flavors: DNAStringSet, RNAStringSet or AAStringSet. To use this data structure, you need to install and load the R-package Biostrings. For more details, you can see the associated vignettes on that R-package site. We will return to this object when we explore nucleotide and amino acid sequence manipulation.

For this introduction, we also need to install another powerful R-package, rentrez. This package provides an R interface to NCBI's EUtils API, allowing users to search databases like GenBank and PubMed, process the results of those searches and pull data into their R sessions.

# if you need to install R-packages 'Biostrings' and 'rentrez'
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# remember that you might not need to update all other packages (type n)

BiocManager::install("Biostrings") 

# if you need to install 'rentrez'

install.packages("rentrez")

# load these libraries

library(Biostrings)
library(rentrez)

For this example, we will download from NCBI some nucleotide sequences of a charismatic frog, Allobates kingsburyi.

froggy_name <- "Allobates kingsburyi[Organism]"
froggy_seq_IDs <- entrez_search(db="nuccore", term=froggy_name)
# inspecting the structure of 'froggy_seq_IDs' shows that there are 17 sequences in the NCBI nuccore database
str(froggy_seq_IDs)
#List of 5
#$ ids             : chr [1:17] "1845966712" "1248341807" "328728168" "328728030" ...
#$ count           : int 17
#$ retmax          : int 17
#$ QueryTranslation: chr "\"Allobates kingsburyi\"[Organism]"
#$ file            :Classes 'XMLInternalDocument', 'XMLAbstractDocument' <externalptr> 
#- attr(*, "class")= chr [1:2] "esearch" "list"
froggy_seqs_fasta <- entrez_fetch(db="nuccore", id=froggy_seq_IDs$ids, rettype="fasta")
froggy_seqs_fasta
#[1] ">MT524123.1 Allobates kingsburyi voucher QCAZA68477 large subunit ribosomal RNA gene, partial sequence; 
#mitochondrial\nCCTGATTAACCATAAGAGGTCAAGCCTGCCCAGTGACATTTGTTTAACGGCCGCGGTATCCTAACCGTGC\nGAAGGTAGCGTAATCACTTGTCCTT
#TAAATGAGGACTAGTATGAACGGCTTCACGAAGGCTATGCTGTCT\nCCTTTATCTAATCAGTTAAACTAATCTCCCCGTGAAGAAGCGGGGATACACCTATAAGACGAGAA
#...
cat(froggy_seqs_fasta)
#>MT524123.1 Allobates kingsburyi voucher QCAZA68477 large subunit ribosomal RNA gene, partial sequence; mitochondrial
#CCTGATTAACCATAAGAGGTCAAGCCTGCCCAGTGACATTTGTTTAACGGCCGCGGTATCCTAACCGTGC
#GAAGGTAGCGTAATCACTTGTCCTTTAAATGAGGACTAGTATGAACGGCTTCACGAAGGCTATGCTGTCT
#CCTTTATCTAATCAGTTAAACTAATCTCCCCGTGAAGAAGCGGGGATACACCTATAAGACGAGAAGACCC
#TATGGAGCTTTAAATACTTTAAAACACCTGAATCTGACACTAGAAACTTCCAGAAAACTTTATTTAACAT
#ATCACTTTGTTTTAAACTTTAGGTTGGGGTGACCACGGAGAAAAAACCAACCTCCACGTAGAATGAAATT
#TTCTTTCTAAGCGATAAGCTACATCTTTATGCATCAATACATTGACCTAAATTGACCCAATTTTTTGATC
#AACGAAC
#
#>MF580102.1 Allobates kingsburyi nicotinic acetylcholine receptor beta-2 (chrnb2) gene, partial cds
#ATGACGGTTCTCCTCCTCCTCCTGCACCTCAGCCTGTTCGGCCTGGTCACCAGGAGTATGGGCACGGACA
#CCGAGGAGCGGCTCGTGGAATTCCTGCTGGACCCGTCCCAGTACAACAAGCTGATCCGGCCCGCCACCAA
#TGGATCCGAGCAGGTCACCGTCCAGCTGATGGTATCTCTGGCCCAGCTGATCAGCGTGCACGAGCGGGAG
#...

We can save these sequences in a text file in your working directory.

# this path is specific to your OWN COMPUTER; change it accordingly
setwd("~/Desktop/Teach_R/my_working_directory")
write(froggy_seqs_fasta, "my_froggy_seqs_fasta.txt")
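
A quick way to check a saved fasta file is to count its header lines, i.e., the lines that start with ">". The following is a minimal sketch in base R; the sequences in toy_fasta are hypothetical, and a temporary file is used so that nothing is written to your working directory.

```r
# Hypothetical fasta text with two records (not real frog sequences)
toy_fasta <- ">seq_1 hypothetical record\nACGTACGT\nTTGA\n>seq_2 hypothetical record\nGGCCAATT\n"

# Write the text to a temporary file
toy_path <- tempfile(fileext = ".txt")
write(toy_fasta, toy_path)

# Read the file back line by line and keep only the header lines
toy_lines <- readLines(toy_path)
header_lines <- toy_lines[grepl("^>", toy_lines)]
header_lines

# Number of sequences stored in the file
length(header_lines)
```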

Now we can import into R the nucleotide sequences from this file, which are in fasta format.

my_Biostrings_set <- readDNAStringSet(filepath = "~/Desktop/Teach_R/my_working_directory/my_froggy_seqs_fasta.txt", 
                                         format = "fasta")
my_Biostrings_set
#A DNAStringSet instance of length 17
#    width seq                                                                                                      names               
# [1]   427 CCTGATTAACCATAAGAGGTCAAGCCTGCCCAGTGACATTTGTTTAACGGC...TATGCATCAATACATTGACCTAAATTGACCCAATTTTTTGATCAACGAAC MT524123.1 Alloba...
# [2]   899 ATGACGGTTCTCCTCCTCCTCCTGCACCTCAGCCTGTTCGGCCTGGTCACC...CCCCCGACGTCCCTGGACGTCCCGCTCGTCGGCAAGTACCTGATGTTCAC MF580102.1 Alloba...
# [3]  4881 AAGGTTTGGTCCTAGCCTTGAAGTCAGTTACTAATTAATATACACATGCAA...CCTTTGTTTACTTCCTATCTCCCCATCCCTTCTCTGCCTGCTCAGAAACT HQ290963.1 Alloba...
# [4]   576 GTACATCATAATGTGAGCAGATGGGAAAGCTTTGATGTCACACCAGCTATT...TCTACCTTGATGAAAATGAAAAAGTTGTTTTGAAAAACTATCAAGACATG HQ291024.1 Alloba...
# [5]   510 AACTCCCCTTCAGGTTCACAATTTCCCTTCAGCGGCATTGACGACCGGGAA...TGCTTCAATGGGAGCATGAAATTCAGAAGCTCACGGGTGACGAGAACTTC HQ290901.1 Alloba...
#...   ... ...
#[13]  2389 AGGCTTGGTCCTAACCTTGAAGTCAGTTACTAATTAATATACACATGCAAG...CGACCTCGATGTTGGATCAGGATGTCCCAGTGGTGCAGCAGCTACTAATG EU342528.1 Alloba...
#[14]  2392 TAAAGGTTTGGTCCTAGCCTTGAAGTCAGTTACTAATTAATATACACATGC...CGACCTCGATGTTGGATCAGGATGTCCCAGTGGTGCAGCAGCTACTAATG EU342527.1 Alloba...
#[15]  2393 TTAAAGGTTTGGTCCTAGCCTTGAAGTCAGTTACTAATTAATATACACATG...CGACCTCGATGTTGGATCAGGATGTCCCAGTGGTGCAGCAGCTACTAATG EU342526.1 Alloba...
#[16]  2457 AAAGTTCTCCAACATAAAGGCTTGGTCCTAACCTTGAAGTCAGTTACTAAT...GTTCGTTTGTTCAACGATTAAAATCCTACGTGATCTGAGTTCAGACCGGA AY364550.1 Colost...
#[17]  2446 ATTAAAGGTTTGGTCCTAGCCTTGAAGTCAGTTACTAATTAATATACACAT...TTCGTTTGTTCAACGATTAAAATCCTACGTGATCTGAGTTCAAGACCGGA AY364549.1 Colost...
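
A DNAStringSet can also be built directly from a named character vector, which is handy for trying out functions on short sequences. This is only a sketch: the sequences in toy_dna are hypothetical, and it assumes the R-package Biostrings is installed and loaded as shown above.

```r
library(Biostrings)

# Hypothetical short sequences; the vector names become the sequence names
toy_dna <- DNAStringSet(c(seq_1 = "ACGTACGTTTGA", seq_2 = "GGCCAATT"))
toy_dna

# Length of each sequence
width(toy_dna)

# Names of the sequences
names(toy_dna)

# Reverse complement of every sequence in the set
toy_revcomp <- reverseComplement(toy_dna)
as.character(toy_revcomp)

# Fraction of G or C in each sequence
gc_fraction <- letterFrequency(toy_dna, letters = "GC", as.prob = TRUE)
gc_fraction
```

These functions should work in the same way on my_Biostrings_set.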

Next, we will download some amino acid sequences from NCBI from the same frog species, Allobates kingsburyi.

froggy_name <- "Allobates kingsburyi[Organism]"
froggy_AA_IDs <- entrez_search(db="protein", term=froggy_name)
# Entrez search result with 11 hits (object contains 11 IDs and no web_history object)
# Search term (as translated):  "Allobates kingsburyi"[Organism] 
str(froggy_AA_IDs)
#List of 5
#$ ids             : chr [1:11] "1248341808" "328728170" "328728169" "328728031" ...
#$ count           : int 11
#$ retmax          : int 11
#$ QueryTranslation: chr "\"Allobates kingsburyi\"[Organism]"
#$ file            :Classes 'XMLInternalDocument', 'XMLAbstractDocument' <externalptr> 
#- attr(*, "class")= chr [1:2] "esearch" "list"
froggy_AA_fasta <- entrez_fetch(db="protein", id=froggy_AA_IDs$ids, rettype="fasta")
froggy_AA_fasta
#[1] ">ATG31804.1 nicotinic acetylcholine receptor beta-2, partial [Allobates kingsburyi]\nMTVLLLLLHLSLFGLV
#TRSMGTDTEERLVEFLLDPSQYNKLIRPATNGSEQVTVQLMVSLAQLISVHERE\nQIMTTNVWLTQEWXXXXXXXXXXXXXXXXXXXXXXXXXWLPDVVLYNNADGMY
#EVSFYSNAVVSHDGSIF\nWLPPAIYKSACKIEVKHFPFDQQNCTMKFRSWTYDRTELDLVLKSDVASLDDFTPSGEWDIIALPGRRNE\nNPEDSTYVDITYDFIIRRKPL
#...
cat(froggy_AA_fasta)
#>ATG31804.1 nicotinic acetylcholine receptor beta-2, partial [Allobates kingsburyi]
#MTVLLLLLHLSLFGLVTRSMGTDTEERLVEFLLDPSQYNKLIRPATNGSEQVTVQLMVSLAQLISVHERE
#QIMTTNVWLTQEWXXXXXXXXXXXXXXXXXXXXXXXXXWLPDVVLYNNADGMYEVSFYSNAVVSHDGSIF
#WLPPAIYKSACKIEVKHFPFDQQNCTMKFRSWTYDRTELDLVLKSDVASLDDFTPSGEWDIIALPGRRNE
#NPEDSTYVDITYDFIIRRKPLFYTINLIIPCILITSLAILVFYLPSDCGEKMTLCISVLLALTVFLLLIS
#KIVPPTSLDVPLVGKYLMFT
#
#>AEB39272.1 NADH dehydrogenase subunit 2 (mitochondrion) [Allobates kingsburyi]
#MNPYALFLIISSLALGTSIAVSSFHWILAWIGLEINTLAIIPLMTKNPHPRSIEAATKYFLTQAAASSLI
#LFSCALNAWLLGEWTINNLMSPASMIFLSIALSTKLGLAPFHFWLPEVLQGLTLQTGWILSTWQKLAPLA
#ILFQLSQSINLLLMMSMGLLSILVGGWGGINQNQIRKILAFSSIAHLGWMITILKISPQLSLLNFILYII
#MTSALFYTFIMIDSTNISHLATTWTKIPTLTALSLMSLLSLSGLPPLTGFLPKWLIAQELINQNLIILPF
#LMLMLTLLALFFYLRLTYTISLTMAPNSTSSVSLWYQKKKNNLTIFILLTLCLLPISPSLLCLL

We can save these AA sequences in a text file in your working directory.

# this path is specific to your OWN COMPUTER; change it accordingly
setwd("~/Desktop/Teach_R/my_working_directory")
write(froggy_AA_fasta, "my_froggy_AA_fasta.txt")

Now we can import the amino acid sequences from this file, which are in fasta format.

my_AA_Biostrings_set <- readAAStringSet(filepath = "~/Desktop/Teach_R/my_working_directory/my_froggy_AA_fasta.txt", 
                                         format = "fasta")
my_AA_Biostrings_set
#A  AAStringSet instance of length 11
#      width seq                                                                                                      names               
# [1]   300 MTVLLLLLHLSLFGLVTRSMGTDTEERLVEFLLDPSQYNKLIRPATNGSEQ...VFYLPSDCGEKMTLCISVLLALTVFLLLISKIVPPTSLDVPLVGKYLMFT ATG31804.1 nicoti...
# [2]   344 MNPYALFLIISSLALGTSIAVSSFHWILAWIGLEINTLAIIPLMTKNPHPR...RLTYTISLTMAPNSTSSVSLWYQKKKNNLTIFILLTLCLLPISPSLLCLL AEB39272.1 NADH d...
# [3]   320 LNIFTLTQSLCYMVPILLAVAFLTLLERKVLGYMQHRKGPNVIGPTGLLQP...LFLWVRASYPRFRYDQLMHLVWKNFLPMTLALTIWFITFPIIFLFSPPIL AEB39271.1 NADH d...
# [4]   192 VHHNVSRWESFDVTPAIIRWIAHRQPNHGFVVEVTQLDCEKNVTKRHVRIS...STNHAIVQTLVNSVNSNIPKACCVPTELSAISMLYLDENEKVVLKNYQDM AEB39195.1 bone m...
# [5]   170 NSPSGSQFPFSGIDDRENWPIVFYNRTCQCQGNFMGYNCGDCKFXFTGXNC...YASRDAFLEGDLVWQNIDFAHEAPAFLPWHRFFLLQWEHEIQKLTGDENF AEB39135.1 tyrosi...
# ...   ... ...
# [7]   192 TTMDKRNLPESSMNSLFIKLMQADLLKNKIPKQVVNAKEIKQQSTIPKAEI...VTNKSNAIDIRGHQVAVLGEIKTGNSPVKQYFYETRCKDARPVKSGCRGI AEB39015.1 neurot...
# [8]   414 TIKKPNGETTKTTVRIWNETVSNLTLMALGSSAPEILLSVIEVCGHNFQAG...GIIDDDIFEEDENFLVHLSNVRVNAETTEVNFESNHVTSLACLGSPSTAT AEB38955.1 sodium...
# [9]   309 CIGLISVNGRMRNNMKAGSSPNSVSSSPTNSAITQLRHKLENGKPLGMNES...PIPLHQHERYLCKMNEEIKAVLQPSENLILNKQGMFAEKQALLLSSVLSE AEB38895.1 zinc f...
#[10]   201 VRGQSGLAYPGLRTHGTLESIGGPMSSSRGGGLPSLTDTFEHVIEELLEEE...QLKQYFYETKCNPMGYMKEGCRGIDKRYWNSQCRTTQSYVRALTMDSKKK AEB38835.1 brain-...
#[11]   230 GLCLIAQIITGLFLAMHYTADTTMAFSSIAHICRDVNNGWLLRSLHANGAS...VPFHAYFSYKDALGFIILLVLLSLLSLFSPNLLGDPDNFTPANPLVTPPH AEB33649.1 cytoch...