Session 3 – Basic Data Structures in R
Managing data in R is at the core of its use in bioinformatics analyses. You can enter data by typing them directly into the R console, but for most practical purposes you will import a dataset from a file on your computer. Please refer to section 2.6 Importing and exporting data for specific examples. In this chapter we will work with both typed data and an imported dataset; download it to your computer from the LINK TO FILES of BIO/BIT 209, as indicated in section 2.5 Downloading data files (GitHub).
# Here is an exemplar dataset to be used here:
setwd("~/Desktop/Teach_R/class_datasets")
my_imported_dataset <- read.table(file = "~/Desktop/Teach_R/class_datasets/mtcars2_file_tab.txt",
                                  header = TRUE,
                                  sep = "\t",
                                  stringsAsFactors = FALSE)
## This dataset will be imported as a data.frame
str(my_imported_dataset)
#'data.frame': 32 obs. of 12 variables:
# $ cars: chr "Mazda_RX4" "Mazda_RX4_Wag" "Datsun_710" "Hornet_4_Drive" ...
# $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
# $ cyl : int 6 6 4 6 8 6 8 4 4 6 ...
# $ disp: num 160 160 108 258 360 ...
# $ hp : int 110 110 93 110 175 105 245 62 95 123 ...
# $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
# $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
# $ qsec: num 16.5 17 18.6 19.4 17 ...
# $ vs : int 0 0 1 1 0 1 0 1 1 1 ...
# $ am : int 1 1 1 0 0 0 0 0 0 0 ...
# $ gear: int 4 4 4 3 3 3 3 4 4 4 ...
# $ carb: int 4 4 1 1 2 1 4 2 2 4 ...
head(my_imported_dataset)
# cars mpg cyl disp hp drat wt qsec vs am gear carb
#1 Mazda_RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
#2 Mazda_RX4_Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
#3 Datsun_710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
#4 Hornet_4_Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#5 Hornet_Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
#6 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Note: You are strongly encouraged to import your project dataset(s) into R to explore and transform them into the different types of data structures below.
3.1 Scalars and vectors
We introduced vectors and scalars in the section “Your First R Session”. Here we describe vectors in more detail and continue exploring these structures and the functions that apply to them.
1) A numeric vector contains numbers. Notice the function c(), which combines such values into a vector.
Here is a typed example:
my_numeric_vector <- c(1,3,45,56,1)
my_numeric_vector
#[1] 1 3 45 56 1
Here is a vector derived from my_imported_dataset:
# Notice that to get the data of one column you add '$' followed by the name of the column
my_numeric_car_displacement <- my_imported_dataset$disp
my_numeric_car_displacement
#[1] 160.0 160.0 108.0 258.0 360.0 225.0 360.0 146.7 140.8 167.6 167.6 275.8 275.8 275.8 472.0 460.0 440.0 78.7 75.7 71.1 120.1
#[22] 318.0 304.0 350.0 400.0 79.0 120.3 95.1 351.0 145.0 301.0 121.0
The function str() can help to identify the structure of a vector. Notice the num text in the output, which indicates that this is a vector of numeric values.
str(my_numeric_vector)
#num [1:5] 1 3 45 56 1
str(my_numeric_car_displacement)
#num [1:32] 160 160 108 258 360 ...
Likewise, the function class() is also useful to characterize a vector.
class(my_numeric_vector)
#[1] "numeric"
class(my_numeric_car_displacement)
#[1] "numeric"
2) A character vector contains text strings. Note the quotation marks that enclose each string.
Here is a typed example:
my_character_vector <- c("my", "bioinformatics", "class")
my_character_vector
#[1] "my" "bioinformatics" "class"
str(my_character_vector)
#chr [1:3] "my" "bioinformatics" "class"
class(my_character_vector)
#[1] "character"
Here is a vector derived from my_imported_dataset:
my_character_car_names <- my_imported_dataset$cars
my_character_car_names
#[1] "Mazda_RX4" "Mazda_RX4_Wag" "Datsun_710" "Hornet_4_Drive" "Hornet_Sportabout"
# [6] "Valiant" "Duster_360" "Merc_240D" "Merc_230" "Merc_280"
#[11] "Merc_280C" "Merc_450SE" "Merc_450SL" "Merc_450SLC" "Cadillac_Fleetwood"
#[16] "Lincoln_Continental" "Chrysler_Imperial" "Fiat_128" "Honda_Civic" "Toyota_Corolla"
#[21] "Toyota_Corona" "Dodge_Challenger" "AMC_Javelin" "Camaro_Z28" "Pontiac_Firebird"
#[26] "Fiat_X1_9" "Porsche_914_2" "Lotus_Europa" "Ford_Pantera_L" "Ferrari_Dino"
#[31] "Maserati_Bora" "Volvo_142E"
str(my_character_car_names)
#chr [1:32] "Mazda_RX4" "Mazda_RX4_Wag" "Datsun_710" "Hornet_4_Drive" "Hornet_Sportabout" "Valiant" "Duster_360" "Merc_240D" ...
class(my_character_car_names)
#[1] "character"
3) Notice that if you combine numbers with character elements, R coerces the numbers to characters, because all elements of a vector must be of the same type.
my_mix_vector <- c("my", "bioinformatics", "class", 3.141593)
my_mix_vector
#[1] "my" "bioinformatics" "class" "3.141593"
str(my_mix_vector)
#chr [1:4] "my" "bioinformatics" "class" "3.141593"
class(my_mix_vector)
#[1] "character"
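The coercion described above follows a fixed order (logical, then numeric, then character). Here is a minimal sketch of that behavior; the object names and values are hypothetical and are not part of the class dataset.

```r
# Hypothetical values used only to illustrate coercion
mixed <- c(1.5, TRUE, "gene_A")      # numeric + logical + character
class(mixed)                         # every element becomes character

num_and_lgl <- c(1.5, TRUE, FALSE)   # numeric + logical
num_and_lgl                          # TRUE and FALSE become 1 and 0

# Converting text back to numbers: text that is not a number becomes NA
digits_as_text <- c("10", "2.5", "abc")
recovered <- suppressWarnings(as.numeric(digits_as_text))
recovered
```

If a column that should be numeric is imported as character, checking it with class() and converting it with as.numeric() is usually the first thing to try.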
4) A factor vector is similar to a character vector, but each unique element of the vector is assigned a level. To create one, we use the function factor(). Factor vectors are used in statistical analyses where discrete groups are defined by levels.
Here is a typed example:
my_factor_vector <- c("white", "black", "white", "white", "black")
my_factor_vector <- factor(my_factor_vector)
my_factor_vector
#[1] white black white white black
#Levels: black white
str(my_factor_vector)
#Factor w/ 2 levels "black","white": 2 1 2 2 1
class(my_factor_vector)
#[1] "factor"
You can also convert any numeric vector to a factor vector.
my_factor_vector <- c(1,0,1,1,1,0)
my_factor_vector <- factor(my_factor_vector)
my_factor_vector
#[1] 1 0 1 1 1 0
#Levels: 0 1
str(my_factor_vector)
#Factor w/ 2 levels "0","1": 2 1 2 2 2 1
class(my_factor_vector)
#[1] "factor"
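One detail worth knowing when you convert numbers to factors is that a factor stores integer codes for its levels, not the original numbers. Here is a small sketch with hypothetical dose values showing how to recover the original numbers safely.

```r
# Hypothetical dose values, not from the class dataset
dose_factor <- factor(c(10, 50, 10, 100))
levels(dose_factor)                    # the levels are stored as text

# as.integer() returns the level codes, not the doses
dose_codes <- as.integer(dose_factor)
dose_codes

# To get the original numbers back, convert to character first
dose_values <- as.numeric(as.character(dose_factor))
dose_values

# You can also set the order of the levels yourself
size_factor <- factor(c("low", "high", "low"), levels = c("low", "high"))
levels(size_factor)
```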
Here is a factor vector derived from my_imported_dataset and then appended to this same dataset:
my_vs_vector <- my_imported_dataset$vs
my_vs_vector
#[1] 0 0 1 1 0 1 0 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 1 0 0 0 1
str(my_vs_vector)
#int [1:32] 0 0 1 1 0 1 0 1 1 1 ...
## This is an integer vector (like a numeric vector); we can transform it into a factor vector
my_vs_vector_as_factor <- factor(my_vs_vector)
my_vs_vector_as_factor
#[1] 0 0 1 1 0 1 0 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 1 0 0 0 1
#Levels: 0 1
str(my_vs_vector_as_factor)
#Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
## We can add this factor vector to our data.frame
my_imported_dataset$vs_factor <- my_vs_vector_as_factor
head(my_imported_dataset)
# cars mpg cyl disp hp drat wt qsec vs am gear carb vs_factor
#1 Mazda_RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 0
#2 Mazda_RX4_Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 0
#3 Datsun_710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 1
#4 Hornet_4_Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 1
#5 Hornet_Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 0
#6 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 1
str(my_imported_dataset)
#'data.frame': 32 obs. of 13 variables:
# $ cars : chr "Mazda_RX4" "Mazda_RX4_Wag" "Datsun_710" "Hornet_4_Drive" ...
# $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
# $ cyl : int 6 6 4 6 8 6 8 4 4 6 ...
# $ disp : num 160 160 108 258 360 ...
# $ hp : int 110 110 93 110 175 105 245 62 95 123 ...
# $ drat : num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
# $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
# $ qsec : num 16.5 17 18.6 19.4 17 ...
# $ vs : int 0 0 1 1 0 1 0 1 1 1 ...
# $ am : int 1 1 1 0 0 0 0 0 0 0 ...
# $ gear : int 4 4 4 3 3 3 3 4 4 4 ...
# $ carb : int 4 4 1 1 2 1 4 2 2 4 ...
# $ vs_factor: Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
5) A logical (Boolean) vector contains TRUE or FALSE values. These vectors and scalars are hugely important in any process that requires control flow, that is, a set of functions and calculations in which alternative steps must be chosen. In other words, if the evaluation of a logical test is TRUE, then do one calculation; if it is FALSE, do another.
# a numeric vector with numbers from 1 to 10
my_numeric_vector <- 1:10
my_numeric_vector
#[1] 1 2 3 4 5 6 7 8 9 10
Then, we test whether each element of this vector is more than 5 using the function ifelse(). This function is very useful and it has three components: the first is a logical test, here x > 5 (i.e., TRUE if x is more than 5, otherwise FALSE); the second is the value returned when the test is TRUE (in this case the logical value TRUE); and the third is the value returned when the test is FALSE (in this case the logical value FALSE).
my_logical_vector <- ifelse(my_numeric_vector > 5, TRUE, FALSE)
my_logical_vector
# [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
str(my_logical_vector)
#logi [1:10] FALSE FALSE FALSE FALSE FALSE TRUE ...
class(my_logical_vector)
#[1] "logical"
Here is an alternative way to get the same logical vector: the comparison itself already returns TRUE or FALSE values.
my_logical_vector <- my_numeric_vector > 5
my_logical_vector
#[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
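The control flow mentioned above can be sketched with a single logical value in an if/else statement, and with ifelse() when the test is applied to a whole vector. The expression values and the cutoff of 5 below are hypothetical.

```r
# Hypothetical expression value for one gene
expression_level <- 7.2

# if/else works on ONE logical value (a scalar)
if (expression_level > 5) {
  status <- "high"
} else {
  status <- "low"
}
status

# ifelse() applies the same test to every element of a vector
expression_levels <- c(2.1, 7.2, 5.0)
labels <- ifelse(expression_levels > 5, "high", "low")
labels
```

Note that ifelse() can return any pair of values (here text labels), not only TRUE and FALSE.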
Here is a logical vector derived from my_imported_dataset and then appended to this same dataset:
my_am_vector <- my_imported_dataset$am
my_am_vector
#[1] 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1
str(my_am_vector)
#int [1:32] 1 1 1 0 0 0 0 0 0 0 ...
## This is an integer vector (like a numeric vector); we can transform it into a logical vector with an ifelse() test
my_logical_am_vector <- ifelse(my_am_vector == 1, TRUE, FALSE)
my_logical_am_vector
#[1] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE
#[22] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
str(my_logical_am_vector)
#logi [1:32] TRUE TRUE TRUE FALSE FALSE FALSE ...
## We can add this logical vector to our data.frame
my_imported_dataset$am_logical <- my_logical_am_vector
head(my_imported_dataset)
# cars mpg cyl disp hp drat wt qsec vs am gear carb vs_factor am_logical
#1 Mazda_RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 0 TRUE
#2 Mazda_RX4_Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 0 TRUE
#3 Datsun_710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 1 TRUE
#4 Hornet_4_Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 1 FALSE
#5 Hornet_Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 0 FALSE
#6 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 1 FALSE
str(my_imported_dataset)
#'data.frame': 32 obs. of 14 variables:
#$ cars : chr "Mazda_RX4" "Mazda_RX4_Wag" "Datsun_710" "Hornet_4_Drive" ...
#$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#$ cyl : int 6 6 4 6 8 6 8 4 4 6 ...
#$ disp : num 160 160 108 258 360 ...
#$ hp : int 110 110 93 110 175 105 245 62 95 123 ...
#$ drat : num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
#$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
#$ qsec : num 16.5 17 18.6 19.4 17 ...
#$ vs : int 0 0 1 1 0 1 0 1 1 1 ...
#$ am : int 1 1 1 0 0 0 0 0 0 0 ...
#$ gear : int 4 4 4 3 3 3 3 4 4 4 ...
#$ carb : int 4 4 1 1 2 1 4 2 2 4 ...
#$ vs_factor : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
#$ am_logical: logi TRUE TRUE TRUE FALSE FALSE FALSE ...
6) There are also special values that are useful, but they can also be confusing. NULL represents the null (empty) object in R. It can exist on its own, but it cannot be stored as an element of a vector: it is dropped when combined with other elements.
my_numeric_vector <- c(1,2,3,4, NULL)
my_numeric_vector
#[1] 1 2 3 4
An NA element represents a missing value in R. Unlike NULL, this element can be kept in a vector and in other R objects.
my_numeric_vector <- c(1,2,3,4, NA)
my_numeric_vector
#[1] 1 2 3 4 NA
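Because NA elements stay in the vector, they affect calculations. Here is a minimal sketch, with hypothetical values, of how to detect missing values with is.na() and how to skip them with the argument na.rm.

```r
# Hypothetical measurements with one missing value
values <- c(1, 2, NA, 4)

# NULL is dropped from a vector, NA is kept
length(c(1, 2, NULL))
length(values)

# is.na() returns a logical vector marking the missing values
is.na(values)
n_missing <- sum(is.na(values))
n_missing

# Calculations with NA return NA unless you ask R to remove them
mean_with_na <- mean(values)
mean_without_na <- mean(values, na.rm = TRUE)
mean_with_na
mean_without_na
```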
7) We can compare vectors with each other, or a vector against a scalar (i.e., an atomic quantity or object that holds only one value at a time), using different logical operators. The result is a logical vector containing TRUE or FALSE values (also known as Boolean values).
Note: Boolean values can serve as switches (ON/OFF) in conditional statements.
my_numbers <- 1:10
my_numbers
#[1] 1 2 3 4 5 6 7 8 9 10
## Here is an example using our imported dataset
my_imported_dataset$gear
#[1] 4 4 4 3 3 3 3 4 4 4 4 3 3 3 3 3 3 4 4 4 3 3 3 3 3 4 5 5 5 5 5 4
The operator to test for equality, ==, determines whether the values in the vector are equal to some value.
my_numbers == 2
#[1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## Here is an example using our imported dataset
my_imported_dataset$gear == 3
#[1] FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE
#[22] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
The operator to test for inequality is !=. Notice the !, which can be used with many logical functions to negate them (i.e., to test for the opposite of what the function would return as TRUE).
my_numbers != 2
#[1] TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## Here is an example using our imported dataset
my_imported_dataset$gear != 3
#[1] TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE
#[22] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
The operator to test for less than is <.
my_numbers < 2
#[1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## Here is an example using our imported dataset
my_imported_dataset$gear < 4
#[1] FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE
#[22] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
The operator to test for less than or equal to is <=.
my_numbers <= 2
#[1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## Here is an example using our imported dataset
my_imported_dataset$gear <= 4
#[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#[22] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE
The operator to test for more than is >.
my_numbers > 2
#[1] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## Here is an example using our imported dataset
my_imported_dataset$gear > 4
#[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#[22] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE
The operator to test for more than or equal to is >=.
my_numbers >= 2
#[1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## Here is an example using our imported dataset
my_imported_dataset$gear >= 4
#[1] TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE
#[22] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
You can also compare two vectors element by element, as long as they have the same number of elements.
a <- c(2,10,4) # three elements
b <- 2:4 # three elements
a == b
# [1] TRUE FALSE TRUE
Here is a comparison from vectors derived from our imported dataset:
## we can test whether the vs and am values are the same; the test is done one pair of elements at a time
my_imported_dataset$vs == my_imported_dataset$am
#[1] FALSE FALSE TRUE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
#[22] TRUE TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE FALSE TRUE
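Comparisons like the ones above are often combined or counted. Here is a short sketch using the operators & (and) and | (or) together with sum(), which(), any() and all(). The two vectors are hypothetical and only imitate the gear and mpg columns.

```r
# Hypothetical values that imitate two columns of a car dataset
gears <- c(4, 4, 3, 5, 3, 5)
mpg   <- c(21.0, 22.8, 18.7, 26.0, 14.3, 15.8)

# & is TRUE only where BOTH tests are TRUE
efficient_high_gear <- gears >= 4 & mpg > 20
efficient_high_gear

# | is TRUE where AT LEAST ONE test is TRUE
three_or_five <- gears == 3 | gears == 5
three_or_five

# TRUE counts as 1, so sum() counts how many elements pass the test
n_pass <- sum(efficient_high_gear)
# which() returns the positions of the TRUE elements
pass_positions <- which(efficient_high_gear)

any(gears == 5)   # is at least one element TRUE?
all(mpg > 10)     # are all elements TRUE?
```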
8) You can select specific elements of a vector by using their index, that is, their position in the vector.
# sequence of numbers between 1 and 10 every 2 numbers
my_vector <- seq(1,10,2)
my_vector
#[1] 1 3 5 7 9
If you want the third element of ‘my_vector’, then you use [3].
element_3 <- my_vector[3]
element_3
#[1] 5
9) If you want to delete an element using its index, then use a minus - before the index that corresponds to the element to remove.
my_vector_without_element_3 <- my_vector[-3]
my_vector_without_element_3
#[1] 1 3 7 9
10) We can also use a vector of indices to select multiple elements within a given vector (e.g., my_vector).
my_vector[c(1,2,5)] # select elements 1, 2 and 5
#[1] 1 3 9
11) We can also use a logical operator to select the elements that meet a condition.
# select elements that are more than 3
my_vector[my_vector > 3]
#[1] 5 7 9
12) Some other examples: select even numbers, or numbers divisible by 3, from a vector.
my_numbers
#[1] 1 2 3 4 5 6 7 8 9 10
# Select even numbers
my_numbers[my_numbers %% 2 == 0]
#[1] 2 4 6 8 10
# Select numbers divisible by 3
my_numbers[my_numbers %% 3 == 0]
#[1] 3 6 9
13) We can test whether elements match a set of terms using the operator %in%, which returns TRUE for each element of the left operand that occurs in the right operand.
my_names <- c("juan", "c", "santos")
name_key <- c("peter", "juan", "randy", "david", "leeann")
name_key %in% my_names
#[1] FALSE TRUE FALSE FALSE FALSE
name_key[name_key %in% my_names]
#[1] "juan"
If you want the opposite (i.e., return those that do not match the key terms), then add a ! (as we did above).
name_key[!name_key %in% my_names]
#[1] "peter" "randy" "david" "leeann"
14) Several standard arithmetic calculations can also be done with numeric vectors, including the following.
my_numbers <- 1:10
my_numbers
#[1] 1 2 3 4 5 6 7 8 9 10
#Addition
my_numbers + 1
#[1] 2 3 4 5 6 7 8 9 10 11
#Subtraction
my_numbers - 1
#[1] 0 1 2 3 4 5 6 7 8 9
#Multiplication
my_numbers * 2
#[1] 2 4 6 8 10 12 14 16 18 20
#Division
my_numbers / 3
#[1] 0.3333333 0.6666667 1.0000000 1.3333333 1.6666667 2.0000000 2.3333333 2.6666667 3.0000000 3.3333333
#Exponentiation
my_numbers ^ 2
#[1] 1 4 9 16 25 36 49 64 81 100
#Other functions – I did not add the corresponding results
log(my_numbers)
sqrt(my_numbers)
sin(my_numbers)
15) You can append elements to your vector.
# append 11 to vector
my_numbers <- c(my_numbers,11)
my_numbers
#[1] 1 2 3 4 5 6 7 8 9 10 11
16) You can repeat vectors or elements of vectors using the function rep(). If you want to repeat the complete vector, for example, you specify the argument times. To repeat the vector c(0, 0, 7) three times, use the following code.
rep(c(0, 0, 7), times = 3)
#[1] 0 0 7 0 0 7 0 0 7
You also can repeat every value by specifying the argument each, like this:
rep(c(2, 4, 2), each = 3)
#[1] 2 2 2 4 4 4 2 2 2
You can tell R for each value how often it has to be repeated.
rep(c(0, 7), times = c(4,2))
#[1] 0 0 0 0 7 7
And you can, as in seq(), use the argument length.out to tell R how long you want the result to be. R will repeat the vector until it reaches that length, even if the last repetition is incomplete, like so:
rep(1:3,length.out=7)
#[1] 1 2 3 1 2 3 1
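A common use of rep() and seq() is to build group labels and evenly spaced values for an experiment. Here is a small sketch; the group names and the number of replicates are hypothetical.

```r
# Hypothetical design: 2 treatments with 3 replicates each
groups <- rep(c("control", "treated"), each = 3)
groups

# Replicate numbers 1, 2, 3 repeated for each treatment
replicates <- rep(1:3, times = 2)
replicates

# seq() with length.out returns evenly spaced values between two limits
concentrations <- seq(0, 1, length.out = 5)
concentrations
```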
3.2 Matrices
17) A matrix object is defined by elements arranged in rows and columns (i.e., n rows x m columns). Matrices behave like vectors with dimensions, but you might need matrix algebra to work with them.
my_matrix <- matrix(data = c(1,2,3,11,12,13),
                    nrow = 2,
                    ncol = 3,
                    byrow = TRUE)
my_matrix
# [,1] [,2] [,3]
#[1,] 1 2 3
#[2,] 11 12 13
str(my_matrix)
#num [1:2, 1:3] 1 11 2 12 3 13
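class(my_matrix)
#[1] "matrix" "array"
## In R versions before 4.0.0, this returns only "matrix"
my_matrix + 1
# [,1] [,2] [,3]
#[1,] 2 3 4
#[2,] 12 13 14
my_matrix * 2
# [,1] [,2] [,3]
#[1,] 2 4 6
#[2,] 22 24 26
The operators above act on each element of the matrix. Arithmetic between two matrices with + or * is also done element by element and requires matching dimensions, whereas matrix multiplication in the sense of matrix algebra uses the operator %*%. We will not use these data structures that much.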
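Here is a brief sketch of the most common matrix operations: getting dimensions, selecting elements by [row, column], transposing with t(), and the difference between element-wise multiplication (*) and matrix multiplication (%*%). The matrix m is a hypothetical example.

```r
# Hypothetical 2 x 3 matrix filled by rows
m <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
m

dim(m)      # number of rows and columns
m[2, 3]     # element in row 2, column 3
m[, 2]      # all rows of column 2

# t() transposes the matrix (rows become columns)
m_t <- t(m)
dim(m_t)

# * multiplies element by element (both matrices must have the same dimensions)
elementwise <- m * m

# %*% is matrix multiplication: (2 x 3) %*% (3 x 2) gives a 2 x 2 matrix
matrix_product <- m %*% m_t
matrix_product
```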
3.3 Arrays
18) These objects are similar to a matrix, but they can store data in more than 2 dimensions (e.g., x * y * z). In an array object, you can store several matrices in a cube-like organization. For example, an array with dimensions (2, 3, 5) includes 5 rectangular matrices, each with 2 rows and 3 columns. These objects are created with the function array(), which takes vectors as input and uses the values in the argument dim to create the 3D structure of the array.
We will create an array of five 2x3 matrices. In other words, each of these matrices has 2 rows and 3 columns, and 5 of them are stacked in the cube-like structure of the array object.
# Five input vectors.
vector1 <- c(1,2,3)
vector2 <- c(4,5,6,7)
vector3 <- c(8,9,10,11,12)
vector4 <- c(13,14,15,16,17)
vector5 <- c(18,19,20,21,22)
# Take these vectors as input to the array.
# Note: the five vectors hold 22 values, but the array needs 2 x 3 x 5 = 30, so R recycles the values from the start to fill the end of matrix 4 and all of matrix 5.
my_array <- array(c(vector1,vector2,vector3,vector4,vector5),dim = c(2,3,5))
my_array
#, , 1
#
# [,1] [,2] [,3]
#[1,] 1 3 5
#[2,] 2 4 6
#
#, , 2
#
# [,1] [,2] [,3]
#[1,] 7 9 11
#[2,] 8 10 12
#
#, , 3
#
# [,1] [,2] [,3]
#[1,] 13 15 17
#[2,] 14 16 18
#
#, , 4
#
# [,1] [,2] [,3]
#[1,] 19 21 1
#[2,] 20 22 2
#
#, , 5
#
# [,1] [,2] [,3]
#[1,] 3 5 7
#[2,] 4 6 8
str(my_array)
#num [1:2, 1:3, 1:5] 1 2 3 4 5 6 7 8 9 10 ...
class(my_array)
#[1] "array"
## Print the second row (2) of the fifth matrix (5) of the array
my_array[2,,5]
#[1] 4 6 8
You can do calculations with these array elements.
## add the matrices 2 and 5
my_array[,,2] + my_array[,,5]
# [,1] [,2] [,3]
#[1,] 10 14 18
#[2,] 12 16 20
## multiply matrices 2 and 5 (element by element)
my_array[,,2] * my_array[,,5]
# [,1] [,2] [,3]
#[1,] 21 45 77
#[2,] 32 60 96
## Use apply to calculate the sum of the rows, which is indicated by c(1), across all the matrices. It will be vector with two elements (i.e., the sums of each of the rows of matrices in the array).
apply(my_array, c(1), sum)
#[1] 137 152
## Use apply to calculate the sum of the columns, which is indicated by c(2), across all the matrices. It will be a vector with three elements because each matrix has three columns.
apply(my_array, c(2), sum)
#[1] 91 111 87
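The same idea extends to the third dimension: using 3 as the margin in apply() gives one result per matrix. Here is a minimal sketch with a hypothetical array of four 2x3 matrices, which also shows how R recycles values when the input is too short.

```r
# Hypothetical array: 24 values fill four 2 x 3 matrices exactly
a <- array(1:24, dim = c(2, 3, 4))
dim(a)

# One element: row 1, column 2 of matrix 3
a[1, 2, 3]

# Margin 3 applies the function to each matrix in the array
sum_per_matrix <- apply(a, 3, sum)
sum_per_matrix

# Recycling: only 4 values are given for 6 cells, so R starts again from the first value
recycled <- array(1:4, dim = c(2, 3))
recycled
```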
3.4 Data Frames
19) A data frame is an extremely flexible object in the R environment. Most packages, functions and applications use data frames to store, search, transform and filter data. A data frame is similar to a matrix (or an Excel worksheet), but it also behaves like a list and a table.
Long and Teetor (2019) provide some useful characteristics of a data frame. To an R programmer: A data frame is a hybrid data structure, part matrix and part list. A column can contain numbers, character strings, or factors, but not a mix of them. You can index the data frame just like you index a matrix. The data frame is also a list, where the list elements are the columns, so you can access columns by using list operators.
Therefore, the flexibility of the data frame derives from some of its properties:
- The elements (cells) of a data frame are usually numeric, character, logical, or factor
- The columns and rows can have names
- You can extract rows and columns as vectors
- You can filter and subset the data frame based on rules, functions and conditions
- You can store results of calculations in elements (cells) of a data frame
- You can append new columns and rows to a data frame
- You can coerce other data structures into data frames (e.g., collect vectors into a data frame, transform from a matrix). More complex data structures (e.g., tibbles, data.tables, biostrings objects) are usually modifications of data frames and can also be coerced to data frames
- Most data (e.g., CSV or tab-delimited datasets) are imported into R as data frames
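The hybrid nature described above (part matrix, part list) can be sketched with a small data frame. The sample names and values below are hypothetical.

```r
# Hypothetical sequencing samples
samples_df <- data.frame(sample_id = c("s1", "s2", "s3"),
                         reads = c(120, 85, 240),
                         passed = c(TRUE, FALSE, TRUE),
                         stringsAsFactors = FALSE)

# Matrix-like indexing: [row, column]
reads_s2 <- samples_df[2, "reads"]
reads_s2

# List-like indexing: each column is an element of a list
reads_column <- samples_df[["reads"]]
identical(reads_column, samples_df$reads)

# As a list, its length is the number of columns
length(samples_df)

# Each column has a single type, but different columns can have different types
column_classes <- sapply(samples_df, class)
column_classes
```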
20) You can build a data frame as follows by typing its contents in the R console (or by copying and pasting from a text editor).
my_dataframe <- data.frame(my_numbers = c(1,2,3,4,5),
                           my_text = c("apple", "pear", "grape", "pomegranate", "banana"),
                           my_boroughs = c("queens", "the_bronx", "brooklyn", "manhattan", "staten_island"),
                           stringsAsFactors = FALSE)
my_dataframe
# my_numbers my_text my_boroughs
#1 1 apple queens
#2 2 pear the_bronx
#3 3 grape brooklyn
#4 4 pomegranate manhattan
#5 5 banana staten_island
str(my_dataframe)
#'data.frame': 5 obs. of 3 variables:
#$ my_numbers : num 1 2 3 4 5
#$ my_text : chr "apple" "pear" "grape" "pomegranate" ...
#$ my_boroughs: chr "queens" "the_bronx" "brooklyn" "manhattan" ...
class(my_dataframe)
#[1] "data.frame"
Notice the stringsAsFactors = FALSE argument while constructing (typing) the data.frame. This prevents R from converting character variables (columns) to factors, which was the default behavior before R 4.0.0.
21) You can obtain the dimensions of your data frame (i.e., number of rows and columns) with the function dim().
# the object 'my_dataframe' has 5 rows and 3 columns
dim(my_dataframe)
#[1] 5 3
22) You can also import text files (comma-delimited *.csv files or tab-delimited *.txt files) as data frames. More on this in the section on how to import data.
## NOTE: remember to update the path to file with your dataset in your computer -- THIS IS EXCLUSIVE TO YOUR COMPUTER AND IT IS NOT THE PATH SHOWN BELOW
my_dataframe <- read.table(file = "~/Desktop/Teach_R/my_dataframe_csv.csv",
                           header = TRUE,
                           sep = ",",
                           stringsAsFactors = FALSE)
my_dataframe <- read.table(file = "~/Desktop/Teach_R/my_dataframe_tab.txt",
                           header = TRUE,
                           sep = "\t",
                           stringsAsFactors = FALSE)
Notice that the argument sep = is either , or \t, which tells R to separate columns at commas or tabs, respectively.
23) As mentioned, data frames are extremely flexible and you can call columns, rows or specific elements.
This returns a vector of the column named my_boroughs.
my_dataframe$my_boroughs
#[1] "queens" "the_bronx" "brooklyn" "manhattan" "staten_island"
Same result by indicating the position of the column my_boroughs.
my_dataframe[,3]
#[1] "queens" "the_bronx" "brooklyn" "manhattan" "staten_island"
Same result.
my_dataframe[,"my_boroughs"]
#[1] "queens" "the_bronx" "brooklyn" "manhattan" "staten_island"
This returns a subset data frame with the fourth row.
my_dataframe[4,]
# my_numbers my_text my_boroughs
#4 4 pomegranate manhattan
This returns the element in the first row and third column.
my_dataframe[1,3]
#[1] "queens"
24) Subsetting by a condition is one of the most common operations applied to data frames. These conditions are used to filter or extract data from the data frame.
This returns a data frame with only the column named my_boroughs.
subset(my_dataframe, select = my_boroughs)
# my_boroughs
#1 queens
#2 the_bronx
#3 brooklyn
#4 manhattan
#5 staten_island
This returns a subset data frame with the rows where the numbers in column my_numbers are more than or equal to 3.
subset(my_dataframe, my_numbers >= 3)
# my_numbers my_text my_boroughs
#3 3 grape brooklyn
#4 4 pomegranate manhattan
#5 5 banana staten_island
This returns a subset data frame with the rows where the column my_text contains banana.
subset(my_dataframe, my_text %in% "banana")
# my_numbers my_text my_boroughs
#5 5 banana staten_island
This returns a subset data frame with the rows where the column my_boroughs has the text _island. Notice that this uses the function grepl(), which searches for a text pattern (here _island) in the text elements of the column my_boroughs and returns TRUE when the pattern is found, even if it is only part of a word.
subset(my_dataframe, grepl(pattern = "_island", my_dataframe$my_boroughs))
# my_numbers my_text my_boroughs
#5 5 banana staten_island
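The same filters can also be written with square brackets and a logical vector, which is what subset() does for you. Here is a minimal sketch with a hypothetical data frame that imitates the one above; the condition goes in the row position, before the comma.

```r
# Hypothetical data frame that imitates my_dataframe
fruit_df <- data.frame(count = c(1, 2, 3, 4, 5),
                       fruit = c("apple", "pear", "grape", "pomegranate", "banana"),
                       stringsAsFactors = FALSE)

# Rows where count is 3 or more; all columns are kept
big_counts <- fruit_df[fruit_df$count >= 3, ]
big_counts

# Two conditions combined with &, returning only the column 'fruit'
picked <- fruit_df[fruit_df$count >= 3 & grepl("an", fruit_df$fruit), "fruit"]
picked
```

Do not forget the comma inside the brackets: fruit_df[condition, ] selects rows, whereas fruit_df[, 2] selects a column.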
25) You can append vectors as columns (e.g., variables) to a data frame as long as each vector has the same number of elements as the data frame has rows.
my_dataframe$my_colors <- c("red","orange","blue","black","green")
my_dataframe
# my_numbers my_text my_boroughs my_colors
#1 1 apple queens red
#2 2 pear the_bronx orange
#3 3 grape brooklyn blue
#4 4 pomegranate manhattan black
#5 5 banana staten_island green
26) You can append rows to a data frame as long as they come from another data frame with the same column names.
not_NYC <- data.frame(my_numbers = 6,
                      my_text = "potato",
                      my_boroughs = "not_NYC",
                      my_colors = "purple",
                      stringsAsFactors = FALSE)
my_dataframe_2 <- rbind(my_dataframe,not_NYC)
my_dataframe_2
# my_numbers my_text my_boroughs my_colors
#1 1 apple queens red
#2 2 pear the_bronx orange
#3 3 grape brooklyn blue
#4 4 pomegranate manhattan black
#5 5 banana staten_island green
#6 6 potato not_NYC purple
27) You can remove columns and rows from a data frame in a similar fashion as we did for vectors.
This will remove the last row (i.e., row 6) of my_dataframe_2.
my_dataframe_2[-6,]
# my_numbers my_text my_boroughs my_colors
#1 1 apple queens red
#2 2 pear the_bronx orange
#3 3 grape brooklyn blue
#4 4 pomegranate manhattan black
#5 5 banana staten_island green
This will remove the second column (i.e., column 2) of my_dataframe_2.
my_dataframe_2[,-2]
# my_numbers my_boroughs my_colors
#1 1 queens red
#2 2 the_bronx orange
#3 3 brooklyn blue
#4 4 manhattan black
#5 5 staten_island green
#6 6 not_NYC purple
28) You can also change the names of the columns of a data frame easily using the function names().
my_dataframe_for_names <- my_dataframe
#get a vector of names of columns
names(my_dataframe_for_names)
#[1] "my_numbers" "my_text" "my_boroughs" "my_colors"
# to change names, you provide a vector with new names
names(my_dataframe_for_names) <- c("numbers", "fruits", "boroughs", "colors")
my_dataframe_for_names
# numbers fruits boroughs colors
#1 1 apple queens red
#2 2 pear the_bronx orange
#3 3 grape brooklyn blue
#4 4 pomegranate manhattan black
#5 5 banana staten_island green
You can also change a single column name, but you have to indicate its position in the vector of column names.
names(my_dataframe_for_names)[3] <- "the_boroughs"
my_dataframe_for_names
# numbers fruits the_boroughs colors
#1 1 apple queens red
#2 2 pear the_bronx orange
#3 3 grape brooklyn blue
#4 4 pomegranate manhattan black
#5 5 banana staten_island green
29) You can remove NA elements from a data frame, but doing so removes every row that contains such values.
Let’s create a data frame that has NA elements.
my_dataframe_with_NAs <- data.frame(numbers_A = c(1,2,3,4),
                                    numbers_B = c(5,NA,7,NA),
                                    numbers_C = c(9, NA, 11,12),
                                    stringsAsFactors = FALSE)
my_dataframe_with_NAs
# numbers_A numbers_B numbers_C
#1 1 5 9
#2 2 NA NA
#3 3 7 11
#4 4 NA 12
We can remove all rows that contain NA elements using the function na.omit().
na.omit(my_dataframe_with_NAs)
# numbers_A numbers_B numbers_C
#1 1 5 9
#3 3 7 11
We can also remove only the rows with NA values in a given column (e.g., numbers_C) using the function complete.cases().
my_dataframe_with_NAs[complete.cases(my_dataframe_with_NAs[,"numbers_C"]),]
# numbers_A numbers_B numbers_C
#1 1 5 9
#3 3 7 11
#4 4 NA 12
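Before removing rows, it is often useful to count how many NA values each column has. Here is a small sketch that rebuilds a hypothetical data frame with the same values as above, so it can run on its own.

```r
# Hypothetical data frame with missing values
df_na <- data.frame(numbers_A = c(1, 2, 3, 4),
                    numbers_B = c(5, NA, 7, NA),
                    numbers_C = c(9, NA, 11, 12))

# is.na() marks every missing cell; colSums() counts them per column
na_per_column <- colSums(is.na(df_na))
na_per_column

# complete.cases() on the whole data frame marks the rows without any NA
rows_complete <- complete.cases(df_na)
rows_complete

# Many functions can skip NA values instead of requiring their removal
mean_B <- mean(df_na$numbers_B, na.rm = TRUE)
mean_B
```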
30) You can get an overview of your data frame with the following functions.
You can get its structure with str(). The object mtcars is a built-in dataset from ‘Motor Trend Car Road Tests’ that is often used to illustrate examples in R.
my_long_dataframe <- mtcars
str(my_long_dataframe)
#'data.frame': 32 obs. of 11 variables:
#$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
#$ disp: num 160 160 108 258 360 ...
#$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
#$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
#$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
#$ qsec: num 16.5 17 18.6 19.4 17 ...
#$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
#$ am : num 1 1 1 0 0 0 0 0 0 0 ...
#$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
#$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
See the first rows with head().
head(my_long_dataframe)
# mpg cyl disp hp drat wt qsec vs am gear carb
#Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
#Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
#Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
#Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
#Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
See the last rows with tail().
tail(my_long_dataframe)
# mpg cyl disp hp drat wt qsec vs am gear carb
#Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
#Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
#Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
#Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
#Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
#Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.6 1 1 4 2
You can see the size and dimensions of the data frame with dim().
#returns a 2-element vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)
dim(my_long_dataframe)
#[1] 32 11
You can see the number of rows with nrow().
nrow(my_long_dataframe)
#[1] 32
You can see the number of columns with ncol().
ncol(my_long_dataframe)
#[1] 11
31) You can get a summary of the data frame with the function summary(), which reports summary statistics (min, max, median, mean, etc.) for each numeric column.
# names of columns
names(my_long_dataframe)
#[1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"
summary(my_long_dataframe)
# mpg cyl disp hp drat wt qsec vs
#Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0 Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
#1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
#Median :19.20 Median :6.000 Median :196.3 Median :123.0 Median :3.695 Median :3.325 Median :17.71 Median :0.0000
#Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7 Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
#3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
#Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0 Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
# am gear carb
#Min. :0.0000 Min. :3.000 Min. :1.000
#1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
#Median :0.0000 Median :4.000 Median :2.000
#Mean :0.4062 Mean :3.688 Mean :2.812
#3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
#Max. :1.0000 Max. :5.000 Max. :8.000
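The same function also works on a single column, and statistics that summary() does not report can be computed per column with sapply(). The sketch below assumes the built-in mtcars data frame as a stand-in for my_long_dataframe; mpg_summary and column_sd are illustrative names.

```r
# Summary of a single numeric column (built-in mtcars used as a stand-in)
mpg_summary <- summary(mtcars$mpg)

# The result is a named vector, so individual statistics can be pulled by name
names(mpg_summary)
mpg_summary[["Median"]]

# A statistic that summary() does not report, computed for every column
column_sd <- sapply(mtcars, sd)
round(column_sd[["mpg"]], 2)
```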
3.5 Tables
32) A table object is similar to a data frame, but it is used to summarize or tabulate your data, especially categorical variables. Usually, you use a table object to summarize a data frame.
A contingency table is created with the function table(), which returns an object for tabulating counts and percentages of one or more variables. Missing values (NA) are excluded from the counts unless you specify the argument useNA = "ifany" or useNA = "always".
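To make the three behaviors of useNA concrete, here is a minimal sketch on a toy vector. The vector fruit and the object names are hypothetical and not part of the course datasets.

```r
# Toy vector with one missing value (hypothetical data)
fruit <- c("apple", "pear", NA, "apple")

# Default: NA is excluded, so the counts add up to 3, not 4
counts_default <- table(fruit)

# "ifany": an <NA> category appears only because fruit contains an NA
counts_ifany <- table(fruit, useNA = "ifany")

# "always": an <NA> category is reported even when its count is zero
counts_always <- table(c("apple", "pear"), useNA = "always")

sum(counts_default)    # 3
sum(counts_ifany)      # 4
length(counts_always)  # 3 categories: apple, pear and <NA>
```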
To exemplify a table, we create a version of my_dataframe with a different number of rows and append them together.
my_dataframe_3 <- rbind(my_dataframe, my_dataframe_2, my_dataframe_2[-1, ])
Now, we can tabulate values in my_dataframe_3. For example, a table of the number of times each borough is repeated.
table(my_dataframe_3$my_boroughs)
# brooklyn manhattan not_NYC queens staten_island the_bronx
# 3 3 2 2 3 3
A table of the number of times each my_text element is repeated.
table(my_dataframe_3$my_text)
# apple banana grape pear pomegranate potato
# 2 3 3 3 3 2
A two-way table for my_boroughs and my_text. In this table, we can see when both vectors coincide with the same combination of text terms (e.g., queens with apple occurs 2 times in the same position of both vectors).
table(my_dataframe_3$my_boroughs, my_dataframe_3$my_text)
# apple banana grape pear pomegranate potato
# brooklyn 0 0 3 0 0 0
# manhattan 0 0 0 0 3 0
# not_NYC 0 0 0 0 0 2
# queens 2 0 0 0 0 0
# staten_island 0 3 0 0 0 0
# the_bronx 0 0 0 3 0 0
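Two follow-up steps are often useful with a two-way table: adding row and column totals with addmargins(), and converting the table to a long data frame with as.data.frame(). The sketch below uses toy vectors (boroughs and fruits are hypothetical, not the course objects).

```r
# Toy vectors of equal length (hypothetical data)
boroughs <- c("queens", "queens", "brooklyn")
fruits   <- c("apple", "apple", "grape")
two_way  <- table(boroughs, fruits)

# Add a "Sum" row and a "Sum" column with the marginal totals
with_totals <- addmargins(two_way)
with_totals["Sum", "Sum"]  # 3, the total number of observations

# Long format: one row per combination, with the count in column Freq
long_form <- as.data.frame(two_way)
long_form
```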
33) When dealing with NA values, tables can include or exclude these in tabulations.
We can create a data frame that contains NA values and append it.
my_dataframe_4 <- data.frame(my_numbers = c(6, NA),
                             my_text = c(NA, "potato"),
                             my_boroughs = c("not_NYC", NA),
                             my_colors = c(NA, "purple"),
                             stringsAsFactors = FALSE)
my_dataframe_5 <- rbind(my_dataframe_3, my_dataframe_4)
As before, we tabulate my_dataframe_5. By default, this excludes NA values.
table(my_dataframe_5$my_boroughs, my_dataframe_5$my_text)
# apple banana grape pear pomegranate potato
# brooklyn 0 0 3 0 0 0
# manhattan 0 0 0 0 3 0
# not_NYC 0 0 0 0 0 2
# queens 2 0 0 0 0 0
# staten_island 0 3 0 0 0 0
# the_bronx 0 0 0 3 0 0
However, we can modify this to tabulate NA values with the argument useNA = "always".
table(my_dataframe_5$my_boroughs, my_dataframe_5$my_text, useNA = "always")
# apple banana grape pear pomegranate potato <NA>
# brooklyn 0 0 3 0 0 0 0
# manhattan 0 0 0 0 3 0 0
# not_NYC 0 0 0 0 0 2 1
# queens 2 0 0 0 0 0 0
# staten_island 0 3 0 0 0 0 0
# the_bronx 0 0 0 3 0 0 0
# <NA> 0 0 0 0 0 1 0
34) We can view and label the dimensions of a table using the function dimnames().
table_subset <- table(my_dataframe_5$my_boroughs, my_dataframe_5$my_text)
dimnames(table_subset)
#[[1]]
#[1] "brooklyn" "manhattan" "not_NYC" "queens" "staten_island" "the_bronx"
#[[2]]
#[1] "apple" "banana" "grape" "pear" "pomegranate" "potato"
The dimensions are unnamed (i.e., no name is associated with [[1]] or [[2]]). However, we can name them accordingly.
names(dimnames(table_subset)) <- c("my_boroughs", "my_text")
table_subset
# my_text
#my_boroughs apple banana grape pear pomegranate potato
# brooklyn 0 0 3 0 0 0
# manhattan 0 0 0 0 3 0
# not_NYC 0 0 0 0 0 2
# queens 2 0 0 0 0 0
# staten_island 0 3 0 0 0 0
# the_bronx 0 0 0 3 0 0
35) You can also get proportions or percentages for the tabulations in your table object with the function prop.table().
prop.table(table_subset)
# my_text
#my_boroughs apple banana grape pear pomegranate potato
# brooklyn 0.0000 0.0000 0.1875 0.0000 0.0000 0.0000
# manhattan 0.0000 0.0000 0.0000 0.0000 0.1875 0.0000
# not_NYC 0.0000 0.0000 0.0000 0.0000 0.0000 0.1250
# queens 0.1250 0.0000 0.0000 0.0000 0.0000 0.0000
# staten_island 0.0000 0.1875 0.0000 0.0000 0.0000 0.0000
# the_bronx 0.0000 0.0000 0.0000 0.1875 0.0000 0.0000
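By default, the proportions are relative to the grand total of the table. The argument margin of prop.table() changes that reference to rows or columns. Here is a minimal sketch on a toy table; boroughs, fruits and toy_table are hypothetical names.

```r
# Toy two-way table (hypothetical data)
boroughs <- c("queens", "queens", "brooklyn", "brooklyn", "brooklyn")
fruits   <- c("apple", "pear", "apple", "apple", "pear")
toy_table <- table(boroughs, fruits)

# margin = 1: each row sums to 1 (proportions within each borough)
by_row <- prop.table(toy_table, margin = 1)

# margin = 2: each column sums to 1 (proportions within each fruit)
by_col <- prop.table(toy_table, margin = 2)

# Multiply by 100 and round to report percentages
round(100 * by_row, 1)
```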
This function also works with a one-way table built from a single vector.
## Let's find the proportion of names within a vector of 'my_boroughs'
prop.table(table(my_dataframe_5$my_boroughs))
# brooklyn manhattan not_NYC queens staten_island the_bronx
# 0.1764706 0.1764706 0.1764706 0.1176471 0.1764706 0.1764706
proportions_my_boroughs <- as.numeric(prop.table(table(my_dataframe_5$my_boroughs)))
proportions_my_boroughs
#[1] 0.1764706 0.1764706 0.1764706 0.1176471 0.1764706 0.1764706
str(proportions_my_boroughs)
#num [1:6] 0.176 0.176 0.176 0.118 0.176 ...
## Let's assign the same names as in the table
names(proportions_my_boroughs) <- names(table(my_dataframe_5$my_boroughs))
proportions_my_boroughs
# brooklyn manhattan not_NYC queens staten_island the_bronx
# 0.1764706 0.1764706 0.1764706 0.1176471 0.1764706 0.1764706
## This vector should add up to 1 (i.e., 100%)
sum(proportions_my_boroughs)
#[1] 1
3.6 Lists
36) A list object is also a very flexible data structure in the R environment. You can use it to collect pretty much any of the previous objects (e.g., scalars, vectors, matrices, data frames, functions). To build a list we use the function list().
# We can collect diverse objects in one list
my_list <- list(c(1,2,3), my_dataframe, table_subset)
my_list
#[[1]]
#[1] 1 2 3
#[[2]]
# my_numbers my_text my_boroughs my_colors
#1 1 apple queens red
#2 2 pear the_bronx orange
#3 3 grape brooklyn blue
#4 4 pomegranate manhattan black
#5 5 banana staten_island green
#[[3]]
# my_text
#my_boroughs apple banana grape pear pomegranate potato
# brooklyn 0 0 3 0 0 0
# manhattan 0 0 0 0 3 0
# not_NYC 0 0 0 0 0 2
# queens 2 0 0 0 0 0
# staten_island 0 3 0 0 0 0
# the_bronx 0 0 0 3 0 0
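Printing a long list can be overwhelming, so it often helps to inspect it first. The sketch below builds a small stand-in list (toy_list is a hypothetical name) and checks its length and the class of each element.

```r
# A small list holding three different kinds of objects (hypothetical data)
toy_list <- list(c(1, 2, 3),
                 data.frame(id = 1:2, fruit = c("apple", "pear")),
                 mean)

length(toy_list)              # number of elements in the list
sapply(toy_list, class)       # class of each element
str(toy_list, max.level = 1)  # compact overview of the structure
```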
37) You can also create an empty list, which is useful to fill with elements as desired by the user.
my_list_to_fill <- list()
my_list_to_fill
#list()
my_list_to_fill[[1]] <- "one"
my_list_to_fill[[2]] <- 1:10
my_list_to_fill[[3]] <- mean(my_list_to_fill[[2]]) # mean of an element already in the list
my_list_to_fill
#[[1]]
#[1] "one"
#[[2]]
#[1] 1 2 3 4 5 6 7 8 9 10
#[[3]]
#[1] 5.5
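When the number of elements is known in advance, the list can be pre-allocated and filled inside a loop. This is a minimal sketch of that pattern; squares_list is a hypothetical name.

```r
# Pre-allocate a list with three empty (NULL) slots
squares_list <- vector("list", length = 3)

# Fill each slot in a loop; seq_along() gives the indices 1, 2, 3
for (i in seq_along(squares_list)) {
  squares_list[[i]] <- (1:i)^2
}

# The third element holds the squares of 1, 2 and 3
squares_list[[3]]
```

Each element can have a different length, which is something a matrix or data frame column cannot do.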
38) Names can be assigned to list elements, and you can then select elements by using their assigned names.
my_list <- list(some_numbers = c(1,2,3), a_data_frame = my_dataframe)
my_list
#$some_numbers
#[1] 1 2 3
#$a_data_frame
# my_numbers my_text my_boroughs my_colors
#1 1 apple queens red
#2 2 pear the_bronx orange
#3 3 grape brooklyn blue
#4 4 pomegranate manhattan black
#5 5 banana staten_island green
my_list[["a_data_frame"]]
# my_numbers my_text my_boroughs my_colors
#1 1 apple queens red
#2 2 pear the_bronx orange
#3 3 grape brooklyn blue
#4 4 pomegranate manhattan black
#5 5 banana staten_island green
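Named elements can be reached in three ways that are easy to confuse. The sketch below, on a hypothetical toy_list, shows that $ and [[ ]] return the element itself, while single brackets return a shorter list.

```r
# Toy named list (hypothetical data)
toy_list <- list(some_numbers = c(1, 2, 3), a_word = "frog")

# $ and [[ ]] both return the element itself (here a numeric vector)
toy_list$some_numbers
toy_list[["some_numbers"]]

# Single brackets return a shorter list that still wraps the element
sub_list <- toy_list["some_numbers"]

class(toy_list[["some_numbers"]])  # "numeric"
class(sub_list)                    # "list"
```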
39) You can pull elements of a list by using their indices, and even a specific value within a list element.
# first element of the list
my_list[[1]]
#[1] 1 2 3
# get the third value of the first list element
my_list[[1]][3]
#[1] 3
40) Removing elements from a list can be done by assigning NULL to the selected element.
my_list
#$some_numbers
#[1] 1 2 3
#$a_data_frame
# my_numbers my_text my_boroughs my_colors
#1 1 apple queens red
#2 2 pear the_bronx orange
#3 3 grape brooklyn blue
#4 4 pomegranate manhattan black
#5 5 banana staten_island green
my_list[[2]] <- NULL
my_list
#$some_numbers
#[1] 1 2 3
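Because assigning NULL deletes the element, the remaining elements shift position. If you instead want to keep a slot whose content is NULL, one common idiom is to assign list(NULL) with single brackets. A minimal sketch on a hypothetical toy_list:

```r
# Toy named list (hypothetical data)
toy_list <- list(a = 1, b = "two", c = 3)

# Assigning NULL with [[ ]] deletes the element, so the list gets shorter
toy_list[["b"]] <- NULL
names(toy_list)    # "a" "c"

# To keep a slot but store NULL in it, assign list(NULL) with single brackets
toy_list["a"] <- list(NULL)
length(toy_list)   # still 2
```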
41) You can also flatten a list object (i.e., unlist or transform it) into a vector by using the function unlist().
my_list <- list(some_numbers = c(1,2,3), a_data_frame = my_dataframe)
my_list
#$some_numbers
#[1] 1 2 3
#$a_data_frame
# my_numbers my_text my_boroughs my_colors
#1 1 apple queens red
#2 2 pear the_bronx orange
#3 3 grape brooklyn blue
#4 4 pomegranate manhattan black
#5 5 banana staten_island green
my_flatten_list <- unlist(my_list)
my_flatten_list
# some_numbers1 some_numbers2 some_numbers3 a_data_frame.my_numbers1 a_data_frame.my_numbers2
# "1" "2" "3" "1" "2"
# a_data_frame.my_numbers3 a_data_frame.my_numbers4 a_data_frame.my_numbers5 a_data_frame.my_text1 a_data_frame.my_text2
# "3" "4" "5" "apple" "pear"
# a_data_frame.my_text3 a_data_frame.my_text4 a_data_frame.my_text5 a_data_frame.my_boroughs1 a_data_frame.my_boroughs2
# "grape" "pomegranate" "banana" "queens" "the_bronx"
#a_data_frame.my_boroughs3 a_data_frame.my_boroughs4 a_data_frame.my_boroughs5 a_data_frame.my_colors1 a_data_frame.my_colors2
# "brooklyn" "manhattan" "staten_island" "red" "orange"
# a_data_frame.my_colors3 a_data_frame.my_colors4 a_data_frame.my_colors5
# "blue" "black" "green"
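Note that every value in the output above is quoted: a vector can hold only one type, so unlist() coerces all values to a common type (character here, because the data frame contains text). The sketch below contrasts an all-numeric list with a mixed one; the object names are hypothetical.

```r
# All-numeric list: the flattened result stays numeric
num_list <- list(a = 1:2, b = c(3.5, 4))
flat_num <- unlist(num_list)
class(flat_num)    # "numeric"
names(flat_num)    # "a1" "a2" "b1" "b2"

# Mixed list: numbers are coerced to character
mixed_list <- list(a = 1:2, b = "frog")
flat_mixed <- unlist(mixed_list)
class(flat_mixed)  # "character"

# use.names = FALSE drops the generated names
flat_unnamed <- unlist(num_list, use.names = FALSE)
```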
Next, we will survey special data structures relevant for extremely long tables and for bioinformatics (nucleotide/amino acid sequences). These data structures are usually developed and managed within specific R-packages.
3.7 Data Tables
42) A data.table is an improved version of a data frame that is usually faster for analyzing and filtering extremely large datasets. To access this data structure, you need to install and load the R-package data.table. For more details, see the package vignette.
# if you have not installed this R-package yet
install.packages("data.table")
library(data.table)
input_csv_file <- "https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv"
flights <- fread(input_csv_file)
flights
# year month day dep_delay arr_delay carrier origin dest air_time distance hour
# 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9
# 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11
# 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19
# 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7
# 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13
# ---
#253312: 2014 10 31 1 -30 UA LGA IAH 201 1416 14
#253313: 2014 10 31 -5 -14 UA EWR IAH 189 1400 8
#253314: 2014 10 31 -8 16 MQ LGA RDU 83 431 11
#253315: 2014 10 31 -4 15 MQ LGA DTW 75 502 11
#253316: 2014 10 31 -5 1 MQ LGA SDF 110 659 8
str(flights)
#Classes ‘data.table’ and 'data.frame': 253316 obs. of 11 variables:
#$ year : int 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
#$ month : int 1 1 1 1 1 1 1 1 1 1 ...
#$ day : int 1 1 1 1 1 1 1 1 1 1 ...
#$ dep_delay: int 14 -3 2 -8 2 4 -2 -3 -1 -2 ...
#$ arr_delay: int 13 13 9 -26 1 0 -18 -14 -17 -14 ...
#$ carrier : chr "AA" "AA" "AA" "AA" ...
#$ origin : chr "JFK" "JFK" "JFK" "LGA" ...
#$ dest : chr "LAX" "LAX" "LAX" "PBI" ...
#$ air_time : int 359 363 351 157 350 339 338 356 161 349 ...
#$ distance : int 2475 2475 2475 1035 2475 2454 2475 2475 1089 2422 ...
#$ hour : int 9 11 19 7 13 18 21 15 15 18 ...
#- attr(*, ".internal.selfref")=<externalptr>
class(flights)
#[1] "data.table" "data.frame"
nrow(flights) # get number of rows of data.table
#[1] 253316
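The speed of data.table comes together with its own bracket syntax, DT[i, j, by]: filter rows in i, compute in j, and group with by. Here is a minimal sketch on a small hand-made table, so that it runs without downloading flights; toy_flights and its values are hypothetical.

```r
library(data.table)

# Toy table mimicking a few columns of 'flights' (hypothetical values)
toy_flights <- data.table(
  carrier   = c("AA", "AA", "UA", "UA", "MQ"),
  origin    = c("JFK", "LGA", "JFK", "EWR", "JFK"),
  arr_delay = c(13, -26, 4, -14, 16)
)

# i: keep only the rows for flights that departed from JFK
jfk_only <- toy_flights[origin == "JFK"]

# j with by: mean arrival delay per carrier
mean_delay <- toy_flights[, .(mean_arr_delay = mean(arr_delay)), by = carrier]

# .N counts the rows within each group
flights_per_origin <- toy_flights[, .N, by = origin]
```

The same three expressions should work unchanged on the full flights object, since it has columns with these names.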
3.8 Tibbles
43) A tibble is also a modified form of a data frame, used in the tidyverse family of R-packages. Tibbles are a simplified form of data frame and are used by some data management R-packages, e.g., readr, which parses data (i.e., imports or reads data from a file) into the R environment as a tibble. I am not a fan of tibbles, as they force the user to orbit around the 'tidyverse' for most things that a data frame can already handle.
## install or load R-package 'readr' which is a powerful data parser for large text (tab or csv) files
install.packages('readr')
library(readr)
reader_file_path <- readr_example("mtcars.csv") # this is a common dataset of car specifications
my_tibble <- read_delim(file = reader_file_path,
                        delim = ",")
#Rows: 32 Columns: 11
#── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────
#Delimiter: ","
#dbl (11): mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#
#ℹ Use `spec()` to retrieve the full column specification for this data.
#ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## use function spec() to extract the full column specification from a tibble created by readr.
spec(my_tibble)
#cols(
# mpg = col_double(),
# cyl = col_double(),
# disp = col_double(),
# hp = col_double(),
# drat = col_double(),
# wt = col_double(),
# qsec = col_double(),
# vs = col_double(),
# am = col_double(),
# gear = col_double(),
# carb = col_double()
#)
my_tibble
# A tibble: 32 x 11
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
# 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
# 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
# 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
# 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
# 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
# 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
# 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
# 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# … with 22 more rows
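If you prefer to work with plain data frames, a tibble can be converted back at any time. The sketch below assumes the tibble package is installed (it comes as a dependency of readr) and uses the built-in mtcars; cars_tibble and cars_df are hypothetical names.

```r
library(tibble)

# Convert a base data frame to a tibble; row names are dropped by default,
# so rownames = "cars" keeps them as a regular column
cars_tibble <- as_tibble(mtcars, rownames = "cars")
class(cars_tibble)   # "tbl_df" "tbl" "data.frame"

# Convert back to a plain data frame
cars_df <- as.data.frame(cars_tibble)
class(cars_df)       # "data.frame"
```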
3.9 DNAStringSets, RNAStringSets and AAStringSets
44) An XStringSet object is a special and flexible container, somewhat like a data frame, that holds sequences of different flavors: DNAStringSet, RNAStringSet or AAStringSet. To access this data structure, you need to install and load the R-package Biostrings. For more details, see the vignettes on that R-package's site. We will return to this object when we explore nucleotide and amino acid sequence manipulation.
For this introduction, we also need to install another powerful R-package, rentrez. This package provides an R interface to NCBI's EUtils API, allowing users to search databases like GenBank and PubMed, process the results of those searches and pull data into their R sessions.
# if you need to install R-packages 'Biostrings' and 'rentrez'
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
# remember that you might not need to update all other packages (type n)
BiocManager::install("Biostrings")
# if you need to install 'rentrez'
install.packages("rentrez")
# load these libraries
library(Biostrings)
library(rentrez)
For this example, we will download from NCBI some nucleotide sequences of a charismatic frog, Allobates kingsburyi.
froggy_name <- "Allobates kingsburyi[Organism]"
froggy_seq_IDs <- entrez_search(db="nuccore", term=froggy_name)
# inspecting the structure of 'froggy_seq_IDs' shows that there are 17 sequences in the NCBI nuccore database
str(froggy_seq_IDs)
#List of 5
#$ ids : chr [1:17] "1845966712" "1248341807" "328728168" "328728030" ...
#$ count : int 17
#$ retmax : int 17
#$ QueryTranslation: chr "\"Allobates kingsburyi\"[Organism]"
#$ file :Classes 'XMLInternalDocument', 'XMLAbstractDocument' <externalptr>
#- attr(*, "class")= chr [1:2] "esearch" "list"
froggy_seqs_fasta <- entrez_fetch(db="nuccore", id=froggy_seq_IDs$ids, rettype="fasta")
froggy_seqs_fasta
#[1] ">MT524123.1 Allobates kingsburyi voucher QCAZA68477 large subunit ribosomal RNA gene, partial sequence;
#mitochondrial\nCCTGATTAACCATAAGAGGTCAAGCCTGCCCAGTGACATTTGTTTAACGGCCGCGGTATCCTAACCGTGC\nGAAGGTAGCGTAATCACTTGTCCTT
#TAAATGAGGACTAGTATGAACGGCTTCACGAAGGCTATGCTGTCT\nCCTTTATCTAATCAGTTAAACTAATCTCCCCGTGAAGAAGCGGGGATACACCTATAAGACGAGAA
#...
cat(froggy_seqs_fasta)
#>MT524123.1 Allobates kingsburyi voucher QCAZA68477 large subunit ribosomal RNA gene, partial sequence; mitochondrial
#CCTGATTAACCATAAGAGGTCAAGCCTGCCCAGTGACATTTGTTTAACGGCCGCGGTATCCTAACCGTGC
#GAAGGTAGCGTAATCACTTGTCCTTTAAATGAGGACTAGTATGAACGGCTTCACGAAGGCTATGCTGTCT
#CCTTTATCTAATCAGTTAAACTAATCTCCCCGTGAAGAAGCGGGGATACACCTATAAGACGAGAAGACCC
#TATGGAGCTTTAAATACTTTAAAACACCTGAATCTGACACTAGAAACTTCCAGAAAACTTTATTTAACAT
#ATCACTTTGTTTTAAACTTTAGGTTGGGGTGACCACGGAGAAAAAACCAACCTCCACGTAGAATGAAATT
#TTCTTTCTAAGCGATAAGCTACATCTTTATGCATCAATACATTGACCTAAATTGACCCAATTTTTTGATC
#AACGAAC
#
#>MF580102.1 Allobates kingsburyi nicotinic acetylcholine receptor beta-2 (chrnb2) gene, partial cds
#ATGACGGTTCTCCTCCTCCTCCTGCACCTCAGCCTGTTCGGCCTGGTCACCAGGAGTATGGGCACGGACA
#CCGAGGAGCGGCTCGTGGAATTCCTGCTGGACCCGTCCCAGTACAACAAGCTGATCCGGCCCGCCACCAA
#TGGATCCGAGCAGGTCACCGTCCAGCTGATGGTATCTCTGGCCCAGCTGATCAGCGTGCACGAGCGGGAG
#...
We can save these sequences in a text file in your working directory.
# this path is specific to your OWN COMPUTER; change it accordingly
setwd("~/Desktop/Teach_R/my_working_directory")
write(froggy_seqs_fasta, "my_froggy_seqs_fasta.txt")
Now we can import the nucleotide sequences from this FASTA-format file into R.
my_Biostrings_set <- readDNAStringSet(filepath = "~/Desktop/Teach_R/my_working_directory/my_froggy_seqs_fasta.txt",
                                      format = "fasta")
my_Biostrings_set
#A DNAStringSet instance of length 17
# width seq names
# [1] 427 CCTGATTAACCATAAGAGGTCAAGCCTGCCCAGTGACATTTGTTTAACGGC...TATGCATCAATACATTGACCTAAATTGACCCAATTTTTTGATCAACGAAC MT524123.1 Alloba...
# [2] 899 ATGACGGTTCTCCTCCTCCTCCTGCACCTCAGCCTGTTCGGCCTGGTCACC...CCCCCGACGTCCCTGGACGTCCCGCTCGTCGGCAAGTACCTGATGTTCAC MF580102.1 Alloba...
# [3] 4881 AAGGTTTGGTCCTAGCCTTGAAGTCAGTTACTAATTAATATACACATGCAA...CCTTTGTTTACTTCCTATCTCCCCATCCCTTCTCTGCCTGCTCAGAAACT HQ290963.1 Alloba...
# [4] 576 GTACATCATAATGTGAGCAGATGGGAAAGCTTTGATGTCACACCAGCTATT...TCTACCTTGATGAAAATGAAAAAGTTGTTTTGAAAAACTATCAAGACATG HQ291024.1 Alloba...
# [5] 510 AACTCCCCTTCAGGTTCACAATTTCCCTTCAGCGGCATTGACGACCGGGAA...TGCTTCAATGGGAGCATGAAATTCAGAAGCTCACGGGTGACGAGAACTTC HQ290901.1 Alloba...
#... ... ...
#[13] 2389 AGGCTTGGTCCTAACCTTGAAGTCAGTTACTAATTAATATACACATGCAAG...CGACCTCGATGTTGGATCAGGATGTCCCAGTGGTGCAGCAGCTACTAATG EU342528.1 Alloba...
#[14] 2392 TAAAGGTTTGGTCCTAGCCTTGAAGTCAGTTACTAATTAATATACACATGC...CGACCTCGATGTTGGATCAGGATGTCCCAGTGGTGCAGCAGCTACTAATG EU342527.1 Alloba...
#[15] 2393 TTAAAGGTTTGGTCCTAGCCTTGAAGTCAGTTACTAATTAATATACACATG...CGACCTCGATGTTGGATCAGGATGTCCCAGTGGTGCAGCAGCTACTAATG EU342526.1 Alloba...
#[16] 2457 AAAGTTCTCCAACATAAAGGCTTGGTCCTAACCTTGAAGTCAGTTACTAAT...GTTCGTTTGTTCAACGATTAAAATCCTACGTGATCTGAGTTCAGACCGGA AY364550.1 Colost...
#[17] 2446 ATTAAAGGTTTGGTCCTAGCCTTGAAGTCAGTTACTAATTAATATACACAT...TTCGTTTGTTCAACGATTAAAATCCTACGTGATCTGAGTTCAAGACCGGA AY364549.1 Colost...
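Once sequences are in a DNAStringSet, they can be queried and subset much like a vector. The sketch below uses three short made-up sequences (toy_dna is hypothetical and not from NCBI), so it runs without an internet connection; it assumes Biostrings is installed.

```r
library(Biostrings)

# Toy sequences (hypothetical; not downloaded from NCBI)
toy_dna <- DNAStringSet(c(seq_A = "ATGCCGTA", seq_B = "GGGCCC", seq_C = "ATAT"))

width(toy_dna)   # length of each sequence: 8 6 4
names(toy_dna)   # "seq_A" "seq_B" "seq_C"

# Subset like a vector, by position or by a logical condition
first_two <- toy_dna[1:2]
long_seqs <- toy_dna[width(toy_dna) > 5]

# Reverse complement of every sequence
rev_comp <- reverseComplement(toy_dna)

# Proportion of G + C in each sequence
gc_content <- letterFrequency(toy_dna, letters = "GC", as.prob = TRUE)
```

The same calls should apply to my_Biostrings_set, for example to keep only the sequences longer than a chosen width.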
Next, we will download from NCBI some amino acid sequences of the same frog species, Allobates kingsburyi.
froggy_name <- "Allobates kingsburyi[Organism]"
froggy_AA_IDs <- entrez_search(db="protein", term=froggy_name)
froggy_AA_IDs
# Entrez search result with 11 hits (object contains 11 IDs and no web_history object)
# Search term (as translated): "Allobates kingsburyi"[Organism]
str(froggy_AA_IDs)
#List of 5
#$ ids : chr [1:11] "1248341808" "328728170" "328728169" "328728031" ...
#$ count : int 11
#$ retmax : int 11
#$ QueryTranslation: chr "\"Allobates kingsburyi\"[Organism]"
#$ file :Classes 'XMLInternalDocument', 'XMLAbstractDocument' <externalptr>
#- attr(*, "class")= chr [1:2] "esearch" "list"
froggy_AA_fasta <- entrez_fetch(db="protein", id=froggy_AA_IDs$ids, rettype="fasta")
froggy_AA_fasta
#[1] ">ATG31804.1 nicotinic acetylcholine receptor beta-2, partial [Allobates kingsburyi]\nMTVLLLLLHLSLFGLV
#TRSMGTDTEERLVEFLLDPSQYNKLIRPATNGSEQVTVQLMVSLAQLISVHERE\nQIMTTNVWLTQEWXXXXXXXXXXXXXXXXXXXXXXXXXWLPDVVLYNNADGMY
#EVSFYSNAVVSHDGSIF\nWLPPAIYKSACKIEVKHFPFDQQNCTMKFRSWTYDRTELDLVLKSDVASLDDFTPSGEWDIIALPGRRNE\nNPEDSTYVDITYDFIIRRKPL
#...
cat(froggy_AA_fasta)
#>ATG31804.1 nicotinic acetylcholine receptor beta-2, partial [Allobates kingsburyi]
#MTVLLLLLHLSLFGLVTRSMGTDTEERLVEFLLDPSQYNKLIRPATNGSEQVTVQLMVSLAQLISVHERE
#QIMTTNVWLTQEWXXXXXXXXXXXXXXXXXXXXXXXXXWLPDVVLYNNADGMYEVSFYSNAVVSHDGSIF
#WLPPAIYKSACKIEVKHFPFDQQNCTMKFRSWTYDRTELDLVLKSDVASLDDFTPSGEWDIIALPGRRNE
#NPEDSTYVDITYDFIIRRKPLFYTINLIIPCILITSLAILVFYLPSDCGEKMTLCISVLLALTVFLLLIS
#KIVPPTSLDVPLVGKYLMFT
#
#>AEB39272.1 NADH dehydrogenase subunit 2 (mitochondrion) [Allobates kingsburyi]
#MNPYALFLIISSLALGTSIAVSSFHWILAWIGLEINTLAIIPLMTKNPHPRSIEAATKYFLTQAAASSLI
#LFSCALNAWLLGEWTINNLMSPASMIFLSIALSTKLGLAPFHFWLPEVLQGLTLQTGWILSTWQKLAPLA
#ILFQLSQSINLLLMMSMGLLSILVGGWGGINQNQIRKILAFSSIAHLGWMITILKISPQLSLLNFILYII
#MTSALFYTFIMIDSTNISHLATTWTKIPTLTALSLMSLLSLSGLPPLTGFLPKWLIAQELINQNLIILPF
#LMLMLTLLALFFYLRLTYTISLTMAPNSTSSVSLWYQKKKNNLTIFILLTLCLLPISPSLLCLL
We can save these AA sequences in a text file in your working directory.
# this path is specific to your OWN COMPUTER; change it accordingly
setwd("~/Desktop/Teach_R/my_working_directory")
write(froggy_AA_fasta, "my_froggy_AA_fasta.txt")
Now we can import the amino acid sequences from this FASTA-format file into R.
my_AA_Biostrings_set <- readAAStringSet(filepath = "~/Desktop/Teach_R/my_working_directory/my_froggy_AA_fasta.txt",
                                        format = "fasta")
my_AA_Biostrings_set
#A AAStringSet instance of length 11
# width seq names
# [1] 300 MTVLLLLLHLSLFGLVTRSMGTDTEERLVEFLLDPSQYNKLIRPATNGSEQ...VFYLPSDCGEKMTLCISVLLALTVFLLLISKIVPPTSLDVPLVGKYLMFT ATG31804.1 nicoti...
# [2] 344 MNPYALFLIISSLALGTSIAVSSFHWILAWIGLEINTLAIIPLMTKNPHPR...RLTYTISLTMAPNSTSSVSLWYQKKKNNLTIFILLTLCLLPISPSLLCLL AEB39272.1 NADH d...
# [3] 320 LNIFTLTQSLCYMVPILLAVAFLTLLERKVLGYMQHRKGPNVIGPTGLLQP...LFLWVRASYPRFRYDQLMHLVWKNFLPMTLALTIWFITFPIIFLFSPPIL AEB39271.1 NADH d...
# [4] 192 VHHNVSRWESFDVTPAIIRWIAHRQPNHGFVVEVTQLDCEKNVTKRHVRIS...STNHAIVQTLVNSVNSNIPKACCVPTELSAISMLYLDENEKVVLKNYQDM AEB39195.1 bone m...
# [5] 170 NSPSGSQFPFSGIDDRENWPIVFYNRTCQCQGNFMGYNCGDCKFXFTGXNC...YASRDAFLEGDLVWQNIDFAHEAPAFLPWHRFFLLQWEHEIQKLTGDENF AEB39135.1 tyrosi...
# ... ... ...
# [7] 192 TTMDKRNLPESSMNSLFIKLMQADLLKNKIPKQVVNAKEIKQQSTIPKAEI...VTNKSNAIDIRGHQVAVLGEIKTGNSPVKQYFYETRCKDARPVKSGCRGI AEB39015.1 neurot...
# [8] 414 TIKKPNGETTKTTVRIWNETVSNLTLMALGSSAPEILLSVIEVCGHNFQAG...GIIDDDIFEEDENFLVHLSNVRVNAETTEVNFESNHVTSLACLGSPSTAT AEB38955.1 sodium...
# [9] 309 CIGLISVNGRMRNNMKAGSSPNSVSSSPTNSAITQLRHKLENGKPLGMNES...PIPLHQHERYLCKMNEEIKAVLQPSENLILNKQGMFAEKQALLLSSVLSE AEB38895.1 zinc f...
#[10] 201 VRGQSGLAYPGLRTHGTLESIGGPMSSSRGGGLPSLTDTFEHVIEELLEEE...QLKQYFYETKCNPMGYMKEGCRGIDKRYWNSQCRTTQSYVRALTMDSKKK AEB38835.1 brain-...
#[11] 230 GLCLIAQIITGLFLAMHYTADTTMAFSSIAHICRDVNNGWLLRSLHANGAS...VPFHAYFSYKDALGFIILLVLLSLLSLFSPNLLGDPDNFTPANPLVTPPH AEB33649.1 cytoch...
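The names of an imported AAStringSet are the full FASTA headers, which are long. A common first step is to shorten them to the accession and then look at the amino acid composition. This is a minimal sketch on two made-up sequences (toy_aa and its headers are hypothetical); it assumes Biostrings is installed.

```r
library(Biostrings)

# Toy protein sequences with NCBI-style headers (hypothetical)
toy_aa <- AAStringSet(c("P001.1 toy protein one" = "MTVLLLLLHL",
                        "P002.1 toy protein two" = "MNPYALF"))

# Keep only the accession (the text before the first space) as the name
names(toy_aa) <- sub(" .*", "", names(toy_aa))
names(toy_aa)     # "P001.1" "P002.1"

# Count how often each amino acid appears in each sequence
aa_counts <- alphabetFrequency(toy_aa)
aa_counts[, "L"]  # leucine counts: 6 and 1

# Convert back to a plain named character vector
aa_as_text <- as.character(toy_aa)
```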