Session 13 – Mining from a reference dataset

Sometimes you have a list of individuals with some indices and you want to retrieve values associated from a large reference database or data frame. Using these indices will help you retrieve data from these references without going to the painstaking work to copy-and-pasting such value one by one.

13.1 A worked example using the mtcars dataset

We will assume that you have four friends (Joe, Moe, Larry, Fred) and each one has a collection of cars. They want to compare the overall collection based on statistics provided by mtcars dataset.

13.2 Build a reference dataset

1) We will create the reference database as mtcars_data from the preloaded mtcars dataset. This dataset exists in the R environment. You can import a data frame (e.g., a txt or excel file) as data frame, but make sure that you include the indexing variable: my_index as demonstrated below.

## this is an example to create a reference index
mtcars_data <- mtcars

## we need to create a indexing variable, so will use car names
mtcars_names_vector <- rownames(mtcars_data)
mtcars_names_vector
#[1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"          "Hornet 4 Drive"      "Hornet Sportabout"   "Valiant"            
# [7] "Duster 360"          "Merc 240D"           "Merc 230"            "Merc 280"            "Merc 280C"           "Merc 450SE"         
#[13] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood"  "Lincoln Continental" "Chrysler Imperial"   "Fiat 128"           
#[19] "Honda Civic"         "Toyota Corolla"      "Toyota Corona"       "Dodge Challenger"    "AMC Javelin"         "Camaro Z28"         
#[25] "Pontiac Firebird"    "Fiat X1-9"           "Porsche 914-2"       "Lotus Europa"        "Ford Pantera L"      "Ferrari Dino"       
#[31] "Maserati Bora"       "Volvo 142E"         

## this vector has spaces (" ") and dashes (-), these might cause problems with some analyses and functions, we will replace them with underscores "_"
mtcars_names_vector <- gsub(" ", "_", mtcars_names_vector)
mtcars_names_vector <- gsub("-", "_", mtcars_names_vector)
mtcars_names_vector
#[1] "Mazda_RX4"           "Mazda_RX4_Wag"       "Datsun_710"          "Hornet_4_Drive"      "Hornet_Sportabout"   "Valiant"            
# [7] "Duster_360"          "Merc_240D"           "Merc_230"            "Merc_280"            "Merc_280C"           "Merc_450SE"         
#[13] "Merc_450SL"          "Merc_450SLC"         "Cadillac_Fleetwood"  "Lincoln_Continental" "Chrysler_Imperial"   "Fiat_128"           
#[19] "Honda_Civic"         "Toyota_Corolla"      "Toyota_Corona"       "Dodge_Challenger"    "AMC_Javelin"         "Camaro_Z28"         
#[25] "Pontiac_Firebird"    "Fiat_X1_9"           "Porsche_914_2"       "Lotus_Europa"        "Ford_Pantera_L"      "Ferrari_Dino"       
#[31] "Maserati_Bora"       "Volvo_142E"         

## we can now add this vector as our indexing variable
mtcars_data$my_index <- mtcars_names_vector
head(mtcars_data)
#                   mpg cyl disp  hp drat    wt  qsec vs am gear carb          my_index
#Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4         Mazda_RX4
#Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4     Mazda_RX4_Wag
#Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1        Datsun_710
#Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1    Hornet_4_Drive
#Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 Hornet_Sportabout
#Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1           Valiant

13.3 Build a sampling dataset

2) Now, we will create a data frame to be used to retrive the information for the friends_data. Our indexing varaible will be my_index, so the information on mtcars_data can be combined, merged or subset to match the friends_data. A similar procedure can be performed by importing a data frame (e.g., a txt or excel file) with the same data frame, make sure that my_index variable is present.

This is an illustration of how to make a data frame with data that can be indexed.

## we will start with four random vectors of the my_index variable to serve as what cars these friends have

Joe_10_cars_vector <- sample(mtcars_data$my_index, 10, replace = TRUE)
Moe_15_cars_vector <- sample(mtcars_data$my_index, 15, replace = TRUE)
Larry_12_cars_vector <- sample(mtcars_data$my_index, 12, replace = TRUE)
Fred_20_cars_vector <- sample(mtcars_data$my_index, 20, replace = TRUE)

## NOTICE: that this are random samples and each of you will get a different set.

## we build a two-column data frame as follows by making a list of data frames and collect each one

friends_list <- list()

## Joe
Joe_df <- data.frame(my_index = Joe_10_cars_vector, stringsAsFactors = FALSE)
Joe_df$friend <- "Joe"
friends_list[[1]] <- Joe_df

## Moe
Moe_df <- data.frame(my_index = Moe_15_cars_vector, stringsAsFactors = FALSE)
Moe_df$friend <- "Moe"
friends_list[[2]] <- Moe_df

## Larry
Larry_df <- data.frame(my_index = Larry_12_cars_vector, stringsAsFactors = FALSE)
Larry_df$friend <- "Larry"
friends_list[[3]] <- Larry_df

## Fred
Fred_df <- data.frame(my_index = Fred_20_cars_vector, stringsAsFactors = FALSE)
Fred_df$friend <- "Fred"
friends_list[[4]] <- Fred_df

## we rbind these data frames into one

friends_data <- do.call(rbind,friends_list)
str(friends_data)
#'data.frame': 57 obs. of  2 variables:
# $ my_index: chr  "Porsche_914_2" "Dodge_Challenger" "Merc_450SE" "Fiat_X1_9" ...
# $ friend  : chr  "Joe" "Joe" "Joe" "Joe" ...
head(friends_data)
#          my_index friend
#1    Porsche_914_2    Joe
#2 Dodge_Challenger    Joe
#3       Merc_450SE    Joe
#4        Fiat_X1_9    Joe
#5        Merc_280C    Joe
#6         Merc_230    Joe

13.4 Extracting from reference dataset

3) We can subset the mtcars_data to match the friends_data using the function merge and the argument by.x = "my_index". This will subset and merge from the mtcars_data the information about the cars based only on values of my_index in friends_data.

## with reference data frame

friends_data_with_mtcars_data <- merge(friends_data, mtcars_data, by.x = "my_index")
head(friends_data_with_mtcars_data)
#            my_index friend  mpg cyl disp  hp drat    wt  qsec vs am gear carb
#1        AMC_Javelin   Fred 15.2   8  304 150 3.15 3.435 17.30  0  0    3    2
#2        AMC_Javelin   Fred 15.2   8  304 150 3.15 3.435 17.30  0  0    3    2
#3 Cadillac_Fleetwood  Larry 10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
#4 Cadillac_Fleetwood  Larry 10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
#5  Chrysler_Imperial  Larry 14.7   8  440 230 3.23 5.345 17.42  0  0    3    4
#6         Datsun_710    Moe 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

## we can order by friend name

friends_data_with_mtcars_data_order <- friends_data_with_mtcars_data[order(friends_data_with_mtcars_data$friend),]
head(friends_data_with_mtcars_data_order)
#       my_index friend  mpg cyl disp  hp drat    wt  qsec vs am gear carb
#1   AMC_Javelin   Fred 15.2   8  304 150 3.15 3.435 17.30  0  0    3    2
#2   AMC_Javelin   Fred 15.2   8  304 150 3.15 3.435 17.30  0  0    3    2
#7    Datsun_710   Fred 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#8    Datsun_710   Fred 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#12   Duster_360   Fred 14.3   8  360 245 3.21 3.570 15.84  0  0    3    4
#13 Ferrari_Dino   Fred 19.7   6  145 175 3.62 2.770 15.50  0  1    5    6

4) You can determine different parameters of the car sets per each friend and compere between them

## We will split the large data frame into each friend

list_friends <- split(friends_data_with_mtcars_data_order, friends_data_with_mtcars_data_order$friend)

## names on the lists

names(list_friends)
#[1] "Fred"  "Joe"   "Larry" "Moe"  

## fred
fred_df <- list_friends[[1]]
head(fred_df)
#       my_index friend  mpg cyl disp  hp drat    wt  qsec vs am gear carb
#1   AMC_Javelin   Fred 15.2   8  304 150 3.15 3.435 17.30  0  0    3    2
#2   AMC_Javelin   Fred 15.2   8  304 150 3.15 3.435 17.30  0  0    3    2
#7    Datsun_710   Fred 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#8    Datsun_710   Fred 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#12   Duster_360   Fred 14.3   8  360 245 3.21 3.570 15.84  0  0    3    4
#13 Ferrari_Dino   Fred 19.7   6  145 175 3.62 2.770 15.50  0  1    5    6

## you can do the same as: fred_df <- subset(friends_data_with_mtcars_data_order, friend == "Fred")

joe_df <- list_friends[[2]]
larry_df <- list_friends[[3]]
moe_df <- list_friends[[4]]

We can ask which friend has more weight in their car set?

sum(fred_df$wt)
#[1] 58.893
sum(joe_df$wt)
#[1] 30.205
sum(larry_df$wt)
#[1] 44.445
sum(moe_df$wt)
#[1] 44.445

In this case, Fred clearly has won. We can ask which friend has more horsepower in their car set?

sum(fred_df$hp)
#[1] 2756
sum(joe_df$hp)
#[1] 1166
sum(larry_df$hp)
#[1] 1816
sum(moe_df$hp)
#[1] 1816

13.5 Plots of the extracted data

5) We can do some plots to exemplify some of these observations.

We can plot boxplots for hp.

## check ig ggplot2 is loaded
require(ggplot2)

## Boxplots with standard color
friends_hp_boxplot <- ggplot(friends_data_with_mtcars_data_order, aes(x=friend, y=hp, fill=friend)) + 
                       geom_boxplot(width=0.3) +
                     labs(title="Friends car collection -- HP boxplots", x="Friends", y="hp") +
                     guides(fill=FALSE) +
                     theme_minimal()
friends_hp_boxplot

We can plot density for wt.

## Overlaid densities
friends_wt_plot_den <- ggplot(friends_data_with_mtcars_data_order, aes(x=wt, fill=friend)) +
                    geom_density(alpha=.3) +
                     ggtitle("Friends car collection -- WT density plots") +
                     theme_minimal()
friends_wt_plot_den

We can have a scatterplot for mpg by hp.

## scatterplot
friends_scatterplot_mpg_hp <- ggplot(friends_data_with_mtcars_data_order, aes(x=mpg,  y = hp, color=friend)) +
                       geom_point() +
                       ggtitle("Friends car collection -- mpg versus hp") +
                       theme_minimal()
friends_scatterplot_mpg_hp