Session 8 – Multidimensional Data in R
Many datasets consist of multiple variables measured on dozens or hundreds of individuals. Because these parameters are obtained simultaneously (i.e., the data are multidimensional), they are often difficult to interpret together. Researchers typically aim to reduce this complexity, identify groups of interest, or increase interpretability using methods such as principal component analysis (PCA), multidimensional scaling (MDS), and cluster analysis, among others. We will explore some of these methods.
8.1 Principal Component Analysis
PCA is a common approach to reduce the dimensionality of a dataset: it condenses many variables into a few summary components while preserving much of the statistical information. During the analysis, new variables (called components) are found that are linear functions of those in the original dataset, that successively maximize variance, and that are uncorrelated with each other (see Jolliffe and Cadima 2016).
PCA is recommended when you have many variables (e.g., > 8) and many individuals with measurements of those variables (> 20, or better, in the hundreds). For this example, we will use the env data frame of the doubs dataset from the R-package ade4. env is a data frame with 30 rows (sites) and 11 environmental variables.
## we can install ade4 and load the dataset for this example
install.packages("ade4")
library(ade4)
##load the doubs dataset into the R environment
data(doubs)
## dataset of ecological tables: environmental variables, fish species and spatial coordinates for 30 sites.
?doubs
## we will use only the environmental variables
env_raw <- doubs$env
head(env_raw)
# dfs alt slo flo pH har pho nit amm oxy bdo
#1 3 934 6.176 84 79 45 1 20 0 122 27
#2 22 932 3.434 100 80 40 2 20 10 103 19
#3 102 914 3.638 180 83 52 5 22 5 105 35
#4 185 854 3.497 253 80 72 10 21 0 110 13
#5 215 849 3.178 264 81 84 38 52 20 80 62
#6 324 846 3.497 286 79 60 20 15 0 102 53
We can start by getting pairwise bivariate plots of these 11 variables.
## pairwise bivariate plots of the raw environmental variables
pairs(env_raw, pch = 19, cex=0.5)
We notice considerable variability in these pairwise plots. Some pairs show a linear relationship, others do not. Likewise, differences in scale are evident (some variables are in the hundreds, others much smaller).
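If you want to see these scale differences numerically before transforming the data, you can print the range of each column (a minimal sketch using base R):
## minimum and maximum of each environmental variable in the raw data
sapply(env_raw, range)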
## we can use the function modify() from the R-package purrr to log-transform every column of this dataset
library(purrr)
env_log <- purrr::modify(env_raw, log)
head(env_log)
# dfs alt slo flo pH har pho nit amm oxy bdo
#1 1.098612 6.839476 1.820671 4.430817 4.369448 3.806662 0.0000000 2.995732 -Inf 4.804021 3.295837
#2 3.091042 6.837333 1.233726 4.605170 4.382027 3.688879 0.6931472 2.995732 2.302585 4.634729 2.944439
#3 4.624973 6.817831 1.291434 5.192957 4.418841 3.951244 1.6094379 3.091042 1.609438 4.653960 3.555348
#4 5.220356 6.749931 1.251905 5.533389 4.382027 4.276666 2.3025851 3.044522 -Inf 4.700480 2.564949
#5 5.370638 6.744059 1.156252 5.575949 4.394449 4.430817 3.6375862 3.951244 2.995732 4.382027 4.127134
#6 5.780744 6.740519 1.251905 5.655992 4.369448 4.094345 2.9957323 2.708050 -Inf 4.624973 3.970292
## Notice some -Inf values; we will remove column 9, which corresponds to 'amm'
env_log_red <- env_log[,-9]
head(env_log_red)
# dfs alt slo flo pH har pho nit oxy bdo
#1 1.098612 6.839476 1.820671 4.430817 4.369448 3.806662 0.0000000 2.995732 4.804021 3.295837
#2 3.091042 6.837333 1.233726 4.605170 4.382027 3.688879 0.6931472 2.995732 4.634729 2.944439
#3 4.624973 6.817831 1.291434 5.192957 4.418841 3.951244 1.6094379 3.091042 4.653960 3.555348
#4 5.220356 6.749931 1.251905 5.533389 4.382027 4.276666 2.3025851 3.044522 4.700480 2.564949
#5 5.370638 6.744059 1.156252 5.575949 4.394449 4.430817 3.6375862 3.951244 4.382027 4.127134
#6 5.780744 6.740519 1.251905 5.655992 4.369448 4.094345 2.9957323 2.708050 4.624973 3.970292
## pairwise plots of the log-transformed, reduced dataset
pairs(env_log_red, pch = 19, cex=0.5)
We can run PCA using functions preloaded in R or with several specialized packages. We will start with the base functions prcomp() and summary().
## apply PCA - scale. = TRUE is highly advisable, but default is FALSE.
env_prcomp <- prcomp(env_log_red, center = TRUE, scale. = TRUE)
env_prcomp
#Standard deviations (1, .., p=10):
# [1] 2.5031260 1.2651443 0.9875811 0.6860815 0.5095091 0.3946341 0.3212448 0.2824013 0.2580574 0.1512203
#
#Rotation (n x k) = (10 x 10):
# PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
#dfs 0.36486258 -0.18617302 0.1482270 -0.27425018 0.070856291 -0.06822936 0.375972370 0.04774989 -0.73277048 0.211581601
#alt -0.36451449 0.12764399 -0.1690422 -0.36872511 0.165563399 0.30103670 0.159545574 0.45954368 -0.16927250 -0.550436218
#slo -0.33426066 0.15461463 0.2062596 -0.41599181 -0.694152941 0.08059586 -0.008314283 -0.38539136 -0.10748520 0.016710642
#flo 0.35536702 -0.26645120 0.2469480 0.07962110 -0.140470255 -0.21095156 0.148040510 -0.22735329 0.11111159 -0.764991160
#pH 0.02686455 -0.43046956 -0.8301590 -0.07333534 -0.270021833 -0.14147519 -0.099820286 -0.07667608 -0.09705901 -0.035234427
#har 0.32775459 -0.15043755 0.1025784 -0.71869778 0.192865111 -0.02065425 -0.450542106 0.06200716 0.30538254 0.071982806
#pho 0.35978479 0.15025127 -0.2384279 -0.09780973 0.004998182 0.59283974 0.493290882 -0.25929815 0.33522306 0.084468872
#nit 0.36509191 -0.03139547 0.1053406 0.23840435 -0.511528294 0.41078684 -0.313263865 0.51006660 -0.09509725 -0.006021563
#oxy -0.25192228 -0.56073462 0.2173434 -0.06473107 -0.174757541 -0.08405657 0.438140938 0.36832297 0.39110186 0.229102408
#bdo 0.24827761 0.55517993 -0.1762335 -0.13016095 -0.252295130 -0.55318468 0.254587830 0.33694829 0.18286003 0.026082994
The output of env_prcomp includes the standard deviation of each of the 10 principal components (PCs). You should expect as many PCs as there are variables in your dataset (i.e., 10 variables, 10 PCs). The values under Rotation (n x k) are the loadings of each variable on each component (PC) and indicate how much each variable contributes to that component. Loadings range from -1 to 1; the higher the absolute value of a loading on a PC, the more that variable is related to that PC (i.e., the better it is represented). Loadings with absolute values > 0.4 are usually considered to indicate a strong relationship with that component. Notice the sign of the loading: a positive loading indicates a direct relationship, while a negative loading indicates an inverse relationship.
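As a quick way to apply the 0.4 rule of thumb, you can inspect the loadings on the first two PCs directly (a minimal sketch; the cutoff is a convention, not a formal test):
## loadings of each variable on PC1 and PC2
round(env_prcomp$rotation[, 1:2], 2)
## TRUE where the absolute loading exceeds the 0.4 rule of thumb
abs(env_prcomp$rotation[, 1:2]) > 0.4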
## we can see the structure of the analysis
str(env_prcomp)
#List of 5
# $ sdev : num [1:10] 2.503 1.265 0.988 0.686 0.51 ...
#$ rotation: num [1:10, 1:10] 0.3649 -0.3645 -0.3343 0.3554 0.0269 ...
# ..- attr(*, "dimnames")=List of 2
# .. ..$ : chr [1:10] "dfs" "alt" "slo" "flo" ...
# .. ..$ : chr [1:10] "PC1" "PC2" "PC3" "PC4" ...
#$ center : Named num [1:10] 6.887 6.019 0.943 7.153 4.388 ...
# ..- attr(*, "names")= chr [1:10] "dfs" "alt" "slo" "flo" ...
#$ scale : Named num [1:10] 1.6786 0.5747 0.3863 1.3041 0.0214 ...
# ..- attr(*, "names")= chr [1:10] "dfs" "alt" "slo" "flo" ...
#$ x : num [1:30, 1:10] -6 -4.96 -3.6 -3.18 -1.24 ...
# ..- attr(*, "dimnames")=List of 2
# .. ..$ : chr [1:30] "1" "2" "3" "4" ...
# .. ..$ : chr [1:10] "PC1" "PC2" "PC3" "PC4" ...
# - attr(*, "class")= chr "prcomp"
## we can get a summary
summary(env_prcomp)
#Importance of components:
# PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
#Standard deviation 2.5031 1.2651 0.98758 0.68608 0.50951 0.39463 0.32124 0.28240 0.25806 0.15122
#Proportion of Variance 0.6266 0.1601 0.09753 0.04707 0.02596 0.01557 0.01032 0.00798 0.00666 0.00229
#Cumulative Proportion 0.6266 0.7866 0.88415 0.93123 0.95719 0.97276 0.98308 0.99105 0.99771 1.00000
The summary(env_prcomp) output describes the importance of each PC. The proportion of variance indicates the proportion of the total variance in the data explained by each PC. Notice that the values of the second row are progressively summed, resulting in the cumulative proportion of explained variance in the third row. In most cases, the first and second PCs account for most of the explained variance (here, PC1 – 62.66% + PC2 – 16.01% = 78.66%). The remaining PCs are progressively less relevant for explaining the variance of the data and are usually ignored.
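These proportions can be recomputed directly from the standard deviations returned by prcomp(), since the variance of each PC is its standard deviation squared (a minimal sketch; prop_var is a name introduced here for illustration):
## proportion of variance explained by each PC
prop_var <- env_prcomp$sdev^2 / sum(env_prcomp$sdev^2)
round(prop_var, 4)
## cumulative proportion, as in the third row of summary()
round(cumsum(prop_var), 4)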
For most purposes, the first two PCs are used for plots and for visualizing the dimensionality reduction. However, we can decide how many PCs to retain using a scree plot and other diagnostics from the R-package factoextra. We can get a scree plot with the function fviz_eig().
## install package, if not present. It also requires the R-package ggplot2.
install.packages("factoextra")
install.packages("ggplot2")
library(factoextra)
#Loading required package: ggplot2
#Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
factoextra::fviz_eig(env_prcomp)
This scree plot is based on the eigenvalues, which represent the amount of the total variance explained by each principal component; fviz_eig() displays them as percentages. For example, PC1 explains 62.66% of the variance. These plots therefore show the percentage of variance explained by each PC and support our intuition that the first two PCs are the most useful for summarizing the variability of our dataset. We see a marked drop in the percentage of explained variance after PC2, so we can ignore the remaining PCs (3-10), which explain relatively little variation. Thus, we should focus primarily on PC1 and PC2.
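If you prefer not to load extra packages, base R offers a similar diagnostic; note that it plots the variances (eigenvalues) of the PCs rather than percentages (a minimal sketch):
## base-R scree plot of the PC variances
screeplot(env_prcomp, npcs = 10, type = "lines", main = "Scree plot for env_prcomp")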
We can graph our samples (i.e., the individuals in our dataset) with their PC1 and PC2 coordinates using the function fviz_pca_ind().
factoextra::fviz_pca_ind(env_prcomp,
                         col.ind = "cos2",
                         gradient.cols = c("firebrick4", "gray", "dodgerblue4"),
                         repel = TRUE)
Individuals with a similar profile tend to group together, and the color scale cos2 refers to the quality of representation. Notice that you can choose the color gradient with the argument gradient.cols, which typically takes three colors.
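If you prefer a bar chart of cos2 values instead of a color gradient, factoextra also provides fviz_cos2(); a minimal sketch, assuming your version of factoextra exports it with these arguments:
## quality of representation (cos2) of the sites on the first two PCs
factoextra::fviz_cos2(env_prcomp, choice = "ind", axes = 1:2)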
### NOTE: Sometimes you get this error.
#Error in UseMethod("depth") :
# no applicable method for 'depth' applied to an object of class "NULL"
### Try to resubmit the fviz_pca_ind() after closing the graph window.
We can also graph the variables that we used in our PCA using the function fviz_pca_var().
factoextra::fviz_pca_var(env_prcomp,
                         col.var = "contrib",
                         gradient.cols = c("dodgerblue4", "#E7B800", "firebrick4"),
                         repel = TRUE)
We can see that positively correlated variables point to the same side of the plot, while negatively correlated variables point to opposite sides of the graph. As before, you can choose the color gradient with the argument gradient.cols, which typically takes three colors.
### NOTE: Sometimes you get this error.
#Error in UseMethod("depth") :
# no applicable method for 'depth' applied to an object of class "NULL"
### Try to resubmit the fviz_pca_var() after closing the graph window.
We can also plot both individuals and variables at the same time using the function fviz_pca_biplot().
factoextra::fviz_pca_biplot(env_prcomp,
                            col.var = "gray",        # Variables color
                            col.ind = "dodgerblue4", # Individuals color
                            repel = TRUE)
We can extract other metrics of our PCA with several functions of factoextra.
The eigenvalues can be printed with the function get_eigenvalue(); like summary(env_prcomp), they describe the importance of each PC.
## Eigenvalues
eig_val_env_prcomp <- factoextra::get_eigenvalue(env_prcomp)
eig_val_env_prcomp
# eigenvalue variance.percent cumulative.variance.percent
#Dim.1 6.26563999 62.6563999 62.65640
#Dim.2 1.60059015 16.0059015 78.66230
#Dim.3 0.97531642 9.7531642 88.41547
#Dim.4 0.47070789 4.7070789 93.12254
#Dim.5 0.25959955 2.5959955 95.71854
#Dim.6 0.15573609 1.5573609 97.27590
#Dim.7 0.10319822 1.0319822 98.30788
#Dim.8 0.07975052 0.7975052 99.10539
#Dim.9 0.06659361 0.6659361 99.77132
#Dim.10 0.02286757 0.2286757 100.00000
The variable results can be printed with the function get_pca_var().
## Results for Variables
res_var_env_prcomp <- factoextra::get_pca_var(env_prcomp)
# Coordinates
res_var_env_prcomp$coord
# Contributions to the PCs
res_var_env_prcomp$contrib
# Quality of representation
res_var_env_prcomp$cos2
The individual (i.e., sample) results can be printed with the function get_pca_ind().
## Results for individuals
res_ind_env_prcomp <- factoextra::get_pca_ind(env_prcomp)
# Coordinates -- this is the same as env_prcomp$x
res_ind_env_prcomp$coord
# Contributions to the PCs
res_ind_env_prcomp$contrib
# Quality of representation
res_ind_env_prcomp$cos2
8.2 Multidimensional scaling
Multidimensional scaling (MDS) helps to visualize similarities among the individual cases or samples of a dataset. In this analysis, we reduce dimensions so we can represent the data points in a more interpretable manner. We start with a set of samples or individuals measured on several numeric variables. After the analysis, we get a representation of the distances among these samples or individuals in a lower-dimensional space (e.g., 2 dimensions), so we can infer the relative relationships (distances) between them. In other words, individuals or samples that are closer together are more similar.
We will use our example env_log_red, but we need to add names to the sites for visualization.
## We can name our samples by assigning a unique row name to each.
env_log_red_MDS <- env_log_red
rownames(env_log_red_MDS) <- paste0("site_", 1:nrow(env_log_red_MDS))
head(env_log_red_MDS)
# dfs alt slo flo pH har pho nit oxy bdo
#site_1 1.098612 6.839476 1.820671 4.430817 4.369448 3.806662 0.0000000 2.995732 4.804021 3.295837
#site_2 3.091042 6.837333 1.233726 4.605170 4.382027 3.688879 0.6931472 2.995732 4.634729 2.944439
#site_3 4.624973 6.817831 1.291434 5.192957 4.418841 3.951244 1.6094379 3.091042 4.653960 3.555348
#site_4 5.220356 6.749931 1.251905 5.533389 4.382027 4.276666 2.3025851 3.044522 4.700480 2.564949
#site_5 5.370638 6.744059 1.156252 5.575949 4.394449 4.430817 3.6375862 3.951244 4.382027 4.127134
#site_6 5.780744 6.740519 1.251905 5.655992 4.369448 4.094345 2.9957323 2.708050 4.624973 3.970292
We can perform metric (classical) MDS using the functions dist() and cmdscale() with two dimensions (k = 2).
## Classical MDS: N rows (objects) x p columns (variables)
# get Euclidean distances between the rows
env_log_red_MDS_d <- dist(env_log_red_MDS)
# k is the number of dimensions
env_log_red_MDS_fit <- cmdscale(env_log_red_MDS_d, eig = TRUE, k = 2)
str(env_log_red_MDS_fit)
#List of 5
# $ points: num [1:30, 1:2] -7.21 -5.58 -3.84 -3.11 -1.83 ...
# ..- attr(*, "dimnames")=List of 2
# .. ..$ : chr [1:30] "site_1" "site_2" "site_3" "site_4" ...
# .. ..$ : NULL
#$ eig : num [1:30] 203.13 25.14 9.43 4.54 3.22 ...
#$ x : NULL
#$ ac : num 0
#$ GOF : num [1:2] 0.92 0.92
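The GOF values summarize how much of the distance structure is captured by the two retained dimensions; they are computed from the eigenvalues. A minimal sketch of the first criterion (this reproduces, up to the handling of negative eigenvalues, the first GOF value above):
## share of the (absolute) eigenvalue total captured by the first k = 2 dimensions
sum(env_log_red_MDS_fit$eig[1:2]) / sum(abs(env_log_red_MDS_fit$eig))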
We can plot these samples using the x-y coordinates in env_log_red_MDS_fit$points.
## plot solution
x <- env_log_red_MDS_fit$points[,1]
y <- env_log_red_MDS_fit$points[,2]
plot(x, y, xlab = "Coordinate 1 - MDS", ylab = "Coordinate 2 - MDS", main = "Metric MDS for env_log_red", type = "n")
text(x, y, labels = row.names(env_log_red_MDS), cex=.7)
We can perform nonmetric multidimensional scaling (NMDS) using the function isoMDS() of the R-package MASS.
## For nonmetric MDS, we load MASS (we installed this package before)
library(MASS)
env_log_red_MDS_d <- dist(env_log_red_MDS)
env_log_red_NMDS <- MASS::isoMDS(env_log_red_MDS_d, k = 2)
str(env_log_red_NMDS)
#List of 2
# $ points: num [1:30, 1:2] -7.64 -5.6 -3.74 -3.08 -1.72 ...
# ..- attr(*, "dimnames")=List of 2
# .. ..$ : chr [1:30] "site_1" "site_2" "site_3" "site_4" ...
# .. ..$ : NULL
# $ stress: num 2.49
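The stress value (about 2.5 on isoMDS()'s percentage scale) measures how well the two-dimensional configuration preserves the original distances; lower is better. A Shepard diagram is a common way to inspect this fit (a minimal sketch; env_shepard is a name introduced here for illustration):
## Shepard diagram: original distances vs. distances in the NMDS configuration
env_shepard <- MASS::Shepard(env_log_red_MDS_d, env_log_red_NMDS$points)
plot(env_shepard$x, env_shepard$y, pch = 19, cex = 0.5,
     xlab = "Original distance", ylab = "NMDS distance")
lines(env_shepard$x, env_shepard$yf, type = "S", col = "firebrick4")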
## plot the NMDS solution
x <- env_log_red_NMDS$points[,1]
y <- env_log_red_NMDS$points[,2]
plot(x, y, xlab = "Coordinate 1 - NMDS", ylab = "Coordinate 2 - NMDS", main = "Nonmetric MDS for env_log_red", type = "n")
text(x, y, labels = row.names(env_log_red_MDS), cex=.7)
8.3 Cluster analysis: DFA
Many biological datasets have an underlying structure in which groups or clusters can be the outcome of a treatment. Finding the sets of objects that are most similar can help identify groups or clusters that can then be used to classify other samples.
We use a classic example known as Fisher’s or Anderson’s Iris data set, a dataset of multiple measurements introduced by Ronald Fisher to illustrate taxonomic problems as an example of linear discriminant analysis. This dataset consists of 50 samples from each of three species of flowers: Iris setosa, Iris virginica and Iris versicolor. The dataset includes four features of each flower measured in centimeters. This dataset can be loaded from the R-package datasets.
install.packages("datasets")
library(datasets)
data(iris)
iris_dataset <- iris
summary(iris_dataset)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50
#1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 versicolor:50
#Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50
#Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
#3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
#Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
head(iris_dataset)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 5.1 3.5 1.4 0.2 setosa
#2 4.9 3.0 1.4 0.2 setosa
#3 4.7 3.2 1.3 0.2 setosa
#4 4.6 3.1 1.5 0.2 setosa
#5 5.0 3.6 1.4 0.2 setosa
#6 5.4 3.9 1.7 0.4 setosa
We can perform discriminant function analysis (DFA), or linear discriminant analysis (LDA), to find a function (e.g., a linear classifier) that separates samples into two or more classes. For this analysis, we will use the function lda() from the package MASS.
## we have used the packages MASS and ggplot2.
library(MASS)
library(ggplot2)
## we get the names of predictors (i.e., measurements) and target group
str(iris_dataset)
#'data.frame': 150 obs. of 5 variables:
#$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
We see that the grouping variable is Species, with three species, and that four continuous measurements are associated with each individual in the dataset. In this case, we input the formula Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width.
## we fit the lda function
iris_lda_model <- lda(formula = Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris_dataset)
iris_lda_model
#Call:
#lda(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
# data = iris_dataset)
#
#Prior probabilities of groups:
# setosa versicolor virginica
# 0.3333333 0.3333333 0.3333333
#
#Group means:
# Sepal.Length Sepal.Width Petal.Length Petal.Width
#setosa 5.006 3.428 1.462 0.246
#versicolor 5.936 2.770 4.260 1.326
#virginica 6.588 2.974 5.552 2.026
#
#Coefficients of linear discriminants:
# LD1 LD2
#Sepal.Length 0.8293776 0.02410215
#Sepal.Width 1.5344731 2.16452123
#Petal.Length -2.2012117 -0.93192121
#Petal.Width -2.8104603 2.83918785
#
#Proportion of trace:
# LD1 LD2
#0.9912 0.0088
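As an aside, because the formula above uses every remaining column as a predictor, the shorthand Species ~ . would fit the same model (a minimal sketch):
## equivalent call: "." stands for all columns other than Species
lda(formula = Species ~ ., data = iris_dataset)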
We can now evaluate how well LDA discriminates our dataset. We need to install two R-packages, caret and e1071.
## we install these packages
install.packages("caret")
install.packages("e1071")
library(caret)
library(e1071)
We need to fit a predictive model over our data using the function train() and obtain the confusion matrix with confusionMatrix(), a table that shows how good the classification using LDA was.
iris_lda_predict <- caret::train(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, method = "lda", data = iris_dataset)
iris_lda_predict
#Linear Discriminant Analysis
#
#150 samples
# 4 predictor
# 3 classes: 'setosa', 'versicolor', 'virginica'
#
#No pre-processing
#Resampling: Bootstrapped (25 reps)
#Summary of sample sizes: 150, 150, 150, 150, 150, 150, ...
#Resampling results:
#
# Accuracy Kappa
# 0.9697408 0.9542207
iris_lda_confusion_matrix <- confusionMatrix(iris_dataset$Species, predict(iris_lda_predict, iris_dataset))
iris_lda_confusion_matrix
#Confusion Matrix and Statistics
#
# Reference
#Prediction setosa versicolor virginica
# setosa 50 0 0
# versicolor 0 48 2
# virginica 0 1 49
#
#Overall Statistics
#
# Accuracy : 0.98
# 95% CI : (0.9427, 0.9959)
# No Information Rate : 0.34
# P-Value [Acc > NIR] : < 2.2e-16
#
# Kappa : 0.97
#
# Mcnemar's Test P-Value : NA
#
#Statistics by Class:
#
# Class: setosa Class: versicolor Class: virginica
#Sensitivity 1.0000 0.9796 0.9608
#Specificity 1.0000 0.9802 0.9899
#Pos Pred Value 1.0000 0.9600 0.9800
#Neg Pred Value 1.0000 0.9900 0.9800
#Prevalence 0.3333 0.3267 0.3400
#Detection Rate 0.3333 0.3200 0.3267
#Detection Prevalence 0.3333 0.3333 0.3333
#Balanced Accuracy 1.0000 0.9799 0.9753
In the confusion matrix, the values that fall outside the agreement between Reference and Prediction are discrepancies between what we assigned and what LDA determined to be part of a group. We can add these predictions to iris_dataset using the function predict().
iris_lda_prediction <- predict(iris_lda_predict, iris_dataset)
iris_dataset$lda_species_prediction <- iris_lda_prediction
iris_dataset
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species lda_species_prediction
#1 5.1 3.5 1.4 0.2 setosa setosa
#2 4.9 3.0 1.4 0.2 setosa setosa
#3 4.7 3.2 1.3 0.2 setosa setosa
#...
#69 6.2 2.2 4.5 1.5 versicolor versicolor
#70 5.6 2.5 3.9 1.1 versicolor versicolor
#71 5.9 3.2 4.8 1.8 versicolor virginica
#72 6.1 2.8 4.0 1.3 versicolor versicolor
#...
#83 5.8 2.7 3.9 1.2 versicolor versicolor
#84 6.0 2.7 5.1 1.6 versicolor virginica
#85 5.4 3.0 4.5 1.5 versicolor versicolor
#...
#132 7.9 3.8 6.4 2.0 virginica virginica
#133 6.4 2.8 5.6 2.2 virginica virginica
#134 6.3 2.8 5.1 1.5 virginica versicolor
#...
You can see above the discrepancies in the classification assigned by LDA; these discrepancies should agree with the confusion matrix.
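To list exactly those discrepancies, you can filter the rows where the original and predicted labels disagree (a minimal sketch):
## rows where the LDA prediction differs from the original species label
subset(iris_dataset, as.character(Species) != as.character(lda_species_prediction))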
## we can add the LD coordinates
iris_dataset <- cbind(iris_dataset, predict(iris_lda_model)$x)
head(iris_dataset)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species lda_species_prediction LD1 LD2
#1 5.1 3.5 1.4 0.2 setosa setosa 8.061800 0.3004206
#2 4.9 3.0 1.4 0.2 setosa setosa 7.128688 -0.7866604
#3 4.7 3.2 1.3 0.2 setosa setosa 7.489828 -0.2653845
#4 4.6 3.1 1.5 0.2 setosa setosa 6.813201 -0.6706311
#5 5.0 3.6 1.4 0.2 setosa setosa 8.132309 0.5144625
#6 5.4 3.9 1.7 0.4 setosa setosa 7.701947 1.4617210
## we subset the iris_dataset to the original classification
iris_dataset_original <- subset(iris_dataset, select = c(Species, LD1, LD2))
head(iris_dataset_original)
# Species LD1 LD2
#1 setosa 8.061800 0.3004206
#2 setosa 7.128688 -0.7866604
#3 setosa 7.489828 -0.2653845
#4 setosa 6.813201 -0.6706311
#5 setosa 8.132309 0.5144625
#6 setosa 7.701947 1.4617210
## show original groups
ggplot(iris_dataset, aes(LD1, LD2)) +
geom_point(aes(color = Species)) +
ggtitle("Iris Original Groups")
## show LDA predicted groups
ggplot(iris_dataset, aes(LD1, LD2)) +
geom_point(aes(color = lda_species_prediction)) +
ggtitle("Iris LDA Predicted Groups")
NOTE: sometimes ggplot() might need to be run twice, without closing the plot window, to produce a correct plot.
Can you find the changes in classification between the original and the LDA-predicted groups?
8.4 Cluster analysis: k-means (unsupervised)
We can also use our own data without defining groups and ask an algorithm to find homogeneous subgroups in the data. One such approach is k-means clustering, for which we can use the preloaded function kmeans().
We will use the iris_dataset and assume that there are 3 groups (since we know this dataset contains three putative species).
## Let's check again this dataset
iris_dataset <- iris
head(iris_dataset)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 5.1 3.5 1.4 0.2 setosa
#2 4.9 3.0 1.4 0.2 setosa
#3 4.7 3.2 1.3 0.2 setosa
#4 4.6 3.1 1.5 0.2 setosa
#5 5.0 3.6 1.4 0.2 setosa
#6 5.4 3.9 1.7 0.4 setosa
## We need to remove the column with species assignments
iris_dataset_num <- iris_dataset[, -which(names(iris_dataset) %in% c("Species"))]
head(iris_dataset_num)
# Sepal.Length Sepal.Width Petal.Length Petal.Width
#1 5.1 3.5 1.4 0.2
#2 4.9 3.0 1.4 0.2
#3 4.7 3.2 1.3 0.2
#4 4.6 3.1 1.5 0.2
#5 5.0 3.6 1.4 0.2
#6 5.4 3.9 1.7 0.4
Notice how I removed the "Species" column; this is one of many ways to remove columns by name. Now we look for 3 groups with centers = 3, repeating the clustering from 30 random starts with the argument nstart = 30 in the function kmeans().
iris_dataset_kmeans <- kmeans(iris_dataset_num, centers = 3, nstart = 30)
iris_dataset_kmeans
#K-means clustering with 3 clusters of sizes 50, 62, 38
#
#Cluster means:
# Sepal.Length Sepal.Width Petal.Length Petal.Width
#1 5.006000 3.428000 1.462000 0.246000
#2 5.901613 2.748387 4.393548 1.433871
#3 6.850000 3.073684 5.742105 2.071053
#
#Clustering vector:
# [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
# [70] 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3
#[139] 2 3 3 3 2 3 3 3 2 3 3 2
#
#Within cluster sum of squares by cluster:
#[1] 15.15100 39.82097 23.87947
# (between_SS / total_SS = 88.4 %)
#
#Available components:
#
#[1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss" "size" "iter" "ifault"
You can check how good the clustering was based on the between_SS / total_SS value. The closer this value is to 100%, the better. In this case, we got 88.4%, which is very good. We can take the groupings given in the Clustering vector, which is accessed with iris_dataset_kmeans$cluster, add them to our data frame, and compare them with the original species names.
## add the cluster vector to iris_dataset_original
iris_dataset_kmeans$cluster
# [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
# [70] 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3
#[139] 2 3 3 3 2 3 3 3 2 3 3 2
iris_dataset_original$kmeans_cluster <- factor(iris_dataset_kmeans$cluster)
head(iris_dataset_original)
# Species LD1 LD2 kmeans_cluster
#1 setosa 8.061800 0.3004206 1
#2 setosa 7.128688 -0.7866604 1
#3 setosa 7.489828 -0.2653845 1
#4 setosa 6.813201 -0.6706311 1
#5 setosa 8.132309 0.5144625 1
#6 setosa 7.701947 1.4617210 1
## we can plot these groups using the lda coordinates to visualize.
ggplot(iris_dataset_original, aes(LD1, LD2)) +
geom_point(aes(color = kmeans_cluster)) +
ggtitle("Iris kmeans with 3 Groups")