HCPC R function - difference between cluster data and cluster visualisation

I'm using the FactoMineR package and its HCPC function to segment some observations. I then plotted the result with plot.HCPC() and noticed that two display options of this function, which should illustrate the same result, disagree.
library(FactoMineR)
data(USArrests)
pca <- PCA(USArrests, ncp = 3, graph = FALSE)
hcpc <- HCPC(pca, graph = FALSE)
With choice = 'map', Arkansas is in the green cluster, but with choice = 'tree', Arkansas is in the red cluster! (The other states of the green cluster stay green in both the map and the dendrogram/tree.)
plot(hcpc, choice = 'map')
plot(hcpc, choice = 'tree')
According to the numeric results (hcpc$data.clust), there are 8 observations in cluster 3 (the green cluster), which matches the 'map' visualisation but not the dendrogram/tree visualisation.
Do you know if I did something wrong, if I missed something important?

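One of the arguments of the HCPC function is consol, which defaults to TRUE:
consol: a boolean. If TRUE, a k-means consolidation is performed
(consolidation cannot be performed if kk is used and equals a number).
With consolidation on, the clusters obtained by cutting the tree are refined by k-means, so some observations (here Arkansas) can change cluster. hcpc$data.clust and the 'map' plot show the consolidated clusters, while the 'tree' plot shows the partition from the hierarchical tree itself. Turning consolidation off makes the two plots agree: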
library(FactoMineR)
data(USArrests)
pca <- PCA(USArrests, ncp = 3, graph = FALSE)
hcpc <- HCPC(pca, consol = FALSE, graph = FALSE)
plot(hcpc, choice = 'map')
plot(hcpc, choice = 'tree')
Hope it will help you
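To see the mechanism behind the mismatch without relying on HCPC internals, the consolidation step can be imitated in base R. This is only a sketch under assumptions: Ward clustering on the first three standardized principal components, a cut into k = 4 clusters (chosen here for illustration, HCPC picks its own number), and k-means started from the centers of the tree clusters. It is not HCPC's actual code.

```r
# Scores on the first 3 principal components of the standardized data
scores <- prcomp(USArrests, scale. = TRUE)$x[, 1:3]

# Step 1: hierarchical clustering and tree cut (what a dendrogram can show)
hc <- hclust(dist(scores), method = "ward.D2")
k <- 4  # assumption for illustration
tree_cut <- cutree(hc, k = k)

# Step 2: k-means "consolidation" started from the tree-cluster centers
centers <- apply(scores, 2, function(col) tapply(col, tree_cut, mean))
km <- kmeans(scores, centers = centers)

# Cross-tabulate: off-diagonal counts are observations that changed cluster
moved <- table(tree = tree_cut, consolidated = km$cluster)
print(moved)

# Names of the states whose cluster changed during consolidation (if any)
names(tree_cut)[tree_cut != km$cluster]
```

Any state listed at the end is coloured differently depending on whether the plot uses the tree partition or the consolidated one.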

Related

Is it possible to color kmeans results according to its annotation, rather than by the clustering results?

I have labeled data and I'd like to estimate whether clustering results agree with these labels.
In hierarchical clustering, I was able to do so using:
pheatmap(data,annotation_col=metadata,annotation_row=metadata,annotation_colors=anno_colors)
Is it possible to do so also with kmeans results? I tried fviz_cluster but I could not find how to color each point according to its annotation and not by its cluster.
I did not find how to compare known labels with kmeans clustering results in one figure using fviz_cluster.
Instead, I colored the dots according to their labels and shaped the dots according to their clusters.
iris.scaled <- scale(iris[, 1:4])
pc <- prcomp(iris.scaled)
km <- kmeans(iris.scaled, 3, nstart = 100)
pca.var <- pc$sdev^2
pca.var.per <- round(pca.var/sum(pca.var)*100, 1)
# Color according to iris$Species
typeSp<-as.vector(iris$Species)
colvec<-typeSp
colvec[is.element(typeSp,"setosa")] = "red"
colvec[is.element(typeSp,"versicolor")] = "blue"
colvec[is.element(typeSp,"virginica")] = "magenta"
# Shape according to kmeans cluster
typeCl<-km$cluster
pchvec<-1:length(colvec)
pchvec[is.element(typeCl,1)] = 0
pchvec[is.element(typeCl,2)] = 19
pchvec[is.element(typeCl,3)] = 17
plot(pc$x[,1], pc$x[,2], col=colvec, pch=pchvec,
xlab = paste("PC1 (",pca.var.per[1],"%)"),
ylab = paste("PC2 (",pca.var.per[2],"%)"),
main="Compare known label (colors) to Kmeans cluster (shapes)")
# Legend for the label colors (order must match the colvec mapping above)
legend("topright",legend=c("setosa","versicolor","virginica"),
col=c("red","blue","magenta"),
cex=0.9, pch=c(8,8,8))
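As a numerical complement to the plot, the agreement between known labels and k-means clusters can also be read from a contingency table. A minimal sketch, assuming the same scaled iris data; the seed and the name share_dominant are choices made here for illustration:

```r
set.seed(1)  # k-means uses random starts; fix the seed for reproducibility
iris.scaled <- scale(iris[, 1:4])
km <- kmeans(iris.scaled, centers = 3, nstart = 100)

# Rows: known labels, columns: k-means clusters
agreement <- table(label = iris$Species, cluster = km$cluster)
print(agreement)

# Share of observations that fall in the dominant cluster of their label
share_dominant <- sum(apply(agreement, 1, max)) / sum(agreement)
share_dominant
```

A label whose row is concentrated in a single column is well recovered by the clustering; a row spread over several columns is not.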

Visualizing PCA with large number of variables in R using ggbiplot

I am trying to visualize a PCA that includes 87 variables.
prc <-prcomp(df[,1:87], center = TRUE, scale. = TRUE)
ggbiplot(prc, labels = rownames(df[,1:87]), var.axes = TRUE)
When I create the biplot, many of the vectors overlap with each other, making it impossible to read the labels. I was wondering if there is any way to only show some of the labels at a time. For example, I think it'd be useful if I could create a few separate biplots with each one showing only a subset of the labels on the vectors.
This question seems closely related, but I don't know if it translates to the latest version of ggbiplot. I'm also not sure how to modify the original functions.
A potential solution is to use the factoextra package to visualize your PCA results. The fviz_pca_biplot() function includes a repel argument; when repel = TRUE, the plot labels are spread out to minimize overlap. The documentation also describes a select.var argument: for example, select.var = list(contrib = 5) displays only the 5 variables that contribute most, and select.var = list(name = ...) appears to let you specify by name the subset of variables you want shown.
# read data
df <- mtcars[, c(1:7,10:11)]
# perform PCA
library("FactoMineR")
res.pca <- PCA(df, graph = FALSE)
# visualize
library(factoextra)
fviz_pca_biplot(res.pca, repel = TRUE, select.var = list(contrib = 5))
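The request for several biplots, each showing only some of the vectors, could be sketched with the name entry of select.var. This assumes the same mtcars subset as above; the split of variables into two groups is arbitrary and only for illustration.

```r
library(FactoMineR)
library(factoextra)

df <- mtcars[, c(1:7, 10:11)]
res.pca <- PCA(df, graph = FALSE)

# Split the variable names into two groups (arbitrary split, for illustration)
vars <- colnames(df)
groups <- list(vars[1:5], vars[6:9])

# One biplot per group of variable vectors
plots <- lapply(groups, function(v) {
  fviz_pca_biplot(res.pca, repel = TRUE, select.var = list(name = v))
})
plots[[1]]
plots[[2]]
```

With 87 variables, the same pattern would apply with more, smaller groups.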

FactoMineR/factoextra visualize all the clusters in the dendrogram

I performed a hierarchical clustering on a data frame using the HCPC function of the FactoMineR package. The problem is that I cannot visualize the number of clusters I asked for when I draw the dendrogram using factoextra.
Below is a reproducible example of my problem:
model <- HCPC(iris[,1:4], nb.clust = 5)
There are indeed 5 clusters in the model above.
fviz_dend(model, k = 5,
cex = 0.7,
palette = "default",
rect = TRUE, rect_fill = TRUE)
But only 3 are mapped in the dendrogram.
I ran into the same problem: fviz_dend always draws what it considers the optimal number of clusters, even when I tried to override it in either HCPC or fviz_dend.
One way to fix this while sticking to FactoMineR and factoextra is to change the number of clusters stored in the HCPC result:
model$call$t$nb.clust = 5
And then run the fviz_dend function.
This should return the result that you were expecting.
Alternatively, you can use the dendextend R package with its color_branches function:
library(dendextend)
dend <- USArrests %>% dist %>% hclust(method = "ave") %>% as.dendrogram
dd <- color_branches(dend,5)
plot(dd)
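If neither package is a hard requirement, base R can also mark a chosen number of clusters on a dendrogram. A minimal sketch, assuming Ward clustering on the standardized iris measurements; this is not the same pipeline as HCPC, so the clusters may differ from those in model.

```r
# Ward clustering on the standardized iris measurements
hc <- hclust(dist(scale(iris[, 1:4])), method = "ward.D2")

# Draw the tree without the 150 leaf labels, then box 5 clusters
plot(hc, labels = FALSE, main = "iris, cut into 5 clusters")
rect.hclust(hc, k = 5, border = 2:6)

# The same cut as a vector of cluster memberships
clusters <- cutree(hc, k = 5)
table(clusters)
```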

R - FactoMineR MCA: How to select important features?

My dataset is a mixture of numeric and categorical values, and the outcome is a class label. There are around 400 columns, and the dataset contains missing values. I have several questions:
How should I deal with missing values? I replaced all missing values with -1; is that okay?
How do I apply MCA to this data? Should I combine train and test and then apply MCA?
How do I interpret the output of MCA to get the most relevant features?
Do not touch your dataset.
If you use the FactoMineR package, it handles missing values itself.
Try this kind of code:
library(FactoMineR)
library(factoextra)
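df <- data.frame(df) # Dataset with only categorical variables
# quali.sup: indices of the supplementary categorical columns (e.g. the class label), to be defined by you.
# It must be passed by name, because the second positional argument of MCA() is ncp.
res.mca <- MCA(df, quali.sup = quali.sup)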
# Visualize Principal Components
fviz_eig(res.mca,
addlabels = TRUE)
# Individual plot
fviz_mca_ind(res.mca,
col.ind = "cos2",
axes = c(1,2), # axes by default
repel = TRUE)
# Variable contributions on axis 1
fviz_contrib(res.mca,
choice = "var",
axes = 1, # you can switch with the other axes
top = 10)
# Best variable contribution
fviz_mca_var(res.mca, col.var = "contrib",
axes = c(1,2),
repel = TRUE)
Interpretation is similar to PCA:
Scree plot (fviz_eig): percentage of variance explained by each dimension
Individual and variable plots: bring out associations between variables, and outliers
Contribution: percentage contribution of each variable to each axis
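The contributions shown by fviz_contrib can also be read directly from the MCA result, which helps when ranking many categories. A small sketch, assuming an all-categorical data frame built from mtcars purely for illustration (it stands in for the 400-column dataset of the question):

```r
library(FactoMineR)

# Small all-categorical data frame built from mtcars (illustration only)
df <- data.frame(
  cyl  = factor(mtcars$cyl),
  gear = factor(mtcars$gear),
  am   = factor(mtcars$am),
  vs   = factor(mtcars$vs)
)
res.mca <- MCA(df, graph = FALSE)

# Contributions (in %) of each category to dimension 1, largest first
contrib_dim1 <- sort(res.mca$var$contrib[, 1], decreasing = TRUE)
head(contrib_dim1)
```

Categories at the top of this ranking are the ones that shape the first dimension most; the same can be repeated for other columns of res.mca$var$contrib.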

How to get a good dendrogram using R

I am using R to do a hierarchical cluster analysis using Ward's method on Euclidean distances. I have a matrix of x columns (stations) and y rows (floating-point numbers); the first row contains the header (the stations' names). I want a readable dendrogram where the station names appear at the bottom of the tree, because I currently cannot interpret my result. My aim is to find the stations that are similar. However, with the following code I get numbers (100, 101, 102, ...) on the lower branches.
Yu<-read.table("yu_s.txt",header = T, dec=",")
library(cluster)
agn1 <- agnes(Yu, metric = "euclidean", method="ward", stand = TRUE)
hcd<-as.dendrogram(agn1)
par(mfrow=c(3,1))
plot(hcd, main="Main")
plot(cut(hcd, h=25)$upper,
main="Upper tree of cut at h=25")
plot(cut(hcd, h=25)$lower[[2]],
main="Second branch of lower tree with cut at h=25")
A nice collection of examples is available here (http://gastonsanchez.com/blog/how-to/2012/10/03/Dendrograms.html)
Two methods:
With hclust from base R:
hc <- hclust(dist(mtcars), method = "ward.D") # "ward" has been renamed "ward.D" in recent R versions
plot(hc)
with ggplot and ggdendro
library(ggplot2)
library(ggdendro)
# basic option
ggdendrogram(hc, rotate = TRUE, size = 4, theme_dendro = FALSE)
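Regarding the numbered leaves in the question: agnes and hclust cluster the rows of their input, so with stations in columns the leaves are row numbers. If the goal is to group stations, one option is to transpose the data first. A minimal base R sketch, using simulated data with hypothetical station names in place of yu_s.txt:

```r
set.seed(42)
# Simulated stand-in for yu_s.txt: 30 rows of measurements, 6 stations in columns
# (station names are hypothetical)
Yu <- as.data.frame(matrix(rnorm(30 * 6), nrow = 30))
colnames(Yu) <- paste0("Station", LETTERS[1:6])

# Standardize each station, then transpose so that stations become rows
stations <- t(scale(Yu))

# Ward clustering of the stations; the leaves now carry the station names
hc <- hclust(dist(stations), method = "ward.D2")
plot(hc, main = "Stations", xlab = "", sub = "")
hc$labels
```

The resulting hc object can be passed to ggdendrogram in the same way as above.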

Resources