Visualizing PCA with a large number of variables in R using ggbiplot

I am trying to visualize a PCA that includes 87 variables.
prc <- prcomp(df[, 1:87], center = TRUE, scale. = TRUE)
ggbiplot(prc, labels = rownames(df), var.axes = TRUE)
When I create the biplot, many of the vectors overlap with each other, making it impossible to read the labels. I was wondering if there is any way to only show some of the labels at a time. For example, I think it'd be useful if I could create a few separate biplots with each one showing only a subset of the labels on the vectors.
This question seems closely related, but I don't know if it translates to the latest version of ggbiplot. I'm also not sure how to modify the original functions.

A potential solution is to use the factoextra package to visualize your PCA results. The fviz_pca_biplot() function includes a repel argument; when repel = TRUE, the labels are spread out to minimize overlap. The documentation also describes a select.var argument: select.var = list(contrib = 5) displays only the 5 most influential vectors, and select.var = list(name = c(...)) lets you specify an explicit subset of variables to show.
# read data (drop the categorical columns vs and am from mtcars)
df <- mtcars[, c(1:7, 10:11)]
# perform PCA
library(FactoMineR)
res.pca <- PCA(df, graph = FALSE)
# visualize, keeping only the 5 variables that contribute most
library(factoextra)
fviz_pca_biplot(res.pca, repel = TRUE, select.var = list(contrib = 5))
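To split the labels across several biplots, as asked in the question, select.var also accepts explicit names; a minimal sketch (the variable names below are from mtcars, so substitute chunks of your 87 column names):
# show only a named subset of the variable vectors; repeat with other subsets
fviz_pca_biplot(res.pca, repel = TRUE,
                select.var = list(name = c("mpg", "wt", "hp")))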

Related

Understanding the output from a factor analysis using the FAMD function

I have some survey data where people were asked questions and given a yes or no option (1=yes, 0=no). I would like to be able to pick out some patterns in this data.
The questions are:
Do you enjoy XX work?
Do you do XX work alone?
Has your workload increased?
Do you have a backlog of work?
I would like to know whether people who work alone are more likely to have an increased workload, have a backlog of work, and not enjoy their job. To answer this, I think factor analysis is the way to go, but I'm struggling to interpret the output.
Here is an example of my data:
enjoy <- c(1,1,0,1,0,1,0,0,0,1)
alone <- c(0,0,1,1,1,0,0,1,1,0)
workload <- c(0,0,1,1,0,1,0,0,0,1)
backlog <- c(0,0,1,1,0,1,0,0,0,0)
data <- data.frame(enjoy, alone, workload, backlog)
library(dplyr)
data <- data %>% mutate_if(is.numeric, as.character) ## convert from numeric to categorical
I'm using the FAMD function in FactoMineR, as it can handle categorical data.
library(FactoMineR)
data_famd <- FAMD(data, graph = FALSE)
Then using factoextra, I can see which variables contribute to each axis
library(factoextra)
# Contribution to the first dimension
fviz_contrib(data_famd, "var", axes = 1) ## backlog & workload
# Contribution to the second dimension
fviz_contrib(data_famd, "var", axes = 2) ## enjoy and alone
Then I can make this plot:
fviz_mfa_ind(data_famd,
             habillage = "alone",  # color by groups
             palette = c("#00AFBB", "#E7B800", "#FC4E07"),
             addEllipses = TRUE, ellipse.type = "confidence",
             repel = TRUE)         # avoid text overlapping
This suggests that people who work alone and people who don't answer the questions differently. But I can't tell what answers the people who work alone (yellow) are giving compared to those who don't. The two groups are clearly distinct, so they must be doing something differently.
My main question is: what do the axes mean? I've done PCAs on continuous data before, and from the loadings I can figure out what the axes mean and therefore interpret the graphs. How do you do this for a factor analysis? Is there a different package?
Thanks for any help.
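For reference, a FAMD result stores the quantities that play the role of loadings; a minimal sketch, assuming the data_famd object built above:
library(factoextra)
# variable-level results: coordinates and contributions per dimension
var <- get_famd_var(data_famd)
var$coord    # how each variable relates to each axis (the loadings analogue)
var$contrib  # % contribution of each variable to each axis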

Inconsistent clustering with ComplexHeatmap?

So I'm trying to generate a heatmap for my data using Bioconductor's ComplexHeatmap package, but I get slightly different results depending on whether I make the dendrogram myself, or tell Heatmap to make it.
Packages:
require(ComplexHeatmap)
require(dendextend)
Data:
b <- matrix(rnorm(400, mean = 1), nrow = 80, ncol = 5)
If I make the dendrogram myself:
d <- dist(b, method = "euclidean")  # dist() already returns a "dist" object
h <- hclust(d, method = "ward.D")
dend <- as.dendrogram(h)
Heatmap(b,
        cluster_columns = FALSE,
        cluster_rows = dend)
Versus having Heatmap do the clustering:
Heatmap(b,
        cluster_columns = FALSE,
        clustering_distance_rows = "euclidean",
        clustering_method_rows = "ward.D")
They tend to look very similar, but they'll be very slightly different.
And this matters a lot for my data. Heatmap's clustering ends up organizing my data way, way better; however, I also want to extract the list of clustered items via something like cutree(), and I don't think I can extract that from Heatmap's clustering.
Does anyone know what's going on?
The dendrograms are the same; the only thing that changes is the ordering. You can verify this as follows:
hmap1 <- Heatmap(b,
                 cluster_columns = FALSE,
                 cluster_rows = dend)
hmap2 <- Heatmap(b,
                 cluster_columns = FALSE,
                 clustering_distance_rows = "euclidean",
                 clustering_method_rows = "ward.D")
# reorder both row dendrograms using the same weights:
rowdend1 <- reorder(row_dend(hmap1)[[1]], 1:80)
rowdend2 <- reorder(row_dend(hmap2)[[1]], 1:80)
# check that they are identical:
identical(rowdend1, rowdend2)
## [1] TRUE
The ComplexHeatmap::Heatmap() function also has a row_dend_reorder argument; it defaults to TRUE when Heatmap does the clustering itself but to FALSE when you supply a dendrogram, which is why the two orderings differ. That is the setting you should check.
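To extract cluster memberships from Heatmap's own clustering, as asked above, one option is to pull the row dendrogram back out and cut it with dendextend's cutree(); a minimal sketch, assuming the hmap2 object from above:
library(dendextend)
# the row dendrogram Heatmap computed (a list with one entry per row slice)
rd <- row_dend(hmap2)[[1]]
clusters <- dendextend::cutree(rd, k = 4)  # k = 4 is an arbitrary example
table(clusters)
Alternatively, passing row_dend_reorder = TRUE together with your own dendrogram should reproduce the ordering Heatmap computes internally.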

Run points() after plot() on a dataframe

I'm new to R and want to plot specific points over an existing plot. I'm using the swiss data frame, which I visualize through the plot(swiss) function.
After this, I want to add outliers given by the Mahalanobis distance:
mu_hat <- apply(swiss, 2, mean)
sigma_hat <- cov(swiss)
mahalanobis_distance <- mahalanobis(swiss, mu_hat, sigma_hat)
outliers <- swiss[mahalanobis_distance > 10, ]
points(outliers, pch = 'x', col = 'red')
but this last line has no effect: the outlier points aren't added to the previous plot. I see that if I repeat this procedure on a pair of variables, say
plot(swiss[2:3])
points(outliers[2:3], pch = 'x', col = 'red')
the red points are added to the plot.
My question: is there a restriction on how the points() function can be used with a multivariate data frame?
Here's a solution using GGally::ggpairs. It's a little ugly, as we need to wrap the ggally_points function to specify the desired color scheme.
I've assumed that mu_hat = colMeans(swiss) and sigma_hat = cov(swiss).
library(dplyr)
library(GGally)
swiss %>%
  bind_cols(distance = mahalanobis(swiss, colMeans(swiss), cov(swiss))) %>%
  mutate(is_outlier = ifelse(distance > 10, "yes", "no")) %>%
  ggpairs(columns = 1:6,
          mapping = aes(color = is_outlier),
          upper = list(continuous = function(data, mapping, ...) {
            ggally_points(data = data, mapping = mapping) +
              scale_colour_manual(values = c("black", "red"))
          }),
          lower = list(continuous = function(data, mapping, ...) {
            ggally_points(data = data, mapping = mapping) +
              scale_colour_manual(values = c("black", "red"))
          }),
          axisLabels = "internal")
Unfortunately this isn't possible the way you're currently doing things. When plotting a data frame, R produces many plots and aligns them: what you're actually seeing is 6 × 6 = 36 individual plots arranged to look like one.
When you then call points(), it places the points on the current plot, which doesn't really make sense when there are 36 of them, at least not the way you want it to.
ggplot2 is a really powerful tool in R, and it provides far greater flexibility. For example, you could set up the data frame to include your outliers, labelled as "outlier", and show them in each panel via facets. The more you explore it, the more you may find plots that better suit your needs.
Plotting a data frame in base R is a good exploratory tool. You could set up those outliers as a separate data frame and plot it, so you can see each of the 6 × 6 plots side by side and compare. It all depends on your goal. If your goal is to produce exactly what you've described, the ggplot2 package will help you create something more professional. As @Gregor suggested in the comments, the ggpairs function from the GGally package is a good place to start.
A quick google image search turns up some funky plots akin to what you're after, and then some.
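If you'd rather stay in base R, pairs() accepts per-point col and pch vectors, which gives the overlay effect directly; a minimal sketch using the distance threshold of 10 from the question:
# flag outliers by Mahalanobis distance and mark them in every panel
md <- mahalanobis(swiss, colMeans(swiss), cov(swiss))
is_out <- md > 10
pairs(swiss,
      col = ifelse(is_out, "red", "black"),
      pch = ifelse(is_out, 4, 1))  # pch 4 is the 'x' symbol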

R - FactoMineR MCA: how to select important features?

My dataset is a mixture of numeric and categorical values, the outcome is a class label, there are around 400 columns, and the dataset contains missing values. There are several questions on my mind. First:
How do I deal with missing values? I replaced all missing values with -1; is that okay?
How do I apply MCA factor analysis to this data? Should I combine train and test and then apply MCA?
How do I interpret the output of the MCA analysis to get the most relevant features?
Do not touch your dataset: if you use the FactoMineR package, it handles missing values itself.
You can try something like this:
library(FactoMineR)
library(factoextra)
df <- data.frame(df)  # dataset with only categorical variables
res.mca <- MCA(df, graph = FALSE)  # use quali.sup = <indices> to mark supplementary variables
# Visualize the principal components (eigenvalues)
fviz_eig(res.mca,
         addlabels = TRUE)
# Individual plot
fviz_mca_ind(res.mca,
             col.ind = "cos2",
             axes = c(1, 2),  # the default axes
             repel = TRUE)
# Variable contributions on axis 1
fviz_contrib(res.mca,
             choice = "var",
             axes = 1,  # you can switch to the other axes
             top = 10)
# Best variable contributions
fviz_mca_var(res.mca, col.var = "contrib",
             axes = c(1, 2),
             repel = TRUE)
Interpretation works much like PCA:
Eigenvalue plot: the % of information carried by each principal component
Individual & variable plots: bring out correlations between variables, and outliers
Contribution plots: the % contribution of each variable on each axis
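To pull the most relevant features out programmatically rather than reading them off the plots, the contribution matrix is stored on the result object; a minimal sketch, assuming the res.mca object from above:
# contribution (in %) of each category to each dimension
contrib <- res.mca$var$contrib
# top 10 contributors to the first axis
head(sort(contrib[, 1], decreasing = TRUE), 10)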

How to get a good dendrogram using R

I am using R to do a hierarchical cluster analysis using Ward's method with squared Euclidean distance. I have a matrix of x columns (stations) and y rows (floating-point numbers); the first row contains the header (the stations' names). I want a dendrogram where the names of the stations appear at the bottom of the tree, because I'm currently unable to interpret my result. My aim is to find the stations that are similar to each other. However, with the following code I get numbers (100, 101, 102, ...) on the lower branches instead.
Yu <- read.table("yu_s.txt", header = TRUE, dec = ",")
library(cluster)
agn1 <- agnes(Yu, metric = "euclidean", method = "ward", stand = TRUE)
hcd <- as.dendrogram(agn1)
par(mfrow = c(3, 1))
plot(hcd, main = "Main")
plot(cut(hcd, h = 25)$upper,
     main = "Upper tree of cut at h = 25")
plot(cut(hcd, h = 25)$lower[[2]],
     main = "Second branch of lower tree with cut at h = 25")
A nice collection of examples is available here: http://gastonsanchez.com/blog/how-to/2012/10/03/Dendrograms.html
Two methods:
With hclust from base R:
hc <- hclust(dist(mtcars), method = "ward.D2")  # "ward" was renamed in hclust; use "ward.D" or "ward.D2"
plot(hc)
(base R's default dendrogram plot)
With ggplot2 and ggdendro:
library(ggplot2)
library(ggdendro)
# basic option
ggdendrogram(hc, rotate = TRUE, size = 4, theme_dendro = FALSE)
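The numbers on the lower branches in the question are row indices; to get station names as leaf labels, the stations have to be the observations being clustered. Since the stations are the columns here, one option is to cluster the transpose; a minimal sketch, assuming yu_s.txt is laid out as described in the question:
library(cluster)
Yu <- read.table("yu_s.txt", header = TRUE, dec = ",")
# t(Yu) makes the stations the rows, so their names become the leaf labels
agn <- agnes(t(Yu), metric = "euclidean", method = "ward", stand = TRUE)
plot(as.dendrogram(agn), main = "Stations")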
