R - FactoMineR MCA: How to select important features?

My dataset is a mixture of numeric and categorical values, and the outcome is a class label. There are around 400 columns, and the dataset contains missing values. I have several questions:
How should I deal with missing values? I replaced all missing values with -1; is that okay?
How do I apply MCA to this data? Should I combine the train and test sets and then apply MCA?
How do I interpret the output of MCA to find the most relevant features?

Do not touch your dataset.
If you use the FactoMineR package, it handles missing values itself (by default, MCA treats NA as an extra category of the variable; see the na.method argument).
Try something like this:
library(FactoMineR)
library(factoextra)
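df <- data.frame(df) # Dataset with only categorical variables (factors)
# quali.sup must be passed by name; set it to the column indices of any
# supplementary categorical variables (e.g. the class label), or leave it NULL
res.mca <- MCA(df, quali.sup = NULL)
# Scree plot: percentage of variance explained by each dimension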
fviz_eig(res.mca,
addlabels = TRUE)
# Individual plot
fviz_mca_ind(res.mca,
col.ind = "cos2",
axes = c(1,2), # axes by default
repel = TRUE)
# Contributions of the variable categories to axis 1
fviz_contrib(res.mca,
choice = "var",
axes = 1, # you can switch with the other axes
top = 10)
# Best variable contribution
fviz_mca_var(res.mca, col.var = "contrib",
axes = c(1,2),
repel = TRUE)
Interpretation is similar to PCA.
Scree plot (fviz_eig): shows the percentage of variance explained by each dimension
Individual and variable plots: bring out associations between variables, and outliers
Contribution: shows the percentage contribution of each variable category to each axis
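To turn the plots above into a ranked list of variables, one option is to read the numbers directly from the MCA result. The following is a minimal sketch, not a definitive feature-selection method: it assumes the tea dataset that ships with FactoMineR as a stand-in for your own data, and summing over the first two dimensions is an arbitrary choice.

```r
library(FactoMineR)

# Stand-in data (assumption): the first 18 columns of FactoMineR's tea dataset
# are categorical; replace this with your own data frame of factors.
data(tea)
df <- tea[, 1:18]

res.mca <- MCA(df, ncp = 5, graph = FALSE)

# eta2: squared correlation ratio between each variable and each dimension
eta2 <- res.mca$var$eta2

# Rank variables by their summed eta2 over the first two dimensions
ranking <- sort(rowSums(eta2[, 1:2]), decreasing = TRUE)
head(ranking, 10)

# Category-level contributions (in %) to dimension 1, largest first
head(sort(res.mca$var$contrib[, 1], decreasing = TRUE), 10)
```

Variables at the top of ranking are the ones most strongly tied to the first two dimensions; whether they are also the most useful for predicting the class label is a separate question.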

Related

Visualizing PCA with large number of variables in R using ggbiplot

I am trying to visualize a PCA that includes 87 variables.
prc <-prcomp(df[,1:87], center = TRUE, scale. = TRUE)
ggbiplot(prc, labels = rownames(df[,1:87]), var.axes = TRUE)
When I create the biplot, many of the vectors overlap with each other, making it impossible to read the labels. I was wondering if there is any way to only show some of the labels at a time. For example, I think it'd be useful if I could create a few separate biplots with each one showing only a subset of the labels on the vectors.
This question seems closely related, but I don't know if it translates to the latest version of ggbiplot. I'm also not sure how to modify the original functions.
A potential solution is to use the factoextra package to visualize your PCA results. The fviz_pca_biplot() function includes a repel argument; when repel = TRUE, the plot labels are spread out to minimize overlap. The documentation also describes a select.var argument: for example, select.var = list(contrib = 5) displays only the 5 variables with the highest contributions, and select.var = list(name = ...) takes a character vector of variable names, so you can show a specific subset of variables.
# read data
df <- mtcars[, c(1:7,10:11)]
# perform PCA
library("FactoMineR")
res.pca <- PCA(df, graph = FALSE)
# visualize
library(factoextra)
fviz_pca_biplot(res.pca, repel = TRUE, select.var = list(contrib = 5))
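For the "several biplots, each with a subset of labels" idea from the question, a sketch along the following lines might work. It uses the select.var = list(name = ...) option mentioned above; the group size of three is an arbitrary assumption.

```r
library(FactoMineR)
library(factoextra)

df <- mtcars[, c(1:7, 10:11)]
res.pca <- PCA(df, graph = FALSE)

# Split the variable names into groups of three (group size is arbitrary)
var_names <- colnames(df)
groups <- split(var_names, ceiling(seq_along(var_names) / 3))

# One biplot per group; only the named variables are drawn as vectors
plots <- lapply(groups, function(vars) {
  fviz_pca_biplot(res.pca, repel = TRUE, select.var = list(name = vars))
})

# Display the first biplot; print the others the same way
plots[[1]]
```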

Plotting R2 of each/certain PCA component per wavelength with R

I have some experience in using PCA, but this is the first time I am attempting to use PCA for spectral data...
I have a large dataset of spectra, and I used the prcomp command to compute a PCA on the whole dataset. My results show that 3 components explain 99% of the variance.
I would like to plot the contribution of each of the three principal components at every wavelength (200-1000 nm, in steps of 4), like the second plot in the example I found on this site:
https://learnche.org/pid/latent-variable-modelling/principal-component-analysis/pca-example-analysis-of-spectral-data
Does anyone have code showing how I could do this in R?
Thank you
I believe the matrix of variable loadings is found in model.pca$rotation, see prcomp documentation.
So something like this should do (using the example on your linked website):
file <- 'http://openmv.net/file/tablet-spectra.csv'
spectra <- read.csv(file, header = FALSE)
n.comp <- 4
model.pca <- prcomp(spectra[,2:651],
center = TRUE,
scale. = TRUE,
rank. = n.comp)
summary(model.pca)
par(mfrow=c(n.comp,1))
sapply(1:n.comp, function(comp){
plot(2:651, model.pca$rotation[,comp], type='l', lwd=2,
main=paste("Comp.", comp), xlab="Wavelength INDEX")
})
I don't have the wavelength values, so I used the array indices on the x-axis here.
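If the wavelength values are known, they can go on the x-axis instead of the indices. The sketch below assumes the grid described in the question (200-1000 nm in steps of 4) and uses simulated placeholder spectra, since the real data are not available; only the plotting pattern is meant to carry over.

```r
# Wavelength grid from the question: 200-1000 nm in steps of 4 (201 values)
wavelength <- seq(200, 1000, by = 4)

# Simulated placeholder spectra: 50 samples x 201 wavelengths (not real data)
set.seed(1)
spectra <- matrix(rnorm(50 * length(wavelength)), nrow = 50)

# Keep the first three components, as in the question
model.pca <- prcomp(spectra, center = TRUE, scale. = TRUE, rank. = 3)

# One line per component: loadings on the y-axis, wavelength on the x-axis
matplot(wavelength, model.pca$rotation, type = "l", lty = 1, lwd = 2,
        col = 1:3, xlab = "Wavelength (nm)", ylab = "Loading")
legend("topright", legend = colnames(model.pca$rotation),
       col = 1:3, lty = 1, lwd = 2)
```

Overlaying the components in one panel with matplot() makes it easier to compare them at a given wavelength; the stacked panels from the answer above work equally well.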

HCPC r function - difference between cluster data and cluster visualisation

I'm using the FactoMineR package and its HCPC function to segment some observations. I then used plot.HCPC() and noticed a discrepancy between two display options of this function, which should illustrate the same results:
library(FactoMineR)
data(USArrests)
pca <- PCA(USArrests, ncp = 3, graph = FALSE)
hcpc <- HCPC(pca, graph = FALSE)
If I use choice = 'map', Arkansas is in the green cluster, but if I use choice = 'tree', Arkansas is in the red cluster! (The other states of the green cluster stay in the green cluster from map to dendrogram/tree.)
plot(hcpc, choice = 'map')
plot(hcpc, choice = 'tree')
According to the numeric results (hcpc$data.clust), there are 8 observations in the cluster3 (green cluster), which matches the 'map' visualisation (but not the dendrogram/tree visualisation).
Do you know if I did something wrong, if I missed something important?
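In the HCPC function, one of the first arguments is consol, which defaults to TRUE:
consol: a boolean. If TRUE, a k-means consolidation is performed
(consolidation cannot be performed if kk is used and equals a number).
Consolidation can move individuals between clusters after the tree has been cut, so the map and hcpc$data.clust show the consolidated clusters, whereas the tree shows the partition before consolidation. Turning consolidation off should make the two plots agree: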
library(FactoMineR)
data(USArrests)
pca <- PCA(USArrests, ncp = 3, graph = FALSE)
hcpc <- HCPC(pca, consol = FALSE, graph = FALSE)
plot(hcpc, choice = 'map')
plot(hcpc, choice = 'tree')
Hope it will help you
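To see which observations the consolidation step actually moves, the two partitions can be compared directly. This is a rough sketch under a few assumptions: nb.clust = -1 is used so that the tree is cut automatically, the row names of data.clust are taken to be the state names, and both runs are assumed to number the clusters the same way.

```r
library(FactoMineR)

data(USArrests)
pca <- PCA(USArrests, ncp = 3, graph = FALSE)

# nb.clust = -1 cuts the tree automatically at the suggested level
hcpc_consol <- HCPC(pca, nb.clust = -1, consol = TRUE, graph = FALSE)
hcpc_tree <- HCPC(pca, nb.clust = -1, consol = FALSE, graph = FALSE)

# Align both results on the state names before comparing
states <- rownames(USArrests)
clust_consol <- hcpc_consol$data.clust[states, "clust"]
clust_tree <- hcpc_tree$data.clust[states, "clust"]

# Cross-tabulation: off-diagonal counts are states moved by consolidation
table(tree = clust_tree, consolidated = clust_consol)

# Names of the states whose cluster label differs between the two runs
states[as.character(clust_tree) != as.character(clust_consol)]
```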

How to draw line around significant values in R's corrplot package

I have been asked to produce a correlation plot for a collaborator.
My choice is to use R for the task, specifically the corrplot package.
I have been searching the internet and found multiple ways to obtain such graphics, but not the specific one I was asked for: as you can see in the picture, the significant values are highlighted by drawing a square around each significant tile.
Example of the correlation plot required
The closest result I have achieved uses the code below, but I cannot find an option to draw a line around the significant tiles (if one exists).
# Insignificant correlations are left blank
corrplot(res3$r, type="upper", order="hclust",
p.mat = res3$P, sig.level = 0.01, insig = "blank")
I tried adding the "addrect" parameter but it didn't work.
# Same plot with addrect added (insignificant correlations are still left blank)
corrplot(res3$r, type="upper", order="hclust", p.mat = res3$P,
addrect=2, sig.level = 0.01, insig = "blank")
Any help will be appreciated.
corrplot allows you to add new plots to an already existing one. Therefore, once you've created the plot of the initial correlation matrix, you can simply add those cells that you want to highlight in an iterative manner using corrplot(..., add = TRUE).
The only thing required to achieve your goal is an index vector (which I called 'ids') to tell R which cells to highlight. Note that, for simplicity, I took a random sample of cells from the correlation matrix, but something like ids <- which(p.value < 0.01) (assuming that you've stored your significance levels in a separate vector) would work similarly.
library(corrplot)
## create and visualize correlation matrix
data(mtcars)
M <- cor(mtcars)
corrplot(M, cl.pos = "n", na.label = " ")
## select cells to highlight (e.g., statistically significant values)
set.seed(10)
ids <- sample(1:length(M), 15L)
## duplicate correlation matrix and reject all irrelevant values
N <- M
N[-ids] <- NA
## add significant cells to the initial corrplot iteratively
for (i in ids) {
O <- N
O[-i] <- NA
corrplot(O, cl.pos = "n", na.label = " ", addgrid.col = "black", add = TRUE,
bg = "transparent", tl.col = "transparent")
}
Note that you could also add all values to highlight in one go (i.e., without requiring a for loop) using corrplot(N, ...), but in that case, an undesirable black margin is drawn all around the plotting area.
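To highlight genuinely significant cells rather than a random sample, ids could be built from a matrix of p-values. The sketch below is one possible way, assuming corrplot's cor.mtest() as the source of the p-values and the 0.01 threshold from the question; the resulting ids can replace the sampled one in the loop above.

```r
library(corrplot)

data(mtcars)
M <- cor(mtcars)

# cor.mtest() returns a list; $p is the matrix of p-values
p <- cor.mtest(mtcars, conf.level = 0.95)$p

# Linear indices of significant off-diagonal cells (0.01 threshold from the question)
ids <- which(p < 0.01 & row(p) != col(p))

# Number of cells that would be highlighted
length(ids)
```

The diagonal is excluded because every variable is trivially correlated with itself.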

How to get a good dendrogram using R

I am using R to do a hierarchical cluster analysis using Ward's method with squared Euclidean distance. I have a matrix of x columns (stations) and y rows (floating-point numbers); the first row contains the header (the stations' names). I want a readable dendrogram in which the station names appear at the bottom of the tree, because at the moment I cannot interpret my result. My aim is to find the stations that are similar. However, with the following code I get numbers (100, 101, 102, ...) on the lower branches.
Yu<-read.table("yu_s.txt",header = T, dec=",")
library(cluster)
agn1 <- agnes(Yu, metric = "euclidean", method="ward", stand = TRUE)
hcd<-as.dendrogram(agn1)
par(mfrow=c(3,1))
plot(hcd, main="Main")
plot(cut(hcd, h=25)$upper,
main="Upper tree of cut at h=25")
plot(cut(hcd, h=25)$lower[[2]],
main="Second branch of lower tree with cut at h=25")
A nice collection of examples is available here (http://gastonsanchez.com/blog/how-to/2012/10/03/Dendrograms.html)
Two methods:
with hclust from base R
hc <- hclust(dist(mtcars), method = "ward.D") # "ward" was renamed "ward.D" in R 3.1.0
plot(hc)
with ggplot and ggdendro
library(ggplot2)
library(ggdendro)
# basic option
ggdendrogram(hc, rotate = TRUE, size = 4, theme_dendro = FALSE)
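The numbered leaves in the question probably arise because the rows (observations) are being clustered, while the stations are the columns. A possible fix is to transpose the data so that each station becomes a row. The sketch below uses simulated placeholder data and hypothetical station names, and swaps agnes for base R's hclust with method = "ward.D2" (Ward's criterion on squared Euclidean distances).

```r
# Placeholder data (assumption): 30 observations (rows) for 6 stations (columns)
set.seed(42)
Yu <- as.data.frame(matrix(rnorm(30 * 6), ncol = 6))
names(Yu) <- paste0("Station_", LETTERS[1:6])  # hypothetical station names

# Standardize each station, then transpose so that stations are the rows
stations <- t(scale(Yu))

# Ward clustering of the stations; the row names become the leaf labels
hc <- hclust(dist(stations), method = "ward.D2")

# hang = -1 aligns all labels at the bottom of the tree
plot(hc, hang = -1, main = "Stations", xlab = "", sub = "")
```

The same transposition should carry over to the original call, for example agnes(t(Yu), ...), provided the column names of Yu are the station names.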
