HCPC in FactoMineR: How to count individuals in Clusters?

The title says it all. I performed a multiple correspondence analysis (MCA) in FactoMineR with Factoshiny and ran an HCPC afterwards. I now have 3 clusters on my 2 dimensions. While the Factoshiny interface really helps visualize and navigate the analysis, I can't find a way to count the individuals in my clusters. Additionally, I would love to assign the cluster memberships to the individuals in my dataset. Those operations are easily performed with hclust, but its algorithm doesn't work on categorical data.
##dummy dataset
x <- as.factor(c(1,1,2,1,3,4,3,2,1))
y <- as.factor(c(2,3,1,4,4,2,1,1,1))
z <- as.factor(c(1,2,1,1,3,4,2,1,1))
data <- data.frame(x,y,z)
# used packages
library(FactoMineR)
library(Factoshiny)
# the function used to open factoshiny in your browser
res.MCA <- Factoshiny(data)
# factoshiny code:
# res.MCA<-MCA(data,graph=FALSE)
# hcpc code in factoshiny
res.MCA<-MCA(data,ncp=8,graph=FALSE)
res.HCPC<-HCPC(res.MCA,nb.clust=3,consol=FALSE,graph=FALSE)
plot.HCPC(res.HCPC,choice='tree',title='Hierarchical tree')
plot.HCPC(res.HCPC,choice='map',draw.tree=FALSE,title='Factor map')
plot.HCPC(res.HCPC,choice='3D.map',ind.names=FALSE,centers.plot=FALSE,angle=60,title='Hierarchical tree on the factor map')
I now want a variable data$cluster with 3 levels so that I can count the individuals in the clusters.

To anyone encountering a similar problem, this helped:
res.HCPC$data.clust # returns all values and cluster membership for every individual
res.HCPC$data.clust[1,]$clust # for the first individual
table(res.HCPC$data.clust$clust) # gives table of frequencies per cluster
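To additionally get the cluster membership as a variable in the original data frame (a sketch; it assumes the rows of data.clust come back in the same order as data, which holds when HCPC is run on the MCA of the full dataset as above):
data$cluster <- res.HCPC$data.clust$clust # factor with one level per cluster
table(data$cluster) # counts the individuals in each cluster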

Related

How to merge unsupervised hierarchical clustering result with the original data

I carried out an unsupervised hierarchical cluster analysis in R. My data are numbers in 3 columns and around 120,000 rows. I managed to use cutree and identified 6 clusters. Now I need to return these clusters to the original data, i.e. add another column indicating the cluster group (1 of 6). How can I do that?
# Ward's method
hc5 <- hclust(d, method = "ward.D2" )
# Cut tree into 6 groups
sub_grp <- cutree(hc5, k = 6)
# Number of members in each cluster
table(sub_grp)
I need this because my data have spatial links, so I would like to map the clusters back to their locations on a map.
I appreciate your help.
The variable sub_grp is simply a vector of cluster assignments, so you can add it to the data frame as a new column:
data(iris) # Data frame available in base R.
str(iris)
d <- dist(iris[, -5]) # Column 5 is the species name so we drop it
hc5 <- hclust(d, method="ward.D2")
sub_grp <- cutree(hc5, k=3)
str(sub_grp)
iris$grp <- sub_grp
str(iris)
aggregate(iris[, 1:4], by=list(iris$grp), mean)
xtabs(~grp+Species, iris)
The last two commands compute the means by groups for the 4 numeric variables and cross-tabulate the cluster assignments with the known species. You don't actually need to add the cluster assignment to the data frame. R lets you combine variables from different objects as long as they have the same number of rows.

ANOSIM with cutree groupings

What I would like to do is an ANOSIM of defined groupings in some assemblage data, to see whether the groupings are significantly different from one another, in a similar fashion to this example code:
data(dune)
data(dune.env)
dune.dist <- vegdist(dune)
attach(dune.env)
dune.ano <- anosim(dune.dist, Management)
summary(dune.ano)
However, in my own data I have the species abundances in a Bray-Curtis distance matrix, and after creating hclust() dendrograms I define my own groupings visually by looking at the dendrogram and setting a height. Through cutree() I can then get these groupings, which can be superimposed on MDS plots etc., but I would like to check the significance of the similarity between the groupings I have created - i.e. are the groupings significantly different, or just arbitrary?
e.g.
data("dune")
dune.dist <- vegdist(dune)
clua <- hclust(dune.dist, "average")
plot(clua)
rect.hclust(clua, h =0.65)
c1 <- cutree(clua, h=0.65)
I then want to use the categories defined by c1 as the groupings (which in the example code above was the Management factor) and test via anosim() whether they are actually different.
I am pretty sure this is just a matter of my inept coding... any advice would be appreciated.
cutree returns groups as integers; you must change these to factors if you want to use them in anosim. Try anosim(vegdist(dune), factor(c1)). You had better consult a local statistician before using anosim to analyse dissimilarities with clusters created from those very same dissimilarities.
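Putting that together with the clustering code from the question, a minimal sketch (assuming the vegan package and the 0.65 cut height used above):
library(vegan)
data(dune)
dune.dist <- vegdist(dune) # Bray-Curtis dissimilarities
clua <- hclust(dune.dist, "average")
c1 <- cutree(clua, h = 0.65) # integer group codes
dune.ano <- anosim(dune.dist, factor(c1)) # convert the integer codes to a factor for anosim
summary(dune.ano)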

Discrepancy in results when using k-means and plotting the distance matrix. Why?

I am clustering some data in RStudio and am having a problem reconciling the results of a k-means cluster analysis with a plotted hierarchical clustering. When I use the function kmeans, I get 4 groups with 10, 20, 30 and 6 observations. Nevertheless, when I plot the dendrogram, I get 4 groups but with different numbers of observations: 23, 26, 10 and 7.
Have you ever found a problem like this?
Here is my code:
mydata<-scale(mydata0)
# K-Means Cluster Analysis
fit <- kmeans(mydata, 4) # 4 cluster solution
# get cluster means
aggregate(mydata,by=list(fit$cluster),FUN=mean)
# append cluster assignment
mydatafinal <- data.frame(mydata, fit$cluster)
fit$size
[1] 10 20 30 6
# Ward Hierarchical Clustering
d <- dist(mydata, method = "euclidean") # distance matrix
fit2 <- hclust(d, method="ward.D2")
plot(fit2,cex=0.4) # display dendrogram
groups <- cutree(fit2, k=4) # cut tree into 4 clusters
# draw dendrogram with red borders around the 4 clusters
rect.hclust(fit2, k=4, border="red")
Results of k-means and hierarchical clustering need not be the same in every scenario.
Just to give an example, every time you run k-means the initial choice of centroids is different, and so the results can differ.
This is not surprising. K-means clustering is initialised at random and can give distinct answers. Typically one tends to do several runs and then aggregate the results to check which are the 'core' clusters.
Hierarchical clustering is, in contrast, purely deterministic, as there is no randomness involved. But like k-means, it is a heuristic: a set of rules is followed to create clusters, with no regard to any underlying objective function (for example the intra- and inter-cluster variance versus the overall variance). The rule by which observations and existing clusters are merged (the "ward.D2" linkage you pass as method in the hclust command) is crucial in determining the size of the resulting clusters.
Having a properly defined objective function to optimise should give you a unique answer (or a set thereof), but the problem is NP-hard because of the sheer number of possible partitions, which grows with the number of observations. This is why only heuristics exist, and also why any clustering procedure should be seen not as a tool giving definitive answers but as an exploratory one.
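As a quick sanity check, you can cross-tabulate the two partitions to see how they overlap. A sketch, assuming the objects from the question (mydata) are still in the workspace; the nstart value is only an illustration:
set.seed(1)
fit <- kmeans(mydata, centers = 4, nstart = 25) # several random starts stabilise k-means
groups <- cutree(hclust(dist(mydata), method = "ward.D2"), k = 4)
table(kmeans = fit$cluster, hclust = groups) # how the two labellings overlap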

Clustering and Heatmap on microarray data using R

I have a file with the results of a microarray expression experiment. The first column holds the gene names. The next 15 columns are 7 samples from the post-mortem brain of people with Down's syndrome, and 8 from people not having Down's syndrome. The data are normalized. I would like to know which genes are differentially expressed between the groups.
There are two groups and the data is nearly normally distributed, so a t-test has been performed for each gene. The p-values were added in another column at the end. Afterwards, I did a correction for multiple testing.
I need to cluster the data to see if the differentially expressed genes (FDR<0.05) can discriminate between the groups.
Also, I would like to visualize the clustering using a heatmap, with gene names on the rows and some meaningful names for the samples (columns).
I have written this code for the moment:
ds <- read.table("down_syndroms.txt", header=T)
names(ds) <- c("Gene",paste0("Down",1:7),paste0("Control",1:8), "pvalues")
pvadj <- p.adjust(ds$pvalues, method = "BH")
# How many genes do we get with FDR <= 0.05?
sum(pvadj<=0.05)
[1] 5641
# Cluster the data
ds_matrix <- as.matrix(ds[, 2:17]) # columns 2:16 are the samples; column 17 is the p-value column
ds_dist_matrix<-dist(ds_matrix)
my_clustering<-hclust(ds_dist_matrix)
# Heatmap
library(gplots)
hm <- heatmap.2(ds_matrix, trace='none', margins=c(12,12))
The heatmap I have done doesn't look the way I would like, and I think I should remove the p-values from it. Besides, R usually crashes when I try to plot the clustering (probably due to the size of the data file, with more than 22 thousand genes).
How could I make a better-looking tree (clustering) and heatmap?
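One possible approach (a sketch, not a definitive recipe: the sample columns 2:16 and the 0.05 cut-off are assumptions based on the description above) is to restrict the heatmap to the significant genes and leave out the p-value column:
sig <- which(pvadj <= 0.05) # differentially expressed genes
expr <- as.matrix(ds[sig, 2:16]) # sample columns only, no p-values
rownames(expr) <- ds$Gene[sig]
library(gplots)
heatmap.2(expr, trace = "none", scale = "row", margins = c(8, 12)) # row-scaling makes expression patterns comparable
Clustering and plotting a few thousand significant genes is usually feasible, whereas drawing all 22,000 genes at once is what tends to exhaust memory.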

k-means clustering-- why all same clusters?

I am running k-means clustering on a set of text data with 10,842 tweets. I set k to 5 and got the clusters below:
cluster1:booking flight NA
cluster2:flight booking NA
cluster3:flight booking NA
cluster4:flight booking NA
cluster5:booking flight NA
I do not understand why all the clusters are the same. Here is my code:
library(tm)
myCorpus<-Corpus(VectorSource(myCorpus$text))
myCorpusCopy<-myCorpus
myCorpus<-tm_map(myCorpus,stemDocument)
myCorpus<-tm_map(myCorpus,stemCompletion,dictionary=myCorpusCopy)
myTdm<-TermDocumentMatrix(myCorpus,control=list(wordLengths=c(1,Inf)))
myTdm2<-removeSparseTerms(myTdm,sparse=0.95)
m2<-as.matrix(myTdm2)
m3<-t(m2)
set.seed(122)
k<-5
kmeansResult<-kmeans(m3,k)
round(kmeansResult$centers,digits=3)
for(i in 1:k){
  cat(paste("cluster",i,":",sep=""))
  s<-sort(kmeansResult$centers[i,],decreasing=T)
  cat(names(s)[1:3],"\n")
}
Keep in mind that k-means clustering requires you to specify the number of clusters in advance (in contrast to, say, hierarchical clustering). Without having access to your data set (and thus being unable to reproduce what you've presented here), the most obvious reason that you're obtaining seemingly homogeneous clusters is that there's a problem with the number of clusters you're specifying beforehand.
The most immediate solution is to try out the NbClust package in R to determine the number of clusters appropriate for your data.
Here's a sample code using a toy data set to give you an idea of how to proceed:
# install.packages("NbClust")
library(NbClust)
set.seed(1234)
df <- rbind(matrix(rnorm(100,sd=0.1),ncol=2),
            matrix(rnorm(100,mean=1,sd=0.2),ncol=2),
            matrix(rnorm(100,mean=5,sd=0.1),ncol=2),
            matrix(rnorm(100,mean=7,sd=0.2),ncol=2))
# "scree" plots on appropriate number of clusters (you should look
# for a bend in the graph)
nc <- NbClust(df, min.nc=2, max.nc=20, method="kmeans")
table(nc$Best.n[1,])
# creating a bar chart to visualize results on appropriate number
# of clusters
barplot(table(nc$Best.n[1,]),
        xlab="Number of Clusters", ylab="Number of Criteria",
        main="Number of Clusters Chosen by Criteria")
If you still run into problems even after specifying the number of clusters suggested by the functions in the NbClust package, then another possible culprit is your removal of sparse terms. Try adjusting the "sparse" option downward and then examine the output from the k-means clustering.
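For instance, a sketch of that last suggestion, re-using the objects from the question (the 0.90 threshold is only an illustration):
myTdm2 <- removeSparseTerms(myTdm, sparse = 0.90) # stricter than the original 0.95
m3 <- t(as.matrix(myTdm2))
set.seed(122)
kmeansResult <- kmeans(m3, 5)
kmeansResult$size # cluster sizes after the change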
