I carried out an unsupervised hierarchical cluster analysis in R. My data are numeric, in 3 columns and around 120,000 rows. I used cutree() and identified 6 clusters. Now I need to return these clusters to the original data, i.e. add another column indicating the cluster group (1 of 6). How can I do that?
# Ward's method
hc5 <- hclust(d, method = "ward.D2" )
# Cut tree into 6 groups
sub_grp <- cutree(hc5, k = 6)
# Number of members in each cluster
table(sub_grp)
I need this because my data have spatial links, so I would like to map the clusters back to their locations on a map.
I appreciate your help.
The variable sub_grp is just a vector of cluster assignments so you can just add it to the data frame:
data(iris) # Data frame available in base R.
str(iris)
d <- dist(iris[, -5]) # Column 5 is the species name so we drop it
hc5 <- hclust(d, method="ward.D2")
sub_grp <- cutree(hc5, k=3)
str(sub_grp)
iris$grp <- sub_grp
str(iris)
aggregate(iris[, 1:4], by=list(iris$grp), FUN=mean)
xtabs(~grp+Species, iris)
The last two commands compute the means by groups for the 4 numeric variables and cross-tabulate the cluster assignments with the known species. You don't actually need to add the cluster assignment to the data frame. R lets you combine variables from different objects as long as they have the same number of rows.
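As a minimal sketch of that last point, the cluster vector can be used directly alongside the data frame, or attached to a separate table of locations. The `coords` data frame and its `x`/`y` columns are hypothetical stand-ins for whatever spatial table you have; the only requirement is that its rows are in the same order as the rows that went into `dist()`.

```r
# Cluster the iris measurements as in the answer above
d <- dist(iris[, -5])
hc5 <- hclust(d, method = "ward.D2")
sub_grp <- cutree(hc5, k = 3)

# Use the assignments without adding them to the data frame
xt <- table(cluster = sub_grp, species = iris$Species)
means <- aggregate(iris[, 1:4], by = list(cluster = sub_grp), FUN = mean)

# Hypothetical coordinates table (made-up values), one row per observation,
# in the same row order as the clustered data
set.seed(1)
coords <- data.frame(x = runif(nrow(iris)), y = runif(nrow(iris)))
coords$cluster <- factor(sub_grp)

head(coords)
```

Because `cutree()` returns one value per observation in the original row order, no merge or join is needed as long as the rows have not been reordered or filtered in between.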
I aim to produce a typology of sites through hierarchical clustering of species abundance data. To do so, I successively cut the dendrogram into 2, 3, 4 ... z groups.
The cluster group names are automatically attributed by the function cutree() representing numbers from 1 to z in a non-consistent manner. For instance, in a clustering with three groups, "group 2" may not correspond to "group 2" in a clustering with six groups. This makes interpreting the dendrogram very difficult.
The code below provides a reproducible example. It produces a hierarchical clustering of 50 observations and successively cuts the dendrogram in a for loop. The final output data frame 'cluster.grps' contains the cluster group affiliation for each observation at each successive cut (HC_2 = hierarchical clustering with 2 groups; HC_3 = hierarchical clustering with 3 groups; etc.).
set.seed(1)
data <- data.frame(replicate(10,sample(0:10,50,rep=TRUE))) # create random site x species dataframe
clust <- hclust(dist(data), method = "ward.D") # implement hierarchical clustering
# Set maximum number of groups
z <- 6
# Loop for successive tree cutting
lst <- list()
for (i in 2:z) {
  # Slicing the dendrogram
  grp <- cutree(clust, k = i) # k = number of groups
  lst[[i - 1]] <- grp
}
names(lst) <- 2:z # one name per element (z - 1 elements)
cluster.grps <- as.data.frame(lst)
colnames(cluster.grps) <- paste("HC",as.character(2:(z)),sep ="_")
I now wish to attribute dichotomous names that represent the level of hierarchy in the tree: 1, 2 for the first level; 1.1, 1.2, 2.1, 2.2 for the second level; 1.1.1, 1.1.2, 1.2.1, 1.2.2, etc. for the third level and so on.
Ideally, the table 'cluster.grps' would look like this:
Site    HC_2  HC_3  HC_4
Site 1  1     1.1   1.1
Site 2  2     2     2.1
Site 3  1     1.2   1.2
Site 4  2     2     2.2
My first thought was to code nested clusterings: start with a first clustering of all observations into two groups, then split each of those groups independently into two further groups, yielding four groups at the second hierarchical level. This requires quite long code, though, and I was wondering whether there might be a more elegant way.
Any thoughts?
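One possible approach, offered as a sketch rather than a definitive solution: because successive cuts of the same tree are nested, going from k - 1 to k groups splits exactly one existing group in two. The split group's label can be extended with ".1" and ".2" while all other groups keep their labels, which matches the table above. The names `cuts` and `labels` are introduced here for illustration, and the numbering of the two children (by order of first appearance in the data) is an arbitrary choice.

```r
set.seed(1)
data <- data.frame(replicate(10, sample(0:10, 50, replace = TRUE)))
clust <- hclust(dist(data), method = "ward.D")

z <- 6
cuts <- cutree(clust, k = 2:z) # matrix: one column per number of groups

labels <- matrix(NA_character_, nrow(cuts), ncol(cuts),
                 dimnames = list(NULL, paste0("HC_", 2:z)))
labels[, 1] <- as.character(cuts[, 1]) # first level: "1" and "2"

for (j in 2:ncol(cuts)) {
  prev <- labels[, j - 1]
  # The parent that was split now contains two distinct cluster ids
  n_children <- tapply(cuts[, j], prev, function(g) length(unique(g)))
  parent <- names(n_children)[n_children == 2]
  in_parent <- prev == parent
  child_ids <- unique(cuts[in_parent, j]) # children in order of first appearance
  new <- prev
  new[in_parent] <- paste(parent, match(cuts[in_parent, j], child_ids), sep = ".")
  labels[, j] <- new
}

cluster.grps <- as.data.frame(labels)
head(cluster.grps)
```

With this scheme the depth of a label reflects how many times that branch has been split so far, so a group that is never split keeps its short label across columns.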
The title says it all. I performed a multiple correspondence analysis (MCA) in FactoMineR with Factoshiny and ran an HCPC afterwards. I now have 3 clusters on my 2 dimensions. While the Factoshiny interface makes it easy to visualize and navigate the analysis, I can't find a way to count the individuals in my clusters. Additionally, I would love to assign the cluster variable to the individuals in my dataset. Those operations are easy with hclust, but its algorithms don't work on categorical data.
##dummy dataset
x <- as.factor(c(1,1,2,1,3,4,3,2,1))
y <- as.factor(c(2,3,1,4,4,2,1,1,1))
z <- as.factor(c(1,2,1,1,3,4,2,1,1))
data <- data.frame(x,y,z)
# used packages
library(FactoMineR)
library(Factoshiny)
# the function used to open factoshiny in your browser
res.MCA <- Factoshiny(data)
# factoshiny code:
# res.MCA<-MCA(data,graph=FALSE)
# hcpc code in factoshiny
res.MCA<-MCA(data,ncp=8,graph=FALSE)
res.HCPC<-HCPC(res.MCA,nb.clust=3,consol=FALSE,graph=FALSE)
plot.HCPC(res.HCPC,choice='tree',title='Hierarchical tree')
plot.HCPC(res.HCPC,choice='map',draw.tree=FALSE,title='Factor map')
plot.HCPC(res.HCPC,choice='3D.map',ind.names=FALSE,centers.plot=FALSE,angle=60,title='Hierarchical tree on the factor map')
I now want a variable data$cluster with 3 levels so that I can count the individuals in the clusters.
To anyone encountering a similar problem, this helped:
res.HCPC$data.clust # returns all values and cluster membership for every individual
res.HCPC$data.clust[1,]$clust # for the first individual
table(res.HCPC$data.clust$clust) # gives table of frequencies per cluster
I have a dataset where I've measured gene expression of 21 genes and also measured the output of 3 other assays. I have measured these for 8 different clones. I have also measured these on 5 different days.
However, I haven't measured every gene or assay on every day, or for every clone. So I have datasets of varying lengths. In order to easily combine them into one large dataset, to perform a PCA on them, I melted each dataset and then row bound them. I then standardized all the values. I now have a dataset that looks like the below.
What I want to do is a PCA in which each level of "group" (each gene or assay) is treated as a variable. Then I'd like to create graphs where the colors of the data points represent the different "clones" or "days". I've pasted my sad attempt to get that working below. Any help would be appreciated!
set.seed(1)
# Creates variables for a dataset
clone <- sample(c(rep(c("1A","2A","2B","3B","3C"), each=100),rep(c("1B","2C","3A"), each=200)))
day <- sample(c(rep(1,225),rep(2,25),rep(3,600),rep(4,25),rep(5,225)))
group <- sample(c(rep(paste0("gene",1:21), each=42),rep("assay1",90),rep("assay2",80),rep("assay3",48)))
value <- rnorm(1100, mean=0, sd=3)
# Create data frame from variables
df <- data.frame(clone,day,group,value)
df$day <- as.factor(df$day)
# Create PCA data
df_PCA <- prcomp(clone + day + group ~ value, data = df, scale = FALSE)
# Graphing results of PCA
par(mfrow=c(2,3))
plot(df_PCA$x[,1:2], col=clone)
plot(df_PCA$x[,1:2], col=day)
plot(df_PCA$x[,1:3], col=clone)
plot(df_PCA$x[,1:3], col=day)
plot(df_PCA$x[,2:3], col=clone)
plot(df_PCA$x[,2:3], col=day)
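A sketch of one way this could be set up, under several assumptions that are not in the question: `prcomp()` needs a numeric matrix with one row per sample and one column per variable, so the long data are reshaped so that each clone/day combination is a row and each gene or assay is a column. Replicates are averaged, and missing combinations are filled with the column mean, which is a crude imputation chosen only to make the example run. The names `wide`, `clone_of_row` and `day_of_row` are introduced here for illustration.

```r
set.seed(1)
clone <- sample(c(rep(c("1A","2A","2B","3B","3C"), each = 100),
                  rep(c("1B","2C","3A"), each = 200)))
day <- sample(c(rep(1, 225), rep(2, 25), rep(3, 600), rep(4, 25), rep(5, 225)))
group <- sample(c(rep(paste0("gene", 1:21), each = 42),
                  rep("assay1", 90), rep("assay2", 80), rep("assay3", 48)))
value <- rnorm(1100, mean = 0, sd = 3)
df <- data.frame(clone, day = factor(day), group, value)

# Long to wide: rows are clone_day samples, columns are genes/assays,
# cells are the mean of any replicate measurements (NA if never measured)
wide <- tapply(df$value,
               list(sample = paste(df$clone, df$day, sep = "_"),
                    group = df$group),
               mean)

# Assumption: replace missing cells with the column mean (crude imputation)
for (j in seq_len(ncol(wide))) {
  wide[is.na(wide[, j]), j] <- mean(wide[, j], na.rm = TRUE)
}

pca <- prcomp(wide, scale. = TRUE)

# Recover clone and day for each row, to use as colour groupings
clone_of_row <- factor(sub("_.*$", "", rownames(wide)))
day_of_row <- factor(sub("^.*_", "", rownames(wide)))
```

The scores can then be plotted with, for example, `plot(pca$x[, 1:2], col = as.integer(clone_of_row))`, since `col` expects numbers or colour names rather than arbitrary character labels. How to handle the unmeasured clone/day/gene combinations in the real data is a substantive choice that this sketch does not settle.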
I have a file with the results of a microarray expression experiment. The first column holds the gene names. The next 15 columns are 7 samples from the post-mortem brain of people with Down's syndrome, and 8 from people not having Down's syndrome. The data are normalized. I would like to know which genes are differentially expressed between the groups.
There are two groups and the data is nearly normally distributed, so a t-test has been performed for each gene. The p-values were added in another column at the end. Afterwards, I did a correction for multiple testing.
I need to cluster the data to see if the differentially expressed genes (FDR<0.05) can discriminate between the groups.
Also, I would like to visualize the clustering using a heatmap with gene names on the rows and some meaningful names on the samples (columns)
I have written this code for the moment:
ds <- read.table("down_syndroms.txt", header=TRUE)
names(ds) <- c("Gene",paste0("Down",1:7),paste0("Control",1:8), "pvalues")
pvadj <- p.adjust(ds$pvalues, method = "BH")
# How many genes do we get with an FDR <= 0.05?
sum(pvadj<=0.05)
[1] 5641
# Cluster the data
ds_matrix<-as.matrix(ds[,2:18])
ds_dist_matrix<-dist(ds_matrix)
my_clustering<-hclust(ds_dist_matrix)
# Heatmap
library(gplots)
hm <- heatmap.2(ds_matrix, trace='none', margins=c(12,12))
The heatmap I have done doesn't look the way I would like. Also, I think I should remove the pvalues from it. Besides, R usually crashes when I try to plot the clustering (probably due to the big size of the data file, with more than 22 thousand genes).
How could I do a better looking tree (clustering) and heatmap?
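A sketch of one common approach, using simulated data because the original file is not available: keep only the expression columns (no p-values), restrict the matrix to the genes that pass the FDR threshold, scale each gene to z-scores, and then cluster the samples. The simulated matrix `expr`, the shift of 3 added to the first 50 genes, and the choice of average linkage are all assumptions made for illustration.

```r
set.seed(42)
n_genes <- 500
expr <- matrix(rnorm(n_genes * 15), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)),
                               c(paste0("Down", 1:7), paste0("Control", 1:8))))
# Simulate differential expression in the first 50 genes
expr[1:50, 1:7] <- expr[1:50, 1:7] + 3

# Per-gene t-test and Benjamini-Hochberg correction, as in the question
pvals <- apply(expr, 1, function(x) t.test(x[1:7], x[8:15])$p.value)
padj <- p.adjust(pvals, method = "BH")

# Keep only significant genes; the p-values never enter the matrix
sig <- expr[padj <= 0.05, , drop = FALSE]

# Row-wise z-scores so that every gene contributes on the same scale
sig_scaled <- t(scale(t(sig)))

# Cluster the samples (columns) and cut into two groups
sample_hc <- hclust(dist(t(sig_scaled)), method = "average")
sample_grp <- cutree(sample_hc, k = 2)
table(cluster = sample_grp, group = sub("[0-9]+$", "", colnames(sig)))
```

The reduced matrix can then be passed to `heatmap.2(sig_scaled, trace = "none", scale = "none", margins = c(8, 8))`, which is far lighter than plotting all 22 thousand rows. If the significant set is still several thousand genes, gene labels will not be legible and it may help to plot only the top few hundred by adjusted p-value.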
I am using varclus from the Hmisc package in R. Are there ways to produce summary tables from varclus in R like those in SAS (e.g. Output 100.1.2 and Output 100.1.3)? Basically, I would like to have the information contained in the plot in tabular or matrix form. For example: which variables are in which clusters (the cluster structure in SAS), the proportion of variance they explain, etc.
# varclus example in R using mtcars data
library(Hmisc)
mtc <- mtcars[,2:8]
mtcn <- data.matrix(mtc)
clust <- varclus(mtcn)
clust
plot(clust)
#cut_tree <- cutree(varclus(mtcn)$hclust, k=5) # This would show group membership, but only after I choose a cut point, which is not what I am after
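As a partial sketch, the numbers behind the plot can be tabulated in base R. This assumes varclus's default settings (squared Spearman correlation as the similarity, complete linkage on 1 minus similarity) and rebuilds them with `cor()` and `hclust()` so the example runs without Hmisc; with Hmisc loaded, the corresponding pieces should be available as `clust$sim` and `clust$hclust`. It does not reproduce the SAS PROC VARCLUS tables of variance explained, which come from a different algorithm.

```r
mtcn <- data.matrix(mtcars[, 2:8])

# Similarity matrix: squared Spearman correlations between variables
sim <- cor(mtcn, method = "spearman")^2

# Hierarchical clustering of the variables on 1 - similarity
hc <- hclust(as.dist(1 - sim), method = "complete")

# Similarity level at which each merge in the tree happens
merges <- data.frame(step = seq_along(hc$height),
                     similarity = 1 - hc$height)

# Variable membership for several numbers of clusters at once
memb <- cutree(hc, k = 2:4)

round(sim, 2)
merges
memb
```

Passing a vector to `k` gives one membership column per cut, so the table shows how variables group at every level instead of at a single chosen cut point.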