R: Cluster analysis with hclust(). How to get the cluster representatives?

I am doing some cluster analysis with R. I am using the hclust() function and I would like to get, after I perform the cluster analysis, the cluster representative of each cluster.
I define a cluster representative as the instance that is closest to the centroid of its cluster.
So the steps are:
Finding the centroid of the clusters
Finding the cluster representatives
I have already asked a similar question but using K-means: https://stats.stackexchange.com/questions/251987/cluster-analysis-with-k-means-how-to-get-the-cluster-representatives
The problem, in this case, is that hclust doesn't give the centroids!
For example, assuming d is my distance matrix, what I have done so far is:
hclust.fit1 <- hclust(d, method="single")
groups1 <- cutree(hclust.fit1, k=3) # cut tree into 3 clusters
## getting centroids ##
mycentroid <- colMeans(CV)
clust.centroid = function(i, dat, groups1) {
  ind = (groups1 == i)
  colMeans(dat[ind,])
}
centroids <- sapply(unique(groups1), clust.centroid, data, groups1)
But now, I was trying to get the cluster representatives with this code (I got it in the other question I asked, for k-means):
index <- c()
for (i in 1:3){
  rowsum <- rowSums(abs(CV[which(centroids==i),1:3] - centroids[i,]))
  index[i] <- as.numeric(names(which.min(rowsum)))
}
And it gives the following error:
"Error in e2[[j]] : index out of the limit"
I would be grateful if any of you could give me a little help. Thanks.
-- (not) Working example of the code --
example_data.txt
A,B,C
10.761719,5.452188,7.575762
10.830457,5.158822,7.661588
10.75391,5.500170,7.740330
10.686719,5.286823,7.748297
10.864527,4.883244,7.628730
10.701415,5.345650,7.576218
10.820583,5.151544,7.707404
10.877528,4.786888,7.858234
10.712337,4.744053,7.796390
As for the code:
# Install R packages
#install.packages("fpc")
#install.packages("cluster")
#install.packages("rgl")
library(fpc)
library(cluster)
library(rgl)
CV <- read.csv("example_data.txt")
str(CV)
data <- scale(CV)
d <- dist(data,method = "euclidean")
hclust.fit1 <- hclust(d, method="single")
groups1 <- cutree(hclust.fit1, k=3) # cut tree into 3 clusters
mycentroid <- colMeans(CV)
clust.centroid = function(i, dat, groups1) {
  ind = (groups1 == i)
  colMeans(dat[ind,])
}
centroids <- sapply(unique(groups1), clust.centroid, CV, groups1)
index <- c()
for (i in 1:3){
  rowsum <- rowSums(abs(CV[which(centroids==i),1:3] - centroids[i,]))
  index[i] <- as.numeric(names(which.min(rowsum)))
}

Hierarchical clustering does not use (or compute) representatives.
In particular for single link (but it can also happen with other linkages), the "center" can lie in a different cluster: just consider non-convex, intertwined clusters such as two nested half-moons, where the mean of one cluster can easily be closer to points of the other.
Furthermore, the centroid (mean) is tied to Euclidean distance. With other distances, it may be a very bad representative.
So use with care!
Either way, hierarchical clustering does not define or compute a representative. You will have to do this yourself.
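For illustration, here is one way to do it yourself: a minimal sketch, assuming Euclidean distance on the scaled data (the data matrix and groups1 labels from the example above), that picks for each cluster the row closest to that cluster's centroid:
# one representative per cluster: the member closest to the cluster mean
representatives <- sapply(sort(unique(groups1)), function(g) {
  members <- which(groups1 == g)
  centroid <- colMeans(data[members, , drop = FALSE])
  # squared Euclidean distance of every member to the centroid
  d2 <- rowSums(sweep(data[members, , drop = FALSE], 2, centroid)^2)
  members[which.min(d2)] # row index of the representative
})
representatives # row indices into the original data
With a non-Euclidean distance, picking the medoid instead (the member minimizing the total distance to the other members of its cluster, e.g. via cluster::pam) is usually the safer choice.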

How to permute a network in igraph for R?

I'm trying to write a code for a Monte Carlo procedure in R. My goal is to estimate the significance of a metric calculated for a weighted, unipartite, undirected network formatted for the package igraph.
So far, I included the following steps in the code:
1. Create the weighted, unipartite, undirected network and calculate the observed Louvain modularity
nodes <- read.delim("nodes.txt")
links <- read.delim("links.txt")
anurosnet <- graph_from_data_frame(d=links, vertices=nodes, directed=F)
anurosnet
modularity1 = cluster_louvain(anurosnet)
modularity1$modularity #observed value
obs=modularity1$modularity
obs
real<-data.frame(obs)
real
2. Create the empty matrix to store the results
Nperm = 9 #I am starting with a low n, but intend to use at least 1000 permutations
randomized.modularity=matrix(nrow=length(obs),ncol=Nperm+1)
row.names(randomized.modularity)=names(obs)
randomized.modularity[,1]=obs
randomized.modularity
3. Permute the original network preserving its characteristics, calculate the Louvain modularity for all randomized networks, and compile the results in the matrix
i <- 1
while (i <= Nperm) {
  randomnet <- rewire(anurosnet, with=each_edge(0.5)) #rewire vertices with constant probability
  E(randomnet)$weight <- sample(E(anurosnet)$weight) #shuffle initial weights and assign them randomly to edges
  mod <- cluster_louvain(randomnet)
  linha = mod$modularity
  randomized.modularity[,i+1] = linha
  print(i)
  i = i + 1
}
randomized.modularity #Here the result is not as expected
4. Plot the observed value against the distribution of randomized values
niveis <- row.names(randomized.modularity)
for (k in niveis)
{
  if (any(is.na(randomized.modularity[k,])))
  {
    print(c(k, "metric has NA"))
  } else {
    nome.arq <- paste("modularity", k, ".png", sep="")
    png(filename = nome.arq, res = 300, height = 15, width = 21, unit = "cm")
    plot(density(randomized.modularity[k,]), main = "Observed vs. randomized")
    abline(v = obs[k], col = "red", lwd = 2)
    dev.off()
    print(k)
    nome.arq <- paste("Patefield_Null_mean_sd_", k, ".txt", sep="")
    write.table(cbind(mean(randomized.modularity[k,]), sd(randomized.modularity[k,])), file = nome.arq,
                sep = " ", row.names = TRUE, col.names = FALSE)
  }
}
5. Estimate the P-value (significance)
significance = matrix(nrow = nrow(randomized.modularity), ncol = 3)
row.names(significance) = row.names(randomized.modularity)
colnames(significance) = c("p (rand <= obs)", "p (rand >= obs)", "p (rand=obs)")
signif.sup = function(x) sum(x >= x[1]) / length(x) # proportion of values >= the observed (first) value
signif.inf = function(x) sum(x <= x[1]) / length(x) # proportion of values <= the observed (first) value
signif.two = function(x) ifelse(min(x)*2 > 1, 1, min(x)*2) # two-tailed p from the two one-tailed p's
significance[,1] = apply(randomized.modularity, 1, signif.inf)
significance[,2] = apply(randomized.modularity, 1, signif.sup)
significance[,3] = apply(significance[,-3], 1, signif.two)
significance
Something is going wrong in step 3. I expected the vector to be filled with 10 values, but for some reason it stops after a while.
The slot "mod$modularity" suddenly receives 2 values instead of 1.
The two TXT files mentioned in the beginning of the code can be downloaded from here:
https://1drv.ms/t/s!AmcVKrxj94WClv8yQyqyl4IWk5mNvQ
https://1drv.ms/t/s!AmcVKrxj94WClv8z_Pow5Tg2U7mjLw
Could you please help me?
Your error is due to a dimension mismatch between your randomized.modularity matrix and some of your randomized modularity results. In your example the matrix ends up being 1 x (Nperm+1), but cluster_louvain sometimes returns two modularity scores (one per level of the algorithm) during the permutations, so the assignment into a single matrix cell fails. To fix this, I simply store the results in a list. The rest of your analysis will need to be adjusted, since you now have a varying number of modularity scores per permutation.
library(igraph)
nodes <- read.delim("nodes.txt")
links <- read.delim("links.txt")
anurosnet <- graph_from_data_frame(d=links, vertices=nodes, directed=F)
anurosnet
modularity1 = cluster_louvain(anurosnet)
modularity1$modularity #observed value
obs <- modularity1$modularity
obs
real<-data.frame(obs)
real
Nperm = 100 #I am starting with a low n, but intend to use at least 1000 permutations
#randomized.modularity <- matrix(nrow=length(obs),ncol=Nperm+1)
#row.names(randomized.modularity) <- names(obs)
randomized.modularity <- list()
randomized.modularity[[1]] <- obs
randomized.modularity
for (i in 1:Nperm) {
  randomnet <- rewire(anurosnet, with=each_edge(0.5)) #rewire vertices with constant probability
  E(randomnet)$weight <- sample(E(anurosnet)$weight) #shuffle initial weights and assign them randomly to edges
  mod <- cluster_louvain(randomnet)
  linha = mod$modularity
  randomized.modularity <- c(randomized.modularity, list(linha))
}
randomized.modularity
Better way to write the loop
randomized.modularity <- lapply(seq_len(Nperm), function(x){
  randomnet <- rewire(anurosnet, with=each_edge(0.5)) #rewire vertices with constant probability
  E(randomnet)$weight <- sample(E(anurosnet)$weight) #shuffle initial weights and assign them randomly to edges
  return(cluster_louvain(randomnet)$modularity)
})
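To adapt step 5 to the list output, one option is to collapse each permutation to a single score first, for example the best (maximum) modularity across levels. A minimal sketch of a one-tailed permutation p-value under that assumption:
# collapse each permutation to its best modularity score across levels
rand.best <- sapply(randomized.modularity, max)
obs.best <- max(obs)
# one-tailed p-value, with the usual +1 correction for permutation tests
p.sup <- (sum(rand.best >= obs.best) + 1) / (length(rand.best) + 1)
p.sup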

Weights in cforest 'tree' add up to more than sample size

This is my first time asking a question here so please be kind... and I apologise if this isn't the correct place to ask this question (the line between programming in R and a stats question can be a bit blurred to me) - if so I will happily try stackexchange.
I am running cforest in R (party package) to predict a numerical response variable, and using the excellent 'get_cTree' workaround suggested by @Marco Sandri here: https://stackoverflow.com/a/34534978/9989544
to generate a tree to try to understand the rules it's using to split (that's of interest to me even though my main focus is variable importance).
I was expecting the weights, across all of the nodes, to sum to my total sample size, which is what happens if I run a single 'ctree'.
However, what actually happens when using Marco Sandri's get_cTree code is that multiple pairs of node weights each sum to my sample size, the remaining weights do not sum to my sample size at all, and the overall total of weights is more than my total sample size.
Is this a natural consequence of trying to get a tree out of a conditional forest - i.e. it isn't truly partitioning the data into individual nodes? - or is this something that could be resolved with programming?
Here is an example (get_cTree code from Marco Sandri). For the iris dataset, n=150. The sum of the weights for the nodes that I get for the cforest is 566, and it's 150 using ctree (party package).
library(party)
update_tree <- function(x, dt) {
  x <- update_weights(x, dt)
  if (!x$terminal) {
    x$left <- update_tree(x$left, dt)
    x$right <- update_tree(x$right, dt)
  }
  x
}
update_weights <- function(x, dt) {
  splt <- x$psplit
  spltClass <- attr(splt, "class")
  spltVarName <- splt$variableName
  spltVar <- dt[, spltVarName]
  spltVarLev <- levels(spltVar)
  if (!is.null(spltClass)) {
    if (spltClass == "nominalSplit") {
      attr(x$psplit$splitpoint, "levels") <- spltVarLev
      filt <- spltVar %in% spltVarLev[as.logical(x$psplit$splitpoint)]
    } else {
      filt <- (spltVar <= splt$splitpoint)
    }
    x$left$weights <- as.numeric(filt)
    x$right$weights <- as.numeric(!filt)
  }
  x
}
get_cTree <- function(cf, k=1) {
  dt <- cf@data@get("input")
  tr <- party:::prettytree(cf@ensemble[[k]], names(dt))
  tr_updated <- update_tree(tr, dt)
  new("BinaryTree", tree=tr_updated, data=cf@data, responses=cf@responses,
      cond_distr_response=cf@cond_distr_response, predict_response=cf@predict_response)
}
attach(iris)
SepalLength <- as.numeric(iris$Sepal.Length)
SepalWidth <- as.numeric(iris$Sepal.Width)
PetalLength <- as.numeric(iris$Petal.Length)
PetalWidth <- as.numeric(iris$Petal.Width)
Species <- as.factor(iris$Species)
mtry=ceiling(sqrt(4))
set.seed(1)
iris_cforest <- cforest(PetalLength~SepalLength+SepalWidth+PetalWidth+Species,controls=cforest_unbiased(ntree=1000,mtry=mtry))
iristree <- get_cTree(iris_cforest)
iristree
plot(iristree)
set.seed(1)
iris_ctree <- ctree(PetalLength~SepalLength+SepalWidth+PetalWidth+Species)
iris_ctree
plot(iris_ctree)
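For reference, totals like the 566 vs. 150 above can be computed by summing the weights over the terminal nodes of each tree. A hedged sketch (sum_leaf_weights is a hypothetical helper of mine, walking the $terminal/$left/$right node structure that update_tree above also uses):
# hypothetical helper: sum the weights stored at the terminal nodes of a party tree
sum_leaf_weights <- function(x) {
  if (x$terminal) return(sum(unlist(x$weights)))
  sum_leaf_weights(x$left) + sum_leaf_weights(x$right)
}
sum_leaf_weights(iristree@tree)   # cforest tree via get_cTree: more than n = 150
sum_leaf_weights(iris_ctree@tree) # single ctree: leaves partition the data, so n = 150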

How to predict cluster labeling using DBSCAN object and Gower distance matrix for new data in R

I'm having an issue predicting cluster labels for test data, based on a dbscan clustering model fitted on the training data.
I used gower distance matrix when creating the model:
> gowerdist_train <- daisy(analdata_train,
                           metric = "gower",
                           stand = FALSE,
                           type = list(asymm = c(5,6)))
Using this gowerdist matrix, the dbscan clustering model created was:
> sb <- dbscan(gowerdist_train, eps = .23, minPts = 50)
Then I try to use predict to label a test dataset using the above dbscan object:
> predict(sb, newdata = analdata_test, data = analdata_train)
But I receive the following error:
Error in frNN(rbind(data, newdata), eps = object$eps, sort = TRUE, ...) :
  x has to be a numeric matrix
I can guess where this error comes from: no Gower distance matrix has been created for the test data.
My question is: should I create a Gower distance matrix for all data (analdata_train + analdata_test) and feed it into predict? How else would the algorithm know the distances of the test data from the training data, in order to label them?
In that case, would the newdata parameter be the new Gower distance matrix that contains ALL (train + test) data? And would the data parameter in predict be the training distance matrix, gowerdist_train?
What I am not quite sure about is how the predict algorithm would distinguish between the test and train data in the newly created gowerdist_all matrix.
The two matrices (the new gowerdist for all data and gowerdist_train) would obviously not have the same dimensions. Also, it doesn't make sense to me to create a Gower distance matrix only for the test data, because the distances must be measured relative to the training data, not among the test points themselves.
Edit:
I tried using gower distance matrix for all data (train + test) as my new data and received an error when fed to predict:
> gowerdist_all <- daisy(rbind(analdata_train, analdata_test),
                         metric = "gower",
                         stand = FALSE,
                         type = list(asymm = c(5,6)))
> test_sb_label <- predict(sb, newdata = gowerdist_all, data = gowerdist_train)
ERROR: Error in 1:nrow(data) : argument of length 0
In addition: Warning message:
In rbind(data, newdata) : number of columns of result is not a multiple of vector length (arg 1)
So, my suggested solution doesn't work.
I decided to write code that uses the kNN algorithm from the dbscan package to predict cluster labels using the Gower distance matrix. The code is not very pretty and definitely not programmatically efficient, but it works. I'm happy to receive any suggestions that would improve it.
The pseudocode is:
1) Calculate a new Gower distance matrix for all data, including test and train.
2) Use the above distance matrix in the kNN function (dbscan package) to determine the k nearest neighbours of each test data point.
3) Determine the cluster labels of all those nearest points for each test point. Some of them will have no cluster label because they are test points themselves.
4) Create a count matrix to count the frequency of clusters among the k nearest points of each test point.
5) Use a very simple likelihood calculation to choose the cluster for each test point based on its neighbours' clusters (the maximum frequency). This part also considers the neighbouring test points: the cluster for a test point is chosen only when the maximum frequency remains the largest after adding the number of neighbouring test points to the other clusters. Otherwise, the cluster for that test point is left undecided until the next iteration, when hopefully more of its neighbouring test points have had their cluster labels decided based on their neighbours.
6) Repeat the above (steps 2-5) until all clusters are decided.
** Note: this algorithm doesn't always converge (once you do the math, it's obvious why), so in the code I break out of the algorithm when the number of unclustered test points doesn't change after a while. Then I repeat steps 2-6 with a new k (change the number of nearest neighbours and run the code again). This ensures more points are involved in each decision in the next round. I've tried both larger and smaller k's and both work; it would be good to know which one is better. So far I haven't had to run the code more than twice to decide the clusters for the test data points.
Here is the code:
#calculate gower distance for all data (test + train)
gowerdist_test <- daisy(all_data[rangeofdataforgowerdist],
                        metric = "gower",
                        stand = FALSE,
                        type = list(asymm = listofasymmvars),
                        weights = Weights)
summary(gowerdist_test)
Then use the code below to label clusters for test data.
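For clarity, the loop below assumes that all_data stacks the test rows first and the training rows after them (the same row order used for gowerdist_test above), and that its Cluster column holds the dbscan labels for the training rows and NA for the test rows. A minimal setup sketch of that precondition (assuming sb is the dbscan fit on the training data; this would run before the daisy call above):
# hypothetical setup: test rows first, train rows after, matching gowerdist_test
all_data <- rbind(analdata_test, analdata_train)
all_data$Cluster <- NA # test rows start unlabelled
all_data$Cluster[(nrow(analdata_test) + 1):nrow(all_data)] <- sb$cluster
sb_train <- sb # the loop below refers to the training fit as sb_train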
#library(dbscan)
# find the k nearest neighbours for each point and order them by distance
iteration_MAX <- 50
iteration_current <- 0
maxUnclusterRepeatNum <- 10
repeatedUnclustNum <- 0
unclusteredNum <- sum(is.na(all_data$Cluster))
previousUnclustereNum <- sum(is.na(all_data$Cluster))
nn_k = 30 #number of neighbours
while (anyNA(all_data$Cluster) & iteration_current < iteration_MAX)
{
  if (repeatedUnclustNum >= maxUnclusterRepeatNum) {
    print(paste("Max number of repetitions (", maxUnclusterRepeatNum, ") for the same unclustered data has been reached. Clustering terminated unsuccessfully."))
    invisible(gc())
    break;
  }
  nn_test <- kNN(gowerdist_test, k = nn_k, sort = TRUE)
  # for the TEST points in all data, find the closest TRAIN points and decide statistically which cluster they could belong to, based on the clusters of the nearest TRAIN points
  test_matrix <- nn_test$id[1:nrow(analdata_test),] #create matrix of test data knn ids
  numClusts <- nlevels(as.factor(sb_train$cluster))
  NameClusts <- as.character(levels(as.factor(sb_train$cluster)))
  count_clusters <- matrix(0, nrow = nrow(analdata_test), ncol = numClusts + 1) #create a count matrix that counts number of clusters + NA
  colnames(count_clusters) <- c("NA", NameClusts) #name each column of the count matrix with a cluster number
  # get the cluster number of each of the k nearest neighbours of each test point
  for (i in 1:nrow(analdata_test))
    for (j in 1:nn_k)
    {
      test_matrix[i,j] <- all_data[nn_test$id[i,j], "Cluster"]
    }
  # populate the count matrix with the cluster totals of the neighbours of each test point
  for (i in 1:nrow(analdata_test))
    for (j in 1:nn_k)
    {
      if (!is.na(test_matrix[i,j]))
        count_clusters[i, c(as.character(test_matrix[i,j]))] <- count_clusters[i, c(as.character(test_matrix[i,j]))] + 1
      else
        count_clusters[i, c("NA")] <- count_clusters[i, c("NA")] + 1
    }
  # add NA's (TEST points) to the other clusters for comparison
  count_clusters_withNA <- count_clusters
  for (i in 2:ncol(count_clusters))
  {
    count_clusters_withNA[,i] <- t(rowSums(count_clusters[,c(1,i)]))
  }
  # this block decides the maximum cluster count for each row, considering the number of other test points (NA clusters) in the neighbourhood
  max_col_countclusters <- apply(count_clusters, 1, which.max) #get the column that corresponds to the maximum value of each row
  for (i in 1:length(max_col_countclusters)) #insert the maximum value of each row in its associated column in count_clusters_withNA
    count_clusters_withNA[i, max_col_countclusters[i]] <- count_clusters[i, max_col_countclusters[i]]
  max_col_countclusters_withNA <- apply(count_clusters_withNA, 1, which.max) #get the column that corresponds to the maximum value of each row with NA added
  compareCountClust <- max_col_countclusters_withNA == max_col_countclusters #compare the two count matrices
  all_data$Cluster[1:nrow(analdata_test)] <- ifelse(compareCountClust, NameClusts[max_col_countclusters - 1], all_data$Cluster) #subtract one because of the additional NA column
  iteration_current <- iteration_current + 1
  unclusteredNum <- sum(is.na(all_data$Cluster))
  if (previousUnclustereNum == unclusteredNum)
    repeatedUnclustNum <- repeatedUnclustNum + 1
  else {
    repeatedUnclustNum <- 0
    previousUnclustereNum <- unclusteredNum
  }
  print(paste("Iteration: ", iteration_current, " - Number of remaining unclustered:", sum(is.na(all_data$Cluster))))
  if (unclusteredNum == 0)
    print("Cluster labeling successfully completed.")
  invisible(gc())
}
I guess you can use this with any other type of clustering algorithm: it doesn't matter how you decided the cluster labels for the training data, as long as they are in your all_data before running the code.
Hope this helps.
Not the most efficient or rigorous code, so I'm happy to see suggestions on how to improve it.
*Note: I used t-SNE to compare the clustering of the training data with the test data and it looks impressively clean, so it seems to be working.

Output multiple vectors from for loop in R

As someone relatively new to R I'm having an issue with creating a for loop.
I have a very large data set with 9000 observations and 25 categorical variables, which I've transformed into binary data and performed hierarchical clustering on. Now I want to try k-modes clustering to produce an elbow plot using the "within-cluster simple-matching distance for each cluster", output by kmodes$withindiff. I can sum this for each k in 1:8 clusters to get the elbow plot.
library(klaR)
WCSM <- numeric(8) # preallocate the vector of summed within-cluster distances
for (k in 1:8)
{
  WCSM[k] <- sum(kmodes(data, k, iter.max=100)$withindiff)
}
plot(1:8, WCSM, type="b", xlab="Number of Clusters",
     ylab="Within-Cluster Simple-Matching Distance Summed", main="K-modes Elbow Plot")
My issue is that I want further output from k-modes. For each k in 1:8 I would like to get the vector of integers indicating the cluster to which each object is allocated, given by kmodes$cluster. I need a for loop that runs through each k in 1:8 and saves each of the outputs into 8 separate vectors, but I don't know how to write such a loop. I could just run the 8 lines of code below separately, but they each take 15 minutes with iter.max=10, so increasing this to iter.max=100 would need to be left running overnight, which is why a loop would be useful.
cl.kmodes2=kmodes(data, 2,iter.max=100)
cl.kmodes3=kmodes(data, 3,iter.max=100)
cl.kmodes4=kmodes(data, 4,iter.max=100)
cl.kmodes5=kmodes(data, 5,iter.max=100)
cl.kmodes6=kmodes(data, 6,iter.max=100)
cl.kmodes7=kmodes(data, 7,iter.max=100)
cl.kmodes8=kmodes(data, 8,iter.max=100)
Ultimately I want to compare the results from the hierarchical binary clustering to the k-modes clustering by getting the Adjusted Rand Index. For example, cutting the tree at k=4 for the hierarchical cluster and comparing this to a 4 cluster solution from k-modes:
dist.binary = dist(data, method="binary")
cl.binary = hclust(dist.binary, method="complete")
hcl.4 = cutree(cl.binary, k = 4)
tab = table(hcl.4, cl.kmodes4$cluster)
library(e1071)
classAgreement(tab)
I agree with Imo, using a list is the best solution.
If you don't want to do that, you could also use assign() to create a new object in every iteration:
library(klaR)
for (k in 1:8) {
  assign(paste("cl.kmodes", k, sep = ""), kmodes(data, k, iter.max = 100))
}
The best method is to put the output from your clusters into a named list:
library(klaR)
myClusterList <- list()
for (k in 1:8) {
  myClusterList[[paste0("k.", k)]] <- kmodes(data, k, iter.max = 100)
}
You can then pull out any of the contents easily:
sum(myClusterList[["k.1"]]$withindiff)
or
sum(myClusterList[[1]]$withindiff)
You can also save the list to use in future R sessions, see ?save.
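With everything in one list, the elbow plot and the Adjusted Rand Index comparison can also be built directly from it. A minimal sketch (assuming myClusterList as above and hcl.4 from the hierarchical clustering earlier):
# elbow plot from the stored k-modes fits
WCSM <- sapply(myClusterList, function(m) sum(m$withindiff))
plot(1:8, WCSM, type = "b", xlab = "Number of Clusters",
     ylab = "Within-Cluster Simple-Matching Distance Summed")
# adjusted Rand index between the hierarchical 4-cluster solution and k-modes with k = 4
library(e1071)
tab <- table(hcl.4, myClusterList[["k.4"]]$cluster)
classAgreement(tab)$crand # crand is the (corrected) adjusted Rand index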

R kmeans (stats) vs Kmeans (amap)

Hello stackoverflow community,
I'm running kmeans (stats package) and Kmeans (amap package) on the Iris dataset. In both cases, I use the same algorithm (Lloyd–Forgy), the same distance (euclidean), the same number of initial random sets (50), the same maximal number of iterations (1000), and I test for the same set of k values (from 2 to 15). I also use the same seed for both cases (4358).
I don't understand why, under these conditions, I'm getting different wss curves; in particular, the "elbow" from the stats package is much less accentuated than the one from the amap package.
Could you please help me to understand why? Thanks much!
Here the code:
# data load and scaling
newiris <- iris
newiris$Species <- NULL
newiris <- scale(newiris)
# using kmeans (stats)
wss1 <- (nrow(newiris)-1)*sum(apply(newiris,2,var))
for (i in 2:15) {
  set.seed(4358)
  wss1[i] <- sum(kmeans(newiris, centers=i, iter.max=1000, nstart=50,
                        algorithm="Lloyd")$withinss)
}
# using Kmeans (amap)
library(amap)
wss2 <- (nrow(newiris)-1)*sum(apply(newiris,2,var))
for (i in 2:15) {
  set.seed(4358)
  wss2[i] <- sum(Kmeans(newiris, centers=i, iter.max=1000, nstart=50,
                        method="euclidean")$withinss)
}
# plots
plot(1:15, wss1, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares", main="kmeans (stats package)")
plot(1:15, wss2, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares", main="Kmeans (amap package)")
EDIT:
I've emailed the author of the amap package and will post the reply when/if I get any.
https://cran.r-project.org/web/packages/amap/index.html
In the amap package, the author changed the code so that the withinss value is the within-cluster sum computed with the chosen method (e.g. Euclidean distance) rather than the sum of squares.
One way to solve this, given the object returned by the Kmeans function (amap), is to recalculate the value of withinss as the error sum of squares (SSE).
Here is my suggestion:
# using Kmeans (amap)
library(amap)
wss2 <- (nrow(newiris)-1)*sum(apply(newiris,2,var))
for (i in 2:15) {
  set.seed(4358)
  ans.Kmeans <- Kmeans(newiris, centers=i, iter.max=1000, nstart=50, method="euclidean")
  wss <- vector(mode = "numeric", length = i)
  for (j in 1:i) {
    km = as.matrix(newiris[which(ans.Kmeans$cluster %in% j),])
    ## average = as.matrix(t(apply(km, 2, mean)))
    ## wss[j] = sum(apply(km, 1, function(x) sum((x - average)^2)))
    ## or
    wss[j] <- (nrow(km)-1) * sum(apply(km, 2, var))
  }
  wss2[i] = sum(wss)
}
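The same recomputation can be written more compactly. A sketch with the same logic as the loop above (recompute_wss is a helper of mine, not from amap; it uses the identity SSE = (n-1) * sum of column variances per cluster):
# per-cluster SSE summed over all clusters
recompute_wss <- function(x, cluster) {
  sum(sapply(unique(cluster), function(g) {
    km <- x[cluster == g, , drop = FALSE]
    (nrow(km) - 1) * sum(apply(km, 2, var))
  }))
}
recompute_wss(newiris, ans.Kmeans$cluster) # equals sum(wss) from the loop above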
Note: the pearson method in this package is wrong (be careful!) as of version 0.8-14; see line 325 of the code at this link:
https://github.com/cran/amap/blob/master/src/distance_T.inl
