an error in R code written for finding clusters - r

I am relatively new to R. I just try to find out optimum number of clusters for iris data using the following methods:
library(datasets)
head(iris)
# method1:
wss <- (nrow(iris)-1)*sum(apply(iris,2,var))
for (i in 2:3) wss[i] <- sum(kmeans(iris, centers=i)$withinss)
plot(1:3, wss, type="b", xlab="Number of Clusters",ylab="Within groups sum of squares")
# method2:
library(fpc)
pamk.best <- pamk(iris)
cat("number of clusters estimated by optimum average silhouette width:", pamk.best$nc, "\n")
plot(pam(iris, pamk.best$nc))
Both methods throw up error. So please do someone shed light on it. Many Thanks in advance.

apply(iris,2,var)
gives you an error because the 4th column is not numeric.
Try
apply(iris[,1:4],2,var)
The same goes for the second method.
Error in pam(sdata, k, diss = diss, ...) :
x is not a numeric dataframe or matrix.

Related

How to permute a network in igraph for R?

I'm trying to write a code for a Monte Carlo procedure in R. My goal is to estimate the significance of a metric calculated for a weighted, unipartite, undirected network formatted for the package igraph.
So far, I included the following steps in the code:
1. Create the weighted, unipartite, undirected network and calculate the observed Louvain modularity
nodes <- read.delim("nodes.txt")
links <- read.delim("links.txt")
anurosnet <- graph_from_data_frame(d=links, vertices=nodes, directed=F)
anurosnet
modularity1 = cluster_louvain(anurosnet)
modularity1$modularity #observed value
obs=modularity1$modularity
obs
real<-data.frame(obs)
real
2. Create the empty vector
Nperm = 9 #I am starting with a low n, but intend to use at least 1000 permutations
randomized.modularity=matrix(nrow=length(obs),ncol=Nperm+1)
row.names(randomized.modularity)=names(obs)
randomized.modularity[,1]=obs
randomized.modularity
3. Permute the original network preserving its characteristics, calculate the Louvain modularity for all randomized networks, and compile the results in the vector
i<-1
while(i<=Nperm){
randomnet <- rewire(anurosnet, with=each_edge(0.5)) #rewire vertices with constant probability
E(randomnet)$weight <- sample(E(anurosnet)$weight) #shuffle initial weights and assign them randomly to edges
mod<-(cluster_louvain(randomnet))
mod$modularity
linha = mod$modularity
randomized.modularity[,i+1]=linha
print(i)
i=i+1
}
randomized.modularity #Here the result is not as expected
4. Plot the observed value against the distribution of randomized values
niveis<-row.names(randomized.modularity)
for(k in niveis)
{
if(any(is.na(randomized.modularity[k,]) == TRUE))
{
print(c(k, "metrica tem NA"))
} else {
nome.arq<- paste("modularity",k,".png", sep="")
png(filename= nome.arq, res= 300, height= 15, width=21, unit="cm")
plot(density(randomized.modularity[k,]), main="Observed vs. randomized",)
abline(v=obs[k], col="red", lwd=2, xlab="")
dev.off()
print(k)
nome.arq<- paste("Patefield_Null_mean_sd_",k,".txt", sep="")
write.table(cbind(mean(randomized.modularity[k,]),sd(randomized.modularity[k,])), file=paste(nome.arq,sep=""),
sep=" ",row.names=TRUE,col.names=FALSE)
}
}
5. Estimate the P-value (significance)
significance=matrix(nrow=nrow(randomized.modularity),ncol=3)
row.names(significance)=row.names(randomized.modularity)
colnames(significance)=c("p (rand <= obs)", "p (rand >= obs)", "p (rand=obs)")
signif.sup=function(x) sum(x>=x[1])/length(x)
signif.inf=function(x) sum(x<=x[1])/length(x)
signif.two=function(x) ifelse(min(x)*2>1,1,min(x)*2)
significance[,1]=apply(randomized.modularity,1,signif.inf)
significance[,2]=apply(randomized.modularity,1,signif.sup)
significance[,3]=apply(significance[,-3],1,signif.two)
significance
Something is going wrong in step 3. I expected the vector to be filled with 10 values, but for some reason it stops after a while.
The slot "mod$modularity" suddenly receives 2 values instead of 1.
The two TXT files mentioned in the beginning of the code can be downloaded from here:
https://1drv.ms/t/s!AmcVKrxj94WClv8yQyqyl4IWk5mNvQ
https://1drv.ms/t/s!AmcVKrxj94WClv8z_Pow5Tg2U7mjLw
Could you please help me?
Your error is due to a mismatch in dimensions with your randomized.modularity matrix and some of your randomized modularity results. In your example your matrix end up being [1 x Nperm] however sometimes 2 modularity scores are returned during the permutations. To fix this I simply store the results in a list. The rest of your analysis will need to be adjusted since you have a mismatch of modularity scores.
library(igraph)
nodes <- read.delim("nodes.txt")
links <- read.delim("links.txt")
anurosnet <- graph_from_data_frame(d=links, vertices=nodes, directed=F)
anurosnet
modularity1 = cluster_louvain(anurosnet)
modularity1$modularity #observed value
obs <- modularity1$modularity
obs
real<-data.frame(obs)
real
Nperm = 100 #I am starting with a low n, but intend to use at least 1000 permutations
#randomized.modularity <- matrix(nrow=length(obs),ncol=Nperm+1)
#row.names(randomized.modularity) <- names(obs)
randomized.modularity <- list()
randomized.modularity[1] <- obs
randomized.modularity
for(i in 1:Nperm){
randomnet <- rewire(anurosnet, with=each_edge(0.5)) #rewire vertices with constant probability
E(randomnet)$weight <- sample(E(anurosnet)$weight) #shuffle initial weights and assign them randomly to edges
mod <- (cluster_louvain(randomnet))
mod$modularity
linha = mod$modularity
randomized.modularity <- c(randomized.modularity, list(linha))
}
randomized.modularity
Better way to write the loop
randomized.modularity <- lapply(seq_len(Nperm), function(x){
randomnet <- rewire(anurosnet, with=each_edge(0.5)) #rewire vertices with constant probability
E(randomnet)$weight <- sample(E(anurosnet)$weight) #shuffle initial weights and assign them randomly to edges
return(cluster_louvain(randomnet)$modularity)
})

EM clustering instead of Kmeans

I have the following script that I can use to find the best number of the cluster using kmeans. How to change the following script using the EM clustering technique rather than kmeans.
reproducible example:
ourdata<- scale(USArrests)
Appreciate!
wss <- (nrow(ourdata)-1)*sum(apply(ourdata,2,var))
for (i in 2:10) wss[i] <- sum(kmeans(ourdata,
centers=i)$withinss)
plot(1:10, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
The EMCluster package offers a variety of functions for running EM model-based clustering. An example of finding a solution with k = 3 clusters:
Update per OP's comment:
You can calculate the within sums of squares, along with other metrics of interest, using fpc::cluster.stats(). These can be extracted and plotted akin to your original post. As a reminder, "the elbow technique" as you described is an inaccurate description because the elbow technique is a general techinque and can and is used with any metric of choice. It is not only used for within sums of squares as in your original post.
library(EMCluster)
library(fpc)
ourdata<- scale(USArrests)
dist_fit <- dist(ourdata)
num_clusters <- 2:4
set.seed(1)
wss <- vapply(num_clusters, function(i_k) {
em_fit <- em.EM(ourdata, nclass = i_k, lab = NULL, EMC = .EMC,
stable.solution = TRUE, min.n = NULL, min.n.iter = 10)
cluster_stats_fit <- fpc::cluster.stats(dist_fit, em_fit$class)
cluster_stats_fit$within.cluster.ss
}, numeric(1))
plot(num_clusters, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")

R kmeans (stats) vs Kmeans (amap)

Hello stackoverflow community,
I'm running kmeans (stats package) and Kmeans (amap package) on the Iris dataset. In both cases, I use the same algorithm (Lloyd–Forgy), the same distance (euclidean), the same number of initial random sets (50), the same maximal number of iterations (1000), and I test for the same set of k values (from 2 to 15). I also use the same seed for both cases (4358).
I don't understand why under these conditions I'm getting different wss curves, in particular: the "elbow" using the stats package is much less accentuated than when using the amap package.
Could you please help me to understand why? Thanks much!
Here the code:
# data load and scaling
newiris <- iris
newiris$Species <- NULL
newiris <- scale(newiris)
# using kmeans (stats)
wss1 <- (nrow(newiris)-1)*sum(apply(newiris,2,var))
for (i in 2:15) {
set.seed(4358)
wss1[i] <- sum(kmeans(newiris, centers=i, iter.max=1000, nstart=50,
algorithm="Lloyd")$withinss)
}
# using Kmeans (amap)
library(amap)
wss2 <- (nrow(newiris)-1)*sum(apply(newiris,2,var))
for (i in 2:15) {
set.seed(4358)
wss2[i] <- sum(Kmeans(newiris, centers=i, iter.max=1000, nstart=50,
method="euclidean")$withinss)
}
# plots
plot(1:15, wss1, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares", main="kmeans (stats package)")
plot(1:15, wss2, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares", main="Kmeans (amap package)")
EDIT:
I've emailed the author of the amap package and will post the reply when/if I get any.
https://cran.r-project.org/web/packages/amap/index.html
The author of the amap package, changed the code and the value of withinss variable is the sum applied by method (eg. euclidean distance).
One way to solve this, given the return of Kmeans function (amap), recalculate the value of withinss ( Error Sum of Squares (SSE) ).
Here is my suggestion:
# using Kmeans (amap)
library(amap)
wss2 <- (nrow(newiris)-1)*sum(apply(newiris,2,var))
for (i in 2:15) {
set.seed(4358)
ans.Kmeans <- Kmeans(newiris, centers=i, iter.max=1000, nstart=50, method="euclidean")
wss <- vector(mode = "numeric", length=i)
for (j in 1:i) {
km = as.matrix(newiris[which(ans.Kmeans$cluster %in% j),])
## average = as.matrix( t(apply(km,2,mean) ))
## wss[j] = sum( apply(km, 1, function(x) sum((x-average) ^ 2 )))
## or
wss[j] <- ( nrow(km)-1) * sum(apply(km,2,var))
}
wss2[i] = sum(wss)
}
Note. The method for pearson in this package is wrong (be careful !) on version 0.8-14.
Line 325 according code in this link:
https://github.com/cran/amap/blob/master/src/distance_T.inl

How to find the optimal number of clusters?

I know this question has already been asked, but I am failing to implement a decent plot for the following code:
options(digits=1)
set.seed(2014)
mydata <- matrix(seq(1,360),nrow=10,ncol=36)
wss <- c()
for (i in 1:19) wss[i] <- sum(kmeans(x=mydata,centers=seq(1,360,length.out=20)[i])$withinss)
plot(1:9, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares")
It produces the following error
Error in sample.int(m, k) :
cannot take a sample larger than the population when 'replace = FALSE'
kmeans assumes that each row is your data is an observation. So if you have k rows in x, the results of $clusters will be of lenth k. Here your test data has 10 rows. Yet you are specifying centers=20 when i=2 There is no way that 10 observations can have 20 different clusters.
Just a little spark in the dark!
options(digits=1)
set.seed(2014)
mydata <- seq(from=1,to=365)
wss <- c()
for (i in 5:15){
wss[i-4] <- sum(kmeans(mydata,centers=floor(seq(from=1,to=365,length.out=i)[-i]))$withinss)
}
plot(1:15,wss,type="b",xlab="Number of Clusters",ylab="Within groups sum of squares")
Does that make sense? #jlhoward #jbaums

How do I weight variables with gower distance in r

I am new to R and am working on a data set including nominal, ordinal and metric data.
Therefore I am using the gower distance. In the next step I use this distance with hclust(x, method="complete") to create clusters based on this distance.
Now I want to know how I can put different weights on variables in the gower distance.
The documentation says:
daisy(x, metric = c("euclidean", "manhattan", "gower"), stand = FALSE, type = list(), weights = rep.int(1, p))
So there is a way, but I am unsure about the syntax (weights = ...).
The documentation of weights and rep.int, did not help.
I also didn't find any other helpful explanation.
I would be very glad, if some one can help out.
Not sure if this is what you are getting at, but...
Let's say you have 5 variables, e.g. 5 columns in your data frame or matrix. Then weights would be a vector of length=5 containing the weights for the corresponding columns.
The notation weights=rep.int(1,p) in the documentation simply means that the default value of weights is a vector of length p that has all 1's, eg. the weights are all equal to 1. Elsewhere in the documentation it explains that p is the number of columns.
Also, note that daisy(...) produces a dissimilarity matrix. This is what you use in hclust(...). So if x is a data frame or matrix with five columns for your variables, then:
d <- daisy(x, metric="gower", weights=c(1,2,3,4,5))
hc <- hclust(d, method="complete")
EDIT (Response to OP's comments)
The code below shows how the clustering depends on the weights.
clust.anal <- function(df,w,h) {
require(cluster)
d <- daisy(df, metric="gower", weights=w)
hc <- hclust(d, method="complete")
clust <- cutree(hc,h=h)
plot(hc, sub=paste("weights=",paste(wts,collapse=",")))
rect.hclust(hc,h=0.8,border="red")
}
df <- read.table("ExampleClusterData.csv", sep=";",header=T)
df[1] <- factor(df[[1]])
df[2] <- factor(df[[2]])
# weights increase with col number...
wts=c(1,2,3,4,5,6,7)
clust.anal(df,wts,h=0.8)
# weights decrease with col number...
wts=c(7,6,5,4,3,2,1)
clust.anal(df,wts,h=0.8)

Resources