Consistent K-Means Clustering Results - r

I have read k-means: Same clusters for every execution.
But it doesn't solve the problem I am having. I am sampling data that grows in size over time. I need to cluster the data using k-means, but the clusters differ from one sample to the next. The important thing to note is that my (t+1)-th sample always incorporates all of the observations from the t-th sample, so the data slowly gets bigger and bigger. What I need is a way to keep the clusters the same across samples. Is there a way around this other than using set.seed? I am open to any solution.

The best way I can think of to accomplish this is to cluster the initial data with k-means and then simply assign all additional data to the closest existing cluster (setting the random seed will not help the new clusters to nest within the original ones). As detailed in the answer to this question, the flexclust package makes this pretty easy:
# Split into "init" (used for initial clustering) and "later" (assigned later)
set.seed(100)
spl <- sample(nrow(iris), 0.5*nrow(iris))
init <- iris[spl,-5]
later <- iris[-spl,-5]
# Build the initial k-means clusters with "init"
library(flexclust)
(km <- kcca(init, k=3, kccaFamily("kmeans")))
# kcca object of family ‘kmeans’
#
# call:
# kcca(x = init, k = 3, family = kccaFamily("kmeans"))
#
# cluster sizes:
#
# 1 2 3
# 31 25 19
# Assign each element of "later" to the closest cluster
head(predict(km, newdata=later))
# 2 5 7 9 14 18
# 2 2 2 2 2 2
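Applied to the growing-sample setting in the question, the same idea might look like the sketch below, where old_sample and new_sample are hypothetical names for the t-th and (t+1)-th samples:
# assuming new_sample contains all rows of old_sample followed by the newly arrived rows
km <- kcca(old_sample, k=3, kccaFamily("kmeans"))      # fit once, on the first sample
added <- new_sample[-seq_len(nrow(old_sample)), ]      # only the newly arrived rows
all_cl <- c(clusters(km), predict(km, newdata=added))  # stable labels for the full sample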


How to perform k-means clustering in R

I am trying to explore a credit card fraud dataset to learn R and also k-means clustering, but I encountered an issue while getting the optimal number of clusters. Unfortunately, not much about that error, or even about how to perform k-means clustering in R, can be found on Google. I would like to know what the warning is about, and why the result only shows 1 cluster. Thanks in advance!
Code:
data = read.csv("creditcard.csv")
scaled_data <- scale(data)
wss <- (nrow(scaled_data)-1)*sum(apply(scaled_data,2,var))
for (i in 2:100) wss[i] <- sum(kmeans(scaled_data, centers=i)$withiness)
plot(1:100, wss, type='b', xlab="Clusters", ylab="WSS")
Warning:
Warning messages:
1: Quick-TRANSfer stage steps exceeded maximum (= 14240350)
2: did not converge in 10 iterations
3: Quick-TRANSfer stage steps exceeded maximum (= 14240350)
4: did not converge in 10 iterations
You have several issues with your code. Let's go through it using an example data set available in R, since you did not provide reproducible data:
data(iris)
scaled_iris <- scale(iris[, -5])
Since the data have been scaled, all of the variances are 1 so this is all you need to compute the total:
wss <- sum(colSums(scaled_iris^2))
wss
# [1] 596
Now for the clustering. I'll include the iter.max= argument that @mhovd mentions, with its default value (there is no separate convergence argument). If you get the warning, increase iter.max= to 15 or 20 or more. This does not guarantee that your results for any number of groups are optimal; to increase the chances of that, you should also use the nstart= argument with a value of 5 or more:
for (i in 2:100) wss[i] <- kmeans(scaled_iris, centers=i, iter.max=10)$tot.withinss
head(wss);tail(wss)
# [1] 596.00000 220.87929 138.88836 113.97017 104.98669 81.03783
# [1] 3.188483 2.688470 2.716485 2.535701 2.497792 2.116150
plot(wss, type='b', xlab="Clusters", ylab="WSS")
Note that you misspelled withinss, and you did not realize that kmeans returns their sum as tot.withinss. It is always good to read the manual page, ?kmeans. Also note that you do not need 1:100 in the plot call, since plot automatically supplies consecutive integers when you provide only one vector.
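If you do see the convergence warnings, a variant of the loop with a higher iteration cap and multiple random starts usually removes them; a sketch:
# best of 5 random starts per k, with a larger iteration cap
for (i in 2:100) wss[i] <- kmeans(scaled_iris, centers=i, iter.max=20, nstart=5)$tot.withinss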

Clustering with Mclust results in an empty cluster

I am trying to cluster my empirical data using Mclust. When using the following, very simple code:
library(reshape2)
library(mclust)
data <- read.csv(file.choose(), header=TRUE, check.names = FALSE)
data_melt <- melt(data, value.name = "value", na.rm=TRUE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
summary(fit, parameters = TRUE)
R gives me the following result:
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust E (univariate, equal variance) model with 4 components:
log-likelihood n df BIC ICL
-20504.71 3258 8 -41074.13 -44326.69
Clustering table:
1 2 3 4
0 2271 896 91
Mixing probabilities:
1 2 3 4
0.2807685 0.4342499 0.2544305 0.0305511
Means:
1 2 3 4
1381.391 1381.715 1574.335 1851.667
Variances:
1 2 3 4
7466.189 7466.189 7466.189 7466.189
Edit: here is my data for download: https://www.file-upload.net/download-14320392/example.csv.html
I do not readily understand why Mclust gives me an empty cluster (0), especially one with a mean nearly identical to that of the second cluster. This only appears when specifically looking for a univariate, equal-variance model; using, for example, modelNames="V", or leaving the default, does not produce this problem.
This thread: Cluster contains no observations has a similar problem, but if I understand correctly, that appeared to be due to randomly generated data?
I am somewhat clueless as to where my problem is or whether I am missing anything obvious.
Any help is appreciated!
As you noted, the means of clusters 1 and 2 are extremely similar, and it so happens that there's quite a lot of data there (see the spike on the histogram):
set.seed(111)
library(mclust)
data <- read.csv("example.csv", header=TRUE, check.names = FALSE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
hist(data$value,br=50)
abline(v=fit$parameters$mean,
col=c("#FF000080","#0000FF80","#BEBEBE80","#BEBEBE80"),lty=8)
Briefly, mclust fits a Gaussian mixture model (GMM): a probabilistic model that estimates the mean and variance of each cluster, along with the probability of each point belonging to each cluster. This is unlike k-means, which gives a hard assignment. The likelihood of the model is built from the probabilities of each data point under each cluster; you can check this out in mclust's publication.
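As a quick sanity check on that description, the model's log-likelihood can be recomputed by hand from the fitted parameters; a sketch for the univariate equal-variance fit above:
# mixture density of each point: sum over components of pro_k * N(x; mean_k, sd)
p <- fit$parameters
dens <- rowSums(sapply(seq_along(p$pro), function(k)
  p$pro[k] * dnorm(data$value, p$mean[k], sqrt(p$variance$sigmasq))))
sum(log(dens))  # should match fit$loglik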
In this model the means of cluster 1 and cluster 2 are close, but their expected proportions are different:
fit$parameters$pro
[1] 0.28565736 0.42933294 0.25445342 0.03055627
This means that if you have a data point around the means of 1 or 2, it will consistently be assigned to cluster 2. For example, let's predict data points from 1350 to 1400:
head(predict(fit,1350:1400)$z)
1 2 3 4
[1,] 0.3947392 0.5923461 0.01291472 2.161694e-09
[2,] 0.3945941 0.5921579 0.01324800 2.301397e-09
[3,] 0.3944456 0.5919646 0.01358975 2.450108e-09
[4,] 0.3942937 0.5917661 0.01394020 2.608404e-09
[5,] 0.3941382 0.5915623 0.01429955 2.776902e-09
[6,] 0.3939790 0.5913529 0.01466803 2.956257e-09
The $classification is obtained by taking the column with the maximum probability. So, in the same example, everything is assigned to cluster 2:
head(predict(fit,1350:1400)$classification)
[1] 2 2 2 2 2 2
To answer your question: no, you did not do anything wrong; it's a quirk, at least with this implementation of GMM. I would say it's a bit of overfitting, but you can basically keep only the clusters that actually have members.
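If you want to see at a glance which fitted components actually receive observations, you can tabulate the hard classifications over all G components:
# component 1 receives no observations here
table(factor(fit$classification, levels = seq_len(fit$G)))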
If you use model="V", i see the solution is equally problematic:
fitv <- Mclust(Data$value, modelNames="V", G = 1:7)
plot(fitv,what="classification")
Using scikit-learn's GMM I don't see a similar issue. So if you need a Gaussian mixture with spherical (equal-variance) components, consider a fuzzy k-means instead:
library(ClusterR)
# fit_kmeans was not defined in the original post; assuming fuzzy k-means with 3 clusters
fit_kmeans <- KMeans_rcpp(as.matrix(data$value), clusters=3, fuzzy=TRUE)
plot(NULL, xlim=range(data$value), ylim=c(0,4), ylab="cluster", yaxt="n", xlab="values")
points(data$value, fit_kmeans$clusters, pch=19, cex=0.1, col=factor(fit_kmeans$clusters))
axis(2, 1:3, as.character(1:3))
If you don't need equal variance, you can use the GMM function in the ClusterR package too.
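A minimal sketch of that, assuming 3 components as above (GMM() expects a matrix input; the field names of the returned object may vary slightly across ClusterR versions):
gmm_fit <- GMM(as.matrix(data$value), gaussian_comps = 3)
pred <- predict_GMM(as.matrix(data$value), gmm_fit$centroids,
                    gmm_fit$covariance_matrices, gmm_fit$weights)
head(pred$cluster_labels)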

Setting layers for a Dynamic Bayesian Network with bnstruct in R

I am currently creating a DBN using the bnstruct package in R. I have 9 variables at each of 6 time steps, a mix of biotic and abiotic variables, and I want to prevent the biotic variables from being parents of the abiotic variables. For a Bayesian network this is pretty easy to implement, using for instance layering = c(1,1,2,2,2) in learn.dynamic.network().
But a problem arises for the dynamic part: I would like to keep preventing biotic variables from being parents of abiotic ones at every time step, while also preventing edges from appearing between any variables from t+1 to t.
If I use, in layering =:
1 for abiotic variables at t1,
2 for biotic variables at t1,
3 for abiotic variables at t2,
4 for biotic variables at t2, and so on,
then I allow biotic variables from t-1 to explain the abiotic variables at t (and I don't want that).
So I tried:
## 9 variables for 6 time steps
test1 <- BNDataset(data = timedData,
discreteness = rep('d', 54),
variables = colnames(timedData),
node.sizes = rep(c(3,3,3,2,2,3,3,3,3), 6)
# num.time.steps = 6
)
## the first 5 variables are abiotic, the last 4 are biotic
dbn <- learn.dynamic.network(test1,
num.time.steps = 6,
layering = rep(c(1,1,1,1,1,2,2,2,2),6))
So now I don't have any edges from biotic to abiotic (that's nice), but I do have edges from variable_t(n+1) to variable_t(n).
I know that in bnlearn you can create a "blacklist" of edges that you don't want to see, but I don't see any equivalent argument in bnstruct. Any ideas?
With the mmhc algorithm that is used by default, you can use the layer.struct parameter to specify which pairs of layers are allowed to have edges between them. layer.struct takes a binary matrix where cell i,j is 1 if there can be edges going from variables in layer i to variables in layer j, and 0 otherwise.
The best way to use this is to combine it with the manually-specified layering of your first solution.
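For instance, with just the two layers of the static case (abiotic = 1, biotic = 2), a matrix that allows abiotic -> biotic edges but forbids the reverse would look like this sketch (the object name is arbitrary):
# rows are source layers, columns are target layers; 1 allows edges i -> j
ls2 <- matrix(c(1, 1,   # abiotic -> abiotic, abiotic -> biotic allowed
                0, 1),  # biotic -> abiotic forbidden, biotic -> biotic allowed
              nrow = 2, byrow = TRUE)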
Perfect, the combination of both arguments layering = and layer.struct = does what I wanted.
I post what I used here just to provide an example:
## DBN study
dbn <- learn.dynamic.network(test1,
num.time.steps = 6,
layering = rep(c(1,1,1,1,1,2,2,2,2, # set 2 layers per time step
3,3,3,3,3,4,4,4,4,
5,5,5,5,5,6,6,6,6,
7,7,7,7,7,8,8,8,8,
9,9,9,9,9,10,10,10,10,
11,11,11,11,11,12,12,12,12)),
layer.struct = matrix(c(1,0,0,0,0,0,0,0,0,0,0,0, ## allow certain layers to connect to others by hand
1,1,0,0,0,0,0,0,0,0,0,0,
1,0,1,0,0,0,0,0,0,0,0,0,
1,1,1,1,0,0,0,0,0,0,0,0,
1,0,1,0,1,0,0,0,0,0,0,0,
1,1,1,1,1,1,0,0,0,0,0,0,
1,0,1,0,1,0,1,0,0,0,0,0,
1,1,1,1,1,1,1,1,0,0,0,0,
1,0,1,0,1,0,1,0,1,0,0,0,
1,1,1,1,1,1,1,1,1,1,0,0,
1,0,1,0,1,0,1,0,1,0,1,0,
1,1,1,1,1,1,1,1,1,1,1,1), nrow=12, ncol=12))
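The same 12 x 12 matrix can also be built programmatically instead of by hand; a sketch, assuming (as above) that odd layers are abiotic and even layers are biotic:
# allow layer i -> layer j only when i <= j (never backwards in time),
# and forbid biotic (even) sources into abiotic (odd) targets
ls <- matrix(0L, 12, 12)
for (i in 1:12) for (j in i:12) if (j %% 2 == 0 || i %% 2 == 1) ls[i, j] <- 1L
# ls should reproduce the hand-written matrix above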
Thanks for the quick answer and the package btw

Why does the gstat predict() function often return NaN values (gstat package)? (R version 3.3.2, Windows 10)

I am trying to simulate a combination of two different random fields (yy1 and yy2, with different means and correlation lengths) with an irregular boundary, using the gstat package in R. I have attached a picture of my expected outcome. The code does not give such output consistently: I frequently get at least one of yy1 and yy2 as NaNs, which results in the undesired output shown in the image.
The key steps I used are:
1) Created two gstat objects with different means and psills (rf1 and rf2).
2) Created two computational grids (one for each random field) as data frames with two variables, the "x" and "y" coordinates.
3) Predicted the two random fields using unconditional simulation.
Any help in this regard would be highly appreciated.
Attachments: 2 images (link provided) and 1 R code
1) Expected Outcome
2) Undesired Outcome
library(gstat)
xy <- expand.grid(1:150, 1:200) # 150 x 200 grid is created in the form of a dataframe with x and y vectors
names(xy)<-c('x','y') # giving names to the variables
# creating gstat objects
rf1 <- gstat(formula=z~1, locations=~x+y, dummy=TRUE, beta=c(1,0,0), model=vgm(psill=0.025, range=5, model='Exp'), nmax=20) # dummy=TRUE treats this as an unconditional simulation
rf2 <- gstat(formula=z~1, locations=~x+y, dummy=TRUE, beta=c(4,0,0), model=vgm(psill=0.025, range=10, model='Exp'), nmax=20) # dummy=TRUE treats this as an unconditional simulation
# creating the two computational grids
rows<-nrow(xy)
xy_shift <- expand.grid(60:90, 75:100)
names(xy_shift)<-c('x','y')
library(dplyr) # for antijoin
xy1<-xy[1:(rows/2),]
xy1<-anti_join(xy1, xy_shift, by = c("x","y")) # creating the irregular boundary
xy2<-rbind(xy[(rows/2+1):rows,],xy_shift)
library(sp)
yy1<- predict(rf1, newdata=xy1, nsim=1) # random field 1
yy2<- predict(rf2, newdata=xy2, nsim=1) # random field 2
rf1_label<-gl(1,length(yy1[,1]),labels="field1")
rf2_label<-gl(1,length(yy2[,1]),labels="field2")
yy1<-cbind(yy1,field=rf1_label)
yy2<-cbind(yy2,field=rf2_label)
yy<-rbind(yy1,yy2)
yyplot<-yy[,c(1,2,3)]
# plotting the field
gridded(yyplot) = ~x+y
spplot(obj=yyplot[1],scales=list(draw = TRUE))
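For reference, a quick check of whether a given run hit the problem, using the objects above:
# count NaN simulated values in each field (the simulation column is sim1)
sum(is.nan(yy1$sim1)); sum(is.nan(yy2$sim1))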

Applying ezANOVA error work-around to Long Format data

I have a similar problem to the one described here:
https://stats.stackexchange.com/questions/58435/repeated-measures-error-in-r-ezanova-using-more-levels-than-subjects-balanced-d
Here is an example of what my dataframe looks like:
Participant Visual Audio StimCondition Accuracy
1 Bottom Circle 1st 2 Central Beeps AO2 0.92
1 SIM Circle Left Beep AO2 0.86
2 Bottom Circle 1st 2 Central Beeps CT4 0.12
2 SIM Circle Left Beep CT4 0.56
I have 3 Visual conditions, 5 Audio conditions & 5 StimConditions & 12 participants exposed to all conditions.
When I run the following ezANOVA:
Model <- ezANOVA(data = Shaped.means, dv = .(Accuracy), wid = .(Participant), within = .(Visual, Audio, StimCondition), type = 3, detailed = TRUE)
I get the same error as in the linked question above. I have tried changing type to 1, and it does return output, but without the sphericity test.
I've tried to apply the solution from the linked question to my dataset, but as mine is in long format I'm a bit lost as to what exactly I need to do to get the desired stats.
I'll keep playing with it on my end, but if anyone could help in the meantime it would be much appreciated.
Thanks.
Following the linked question, you don't have to change much. Assuming your dataset is exactly as you describe, the following should work for you.
Let's first create a dataset to reflect your description
set.seed(123) ## make reproducible
N <- 12 ## number of Participants
S <- 5 ## number of StimCondition groups
V <- 3 ## number of Visual groups
A <- 5 ## number of Audio groups
Accuracy <- round(runif(N*V*S*A), 2) ## random accuracy values in [0,1], one per design cell
init.Df <- expand.grid(Participant=gl(N,1),
Visual=gl(V, 1),
Audio=gl(A, 1),
StimCondition=gl(S,1))
df <- cbind(init.Df, Accuracy)
Now we have a dataframe with 3 Visual conditions, 5 Audio conditions, 5 StimConditions, and 12 participants exposed to all conditions. This should be the stage you are currently at. We can do the between-subjects call easily:
# If you just read in the data set and don't know how many subjects
# N <- length(unique(df$Participant))
fit <- lm(matrix(df[,c("Accuracy")], nrow=N) ~ 1)
For the within-subjects factors, this is the only real change. You simply generate your within-subjects design and pass it to anova:
library(car)
# You can create your within design table
# You can get these values from your dataset as well
# V <- nlevels(df$Visual)
# A <- nlevels(df$Audio)
# S <- nlevels(df$StimCondition)
# If you want the labels with gl, you can use the levels function (e.g. labels=levels(df$Visual))
inDf <- expand.grid(Visual=gl(V, 1),
Audio=gl(A, 1),
StimCondition=gl(S,1))
# Test for Visual
anova(fit, M=~Visual, X=~1, idata=inDf, test="Spherical")
# Test for Audio
anova(fit, M=~Visual+Audio, X=~Visual, idata=inDf, test="Spherical")
# Test for Visual:Audio interaction
anova(fit, M=~Visual+Audio+Visual:Audio, X=~Visual+Audio, idata=inDf, test="Spherical")
#etc...
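The remaining effects follow the same pattern: each call adds one term to M= and uses the previous M= as X=. For example, a sketch for the StimCondition main effect:
# Test for StimCondition: previous design as X, new term added to M
anova(fit, M=~Visual+Audio+Visual:Audio+StimCondition,
      X=~Visual+Audio+Visual:Audio, idata=inDf, test="Spherical")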
