K-means: Initial centers are not distinct - r

I am using the GA package, and my aim is to find the optimal initial centroid positions for the k-means clustering algorithm. My data is a sparse matrix of word TF-IDF scores and is downloadable here. Below are some of the stages I have implemented:
0. Libraries and dataset
library(clusterSim) ## for index.DB()
library(GA) ## for ga()
corpus <- read.csv("Corpus_EnglishMalay_tfidf.csv") ## a dataset of 5000 x 1168
1. Binary encoding and generating the initial population.
k_min <- 15
initial_population <- function(object) {
  ## generate a population in which exactly k_min cluster bits are switched on
  init <- t(replicate(object@popSize,
                      sample(rep(c(1, 0), c(k_min, object@nBits - k_min)))))
  return(init)
}
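As a sanity check, every chromosome generated this way should contain exactly k_min ones. A small standalone check (a sketch using popSize = 100 and nBits = nrow(corpus), the same values passed to ga() below):
## standalone check of the population generator, outside ga()
popSize <- 100
nBits <- nrow(corpus)
init <- t(replicate(popSize, sample(rep(c(1, 0), c(k_min, nBits - k_min)))))
dim(init)                     ## popSize x nBits
all(rowSums(init) == k_min)   ## TRUE: each chromosome switches on exactly k_min bits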
2. Fitness function minimizing the Davies-Bouldin (DB) index, where I evaluate the DBI for each solution generated by initial_population.
DBI2 <- function(x) {
  ## x is a binary solution vector of length nBits
  ## exclude the first column of corpus
  initial_centroid <- corpus[x == 1, -1]
  cl <- kmeans(corpus[-1], initial_centroid)
  dbi <- index.DB(corpus[-1], cl = cl$cluster, centrotypes = "centroids")
  score <- -dbi$DB
  return(score)
}
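For reference, a single chromosome can also be evaluated outside ga() (a minimal sketch, reusing corpus, k_min and DBI2 from above):
## evaluate one random chromosome directly; this may already reproduce the error
## below if the sampled rows happen to contain duplicated documents
set.seed(1)
x <- sample(rep(c(1, 0), c(k_min, nrow(corpus) - k_min)))
DBI2(x)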
3. Running the GA with these settings.
g2 <- ga(type = "binary",
         fitness = DBI2,
         population = initial_population,
         selection = ga_rwSelection,
         crossover = gabin_spCrossover,
         pcrossover = 0.8,
         pmutation = 0.1,
         popSize = 100,
         nBits = nrow(corpus),
         seed = 123)
4. The problem.
Error in kmeans(corpus[-1], initial_centroid) : initial centers are not distinct
I found a similar problem here, where the user also had to pass in the number of clusters dynamically through a parameter. It was solved by hard-coding the number of clusters. In my case, however, I really need to pass the number of clusters in dynamically, since it comes from a randomly generated binary vector whose 1's represent the initial centroids.
Checking the kmeans() source code, I noticed that the error is caused by duplicated centers:
if (any(duplicated(centers)))
    stop("initial centers are not distinct")
I edited the kmeans() function with trace() to print out the centers being checked for duplication. The output:
[1] "206" "520" "564" "1803" "2059" "2163" "2652" "2702" "3195" "3206" "3254" "3362" "3375"
[14] "4063" "4186"
This shows no duplication among the row indices of the randomly selected initial_centroid, and I have no idea why this error keeps occurring. Is there anything else that could lead to this error?
P.S.: I do understand that some may suggest GA + k-means is not a good idea, but I hope to finish what I have started. It is better to view this as a k-means problem (at least for solving the "initial centers are not distinct" error).

Genetic algorithms are not well suited to optimizing k-means, by the nature of the problem: the initialization seeds interact too much, so a GA will not do better than taking a random sample of all possible seeds.
So my main advice is to not use genetic algorithms at all here!
If you insist, what you need to do is detect the bad parameters and simply return a bad score for a bad initialization, so that it doesn't "survive".
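A minimal sketch of that guard, reusing the DBI2 fitness from the question (the function name DBI2_safe and the penalty of -1e6 are my choices, not from any package; any value clearly worse than a real score will do):
DBI2_safe <- function(x) {
  initial_centroid <- corpus[x == 1, -1]
  ## bad initialization: duplicated (or too few) centers get a clearly bad fitness
  if (any(duplicated(initial_centroid)) || nrow(initial_centroid) < 2) {
    return(-1e6)
  }
  cl <- kmeans(corpus[-1], initial_centroid)
  dbi <- index.DB(corpus[-1], cl = cl$cluster, centrotypes = "centroids")
  -dbi$DB
}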

To answer your question just do:
any(corpus[520, -1] != corpus[564, -1])
Rows 520 and 564 of corpus are identical; the only difference is in the row.names attribute. See:
identical(colnames(corpus[520, -1]), colnames(corpus[564, -1])) # just to be sure
rownames(corpus[520, -1])
rownames(corpus[564, -1])
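Since distinct row indices can therefore still select identical observations, one option (a sketch, assuming duplicate documents can simply be dropped for your task) is to deduplicate the corpus before running the GA:
## drop documents whose feature columns (everything except the first column) repeat
dup <- duplicated(corpus[, -1])
sum(dup)                        ## how many duplicated documents there are
corpus_unique <- corpus[!dup, ]
## then use corpus_unique inside DBI2 and nrow(corpus_unique) as nBits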
Regarding the GA and k-means, see e.g.:
Bashar Al-Shboul, Myaeng Sung-Hyon, "Initializing K-Means using Genetic Algorithms", World Academy of Science, Engineering & Technology, Jun2009, Issue 30, p. 114, (especially section II B); or
Bain Khusul Khotimah, Firli Irhamni, and Tri Sundarwati, "A Genetic Algorithm for Optimized Initial Centers K-Means Clustering in SMEs", Journal of Theoretical and Applied Information Technology, 2016, Vol. 90, No. 1.

Related

topicmodels has inverted functions $topics and $terms. Is it reliable?

I have a vector of strings (which represent preprocessed documents) on which I want to estimate an LDA model through R. I use functions in the topicmodels library.
For the purpose of making reproduction of the problem easy, I create a vector with three documents, and impose 5 topics in the LDA model. The full code is as follows:
#install.packages("tm")
library("tm")
#install.packages("topicmodels")
library("topicmodels")
vector_of_speeches<- c("feder reserv commit use full rang tool support us economi challeng time therebi promot maxemploy pricest goal", "progress strong polici support indic economicact employ continu strengthen sector advers affect pandem improv recent month continu affect covid job gain solid recent month unemploymentr declin substanti suppli demand imbal relat pandem economi continu contribut elev level inflat overal financialcondit remain accommod part reflect polici measur support economi flow credit us household busi","path economi continu depend cours viru progress eas suppli expect support continu gain economicact employ reduct inflat risk economicoutlook remain includ new viru")
df <- as.data.frame(vector_of_speeches)
myCorpus <- Corpus(VectorSource(df$vector_of_speeches))
dtm <- TermDocumentMatrix(myCorpus)
inspect(dtm) # 3 documents and 68 different words
#LDA prep
burnin <- 4000
iter <- 4000
keep <- 50
k<-5
delta_gibbs <- 0.025
alpha_gibbs <- 50/k
seed=0
fomc_LDA <- LDA(dtm, k=k, method = "Gibbs", control = list(seed=seed, burnin = burnin, iter = iter, keep = keep))
str(as.matrix(posterior(fomc_LDA)$terms)) #dimension is 5 x 3, so the number of topics is being related with the number of documents
str(as.matrix(posterior(fomc_LDA)$topics)) # dimension is 68 x 5, so the number of unique words is being related with the number of topics
The function that extracts the topic distribution per document is $topics, and the one that extracts the vocabulary distribution per topic is $terms. However, they are clearly inverted in the code above (the topic distribution is actually extracted by the $terms function). Why is this occurring, and is it safe to use the topic distributions per document that are returned by the $terms function?
When I use the full vector of documents (almost 2000), I tried transposing the document-term matrix by writing dtm <- t(dtm), but then running the LDA model yields the following error:
Error in LDA(dtm, k = k, method = "Gibbs", control = list(seed = seed, :
Each row of the input matrix needs to contain at least one non-zero entry
Why does this occur? It is strange that the $topics and $terms outputs seem inverted, and I am not sure whether I can rely on the $terms function to obtain the correct topic distributions per document (which is what I need).
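For reference, LDA() is documented to take a DocumentTermMatrix (documents in rows, terms in columns), so a minimal sketch of rebuilding the input in that orientation and dropping empty documents, reusing the objects above, would be:
## use a DocumentTermMatrix instead of a TermDocumentMatrix
dtm <- DocumentTermMatrix(myCorpus)
## drop any documents without terms, which otherwise trigger the
## "needs to contain at least one non-zero entry" error on the full corpus
library(slam)
dtm <- dtm[row_sums(dtm) > 0, ]
fomc_LDA <- LDA(dtm, k = k, method = "Gibbs",
                control = list(seed = seed, burnin = burnin, iter = iter, keep = keep))
dim(posterior(fomc_LDA)$topics)  ## documents x topics
dim(posterior(fomc_LDA)$terms)   ## topics x terms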

How to perform k-means clustering in R

I am trying to explore a credit card fraud dataset to learn R and also k-means clustering, but I encountered an issue while trying to find the optimal number of clusters. Unfortunately, not much can be found on Google about that warning, or even about how to perform k-means clustering in R. I would like to know what the warning is about, and why does the result show only 1 cluster? Thanks in advance!
Code:
data = read.csv("creditcard.csv")
scaled_data <- scale(data )
wss <- (nrow(scaled_data)-1)*sum(apply(scaled_data,2,var))
for (i in 2:100) wss[i] <- sum(kmeans(scaled_data, centers=i)$withiness)
plot(1:100, wss, type='b', xlab="Clusters", ylab="WSS")
Warning:
Warning messages:
1: Quick-TRANSfer stage steps exceeded maximum (= 14240350)
2: did not converge in 10 iterations
3: Quick-TRANSfer stage steps exceeded maximum (= 14240350)
4: did not converge in 10 iterations
You have several issues with your code. Let's go through it using an example data set available in R, since you did not provide reproducible data:
data(iris)
scaled_iris <- scale(iris[, -5])
Since the data have been scaled, all of the variances are 1 so this is all you need to compute the total:
wss <- sum(colSums(scaled_iris^2))
wss
# [1] 596
Now the clustering. I'll include the argument that @mhovd mentions with its default value (there is no argument for convergence). If you get the warning, increase iter.max= to 15 or 20 or more. This does not guarantee that your results for any number of groups are optimal. To increase the chances of that, you should use the nstart= argument and set a value of 5 or more:
for (i in 2:100) wss[i] <- kmeans(scaled_iris, centers=i, iter.max=10)$tot.withinss
head(wss);tail(wss)
# [1] 596.00000 220.87929 138.88836 113.97017 104.98669 81.03783
# [1] 3.188483 2.688470 2.716485 2.535701 2.497792 2.116150
plot(wss, type='b', xlab="Clusters", ylab="WSS")
Note you misspelled withinss and you did not realize that kmeans returns their sum as tot.withinss. It is always good to read the manual page ?kmeans. Note that you do not need 1:100 since the plot function will automatically supply consecutive integers if you provide only one vector.
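Here is the same loop with the nstart= suggestion added (a sketch; the values 20 and 5 are just examples, not tuned):
## multiple random starts reduce the chance of a poor local optimum
for (i in 2:100) {
  wss[i] <- kmeans(scaled_iris, centers = i, iter.max = 20, nstart = 5)$tot.withinss
}
plot(wss, type = 'b', xlab = "Clusters", ylab = "WSS")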

Data perturbation - How to perform it?

I am doing some projects related to statistics simulation in R, based on "Introduction to Scientific Programming and Simulation Using R". In the student projects section (chapter 24) I am working on "The pipe spiders of Brunswick" problem, but I am stuck on one part of an evolutionary algorithm where you need to perform some data perturbation according to the sentence below:
"With probability 0.5 each element of the vector is perturbed, independently
of the others, by an amount normally distributed with mean 0 and standard
deviation 0.1"
What does being "perturbed" really mean here? I don't really know which operation I should apply to my vector to make this perturbation happen, and I'm not finding any answers to this problem.
Thanks in advance!
# using the most important features, we create a ML model:
m1 <- lm(PREDICTED_VALUE ~ PREDICTER_1 + PREDICTER_2 + PREDICTER_N )
#summary(m1)
#anova(m1)
# after creating the model, we perturb as follows:
#install.packages("perturb") #install the package
library(perturb)
set.seed(1234) # for same results each time you run the code
p1_new <- perturb(m1, pvars = c("PREDICTER_1", "PREDICTER_N"), prange = c(1, 1), niter = 200) # you can change the number of iterations to any value n; the total number of iterations will then be n+1
p1_new # check the values of p1_new
summary(p1_new)
Perturbing just means adding a small, noisy shift to a number. Your code might look something like this.
x = sample(10, 10)
ind = rbinom(length(x), 1, 0.5) == 1
x[ind] = x[ind] + rnorm(sum(ind), 0, 0.1)
rbinom gets the elements to be modified with probability 0.5 and rnorm adds the perturbation.
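Wrapped up as a reusable helper (a sketch; the name perturb_vec is mine and not from any package):
perturb_vec <- function(x, p = 0.5, sd = 0.1) {
  ## each element is perturbed independently with probability p
  ind <- rbinom(length(x), 1, p) == 1
  x[ind] <- x[ind] + rnorm(sum(ind), mean = 0, sd = sd)
  x
}
set.seed(1234)
perturb_vec(sample(10, 10))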

Kappa Statistic Extremely Large/Sparse matrix

I have a large sparseMatrix (mat):
138493 x 17694 sparse Matrix of class "dgCMatrix", with 10000132 entries
I want to investigate inter-rater agreement using kappa statistics, but when I run Fleiss' kappa:
kappam.fleiss(mat)
I am shown the following error
Error in asMethod(object) :
Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
Is this due to my matrix being too large?
Are there any other methods I can use to calculate kappa statistics for IRR on a matrix this large?
The best answer I can offer is that this is not really possible, due to the extreme sparsity of your matrix. The problem: with 10,000,132 entries in a 138,493 x 17,694 = 2,450,495,142 cell matrix, you have mostly (99.59%) missing values. The irr package allows for these, but here you are placing some extreme demands on the system by asking it to compare ratings for users whose films do not overlap.
This is compounded by the fact that the methods in the irr package (a) require dense matrices as input, and (b) (at least kripp.alpha()) loop over columns, making them very slow.
Here is an illustration constructing a matrix similar in nature to yours (but with no pattern - in reality your situation will be better because viewers tend to rate similar sets of movies).
Note that I used Krippendorff's alpha here, since it allows for ordinal or interval ratings (as your data suggests), and normally handles missing data fine.
require(Matrix)
require(irr)
seed <- 100
(sparseness <- 1 - 10000132 / (138493 * 17694))
## [1] 0.9959191
138493 / 17694 # multiple of movies to users
## [1] 7.827117
# nraters <- 17694
# nusers <- 138493
nmovies <- 100
nusers <- 783
raterMatrix <-
  Matrix(sample(c(NA, seq(0, 5, by = .5)), nmovies * nusers, replace = TRUE,
                prob = c(sparseness, rep((1 - sparseness)/11, 11))),
         nrow = nmovies, ncol = nusers)
kripp.alpha(t(as.matrix(raterMatrix)), method = "interval")
## Krippendorff's alpha
##
## Subjects = 100
## Raters = 783
## alpha = -0.0237
This worked for a matrix of that size, but when I increased it 100x (10x on each dimension), keeping the same proportions as in your reported dataset, it failed to produce an answer even after 30 minutes, so I killed the process.
What to conclude: you are not really asking the right question of this data. It's not an issue of how many users agreed, but rather of what sort of dimensions exist in this data in terms of clusters of viewing and clusters of preferences. You probably want to use association rules or some dimension-reduction method that doesn't balk at the sparsity of your dataset.
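For instance, a truncated SVD of your sparse rating matrix (a sketch using the irlba package, which accepts dgCMatrix input directly; the choice of 10 latent dimensions is arbitrary, and unrated cells are simply treated as zeros here, which is a simplification):
library(irlba)
## partial SVD keeping 10 latent dimensions
svd10 <- irlba(mat, nv = 10)
dim(svd10$u)  ## 138493 x 10, user factors
dim(svd10$v)  ## 17694 x 10, movie factors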

Document Term Matrix for Naive Bayes classifier: unexpected results in R

I'm having some very annoying problems getting a Naive Bayes Classifier to work with a document term matrix. I'm sure I'm making a very simple mistake but can't figure out what it is. My data is from accounts spreadsheets. I've been asked to figure out which categories (in text format: mostly names of departments or names of budgets) are more likely to spend money on charities and which ones mostly (or only) spend on private companies. They suggested I use Naive Bayes classifiers to do this. I have a thousand or so rows of data to train a model and many hundreds of thousands of rows to test the model against. I have prepared the strings, replacing spaces with underscores and ands/&s with +, then treated each category as one term: so 'alcohol and drug addiction' becomes: alcohol+drug_addiction.
Some example rows:
"environment+housing strategy+commissioning third_party_payments supporting_ppl_block_gross_chargeable" -> This row went to a charity
"west_north_west customer+tenancy premises h.r.a._special_maintenance" -> This row went to a private company.
Using this example as a template, I wrote the following function to come up with my document term matrix (using tm), both for training and test data.
library(tm)
library(e1071)
getMatrix <- function(chrVect){
  testsource <- VectorSource(chrVect)
  testcorpus <- Corpus(testsource)
  testcorpus <- tm_map(testcorpus, stripWhitespace)
  testcorpus <- tm_map(testcorpus, removeWords, stopwords("english"))
  testmatrix <- t(TermDocumentMatrix(testcorpus))
}
trainmatrix <- getMatrix(traindata$cats)
testmatrix <- getMatrix(testdata$cats)
So far, so good. The problem is when I try to (a) apply a Naive Bayes model and (b) predict from that model. Using the klaR package, I get a zero-probability error, since many of the terms have zero instances in one category, and playing around with the Laplace terms does not seem to fix this. Using e1071, the model worked, but when I tested the model using:
model <- naiveBayes(as.matrix(trainmatrix),as.factor(traindata$Code))
rs<- predict(model, as.matrix(testdata$cats))
... every single item was predicted as the same category, even though they should be roughly equal. Something in the model clearly isn't working. Looking at some of the terms in model$tables, I can see that many have high values for private and zero for charity, and others vice versa. I have used as.factor for the code.
output:
rs 1 2
1 0 0
2 19 17
Any ideas on what is going wrong? Do dtm matrices not play nice with naivebayes? Have I missed a step out in preparing the data? I'm completely out of ideas. Hope this is all clear. Happy to clarify if not. Any suggestions would be much appreciated.
I have had this problem myself. You have done (as far as I can see) everything right; the Naive Bayes implementation in e1071 (and thus klaR) is buggy.
But there is an easy and quick fix so that Naive Bayes as implemented in e1071 works again: you should change your text vectors to categorical variables, i.e. as.factor. You have already done this with your target variable traindata$Code, but you also have to do it for your trainmatrix and, of course, for your test data.
I could not track the bug down completely, but it lies in this part of the naive Bayes implementation in e1071 (I may note that klaR is only a wrapper around e1071):
L <- log(object$apriori) + apply(log(sapply(seq_along(attribs),
    function(v) {
        nd <- ndata[attribs[v]]
        ## nd is now a cell, row i, column attribs[v]
        if (is.na(nd) || nd == 0) {
            rep(1, length(object$apriori))
        } else {
            prob <- if (isnumeric[attribs[v]]) {
                ## we select table for attribute
                msd <- object$tables[[v]]
                ## if stddev is eqlt eps, assign threshold
                msd[, 2][msd[, 2] <= eps] <- threshold
                dnorm(nd, msd[, 1], msd[, 2])
            } else {
                object$tables[[v]][, nd]
            }
            prob[prob <= eps] <- threshold
            prob
        }
    })), 1, sum)
You see that there is an if-else condition: if we have no numerics, naive Bayes is used as we expect it to work. If we have numerics (and here comes the bug), this naive Bayes automatically assumes a normal distribution. If you only have 0 and 1 in your text, dnorm pretty much sucks. I assume that, due to the very low values created by dnorm, the probabilities are always replaced by the threshold, and thus the variable with the higher a priori factor will always "win".
However, if I understand your problem correctly, you do not even need prediction, but rather the a priori factor for identifying which department gives money to whom. Then all you have to do is take a close look at your model: for every term, the model contains the a priori probability, which is what I assume you are looking for. Let's do this, and the aforementioned fix, with a slightly modified version of your sample:
## i have changed the vectors slightly
first <- "environment+housing strategy+commissioning third_party_payments supporting_ppl_block_gross_chargeable"
second <- "west_north_west customer+tenancy premises h.r.a._special_maintenance"
categories <- c("charity", "private")
library(tm)
library(e1071)
getMatrix <- function(chrVect){
  testsource <- VectorSource(chrVect)
  testcorpus <- Corpus(testsource)
  testcorpus <- tm_map(testcorpus, stripWhitespace)
  testcorpus <- tm_map(testcorpus, removeWords, stopwords("english"))
  ## testmatrix <- t(TermDocumentMatrix(testcorpus))
  ## instead just use DocumentTermMatrix; the assignment is superfluous
  return(DocumentTermMatrix(testcorpus))
}
## since you did not supply some more data, I cannot do anything about these lines
## trainmatrix <- getMatrix(traindata$cats)
## testmatrix <- getMatrix(testdata$cats)
## instead only
trainmatrix <- getMatrix(c(first, second))
## I prefer this to as.matrix, as I can add the categories more easily
traindf <- data.frame(categories, as.data.frame(inspect(trainmatrix)))
## now turn every term column into a factor (the numeric columns are what trigger the buggy dnorm branch)
for (cols in names(traindf[-1])) traindf[[cols]] <- factor(traindf[[cols]])
## traindf <- apply(traindf, 2, as.factor) did not result in factors
## check if it's as we wished
str(traindf)
## it is
## let's create a model (with formula syntax)
model <- naiveBayes(categories~., data=traindf)
## if you look at the output (doubled to see it more clearly)
predict(model, newdata=rbind(traindf[-1], traindf[-1]))
But as I have already said, you do not need to predict. A look at the model is enough; e.g. model$tables$premises will give you the likelihood of premises giving money to private corporations: 100%.
If you are dealing with very large datasets, you should specify threshold and eps when predicting. eps defines the limit below which the threshold is substituted; e.g. eps = 0 and threshold = 0.000001 can be of use.
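A minimal sketch of passing them (they are arguments of predict.naiveBayes, reusing the model and traindf from above):
predict(model, newdata = rbind(traindf[-1], traindf[-1]),
        eps = 0, threshold = 0.000001)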
Furthermore, you should stick to term-frequency weighting; tf-idf, for example, will not work due to the dnorm in the naive Bayes.
Hope I can finally get my 50 reputation :P
