How to perform k-mean clustering in R - r

I am trying to explore a creditcard fraud dataset to learn R and also k-means clustering. But I encountered an issue while getting the optimal number of clusters. Unfortunately, not many findings about that error or even how to performing kmeans clustering in R can be google. I would like to know what's the warning about? And why the result only show 1 cluster? Thanks in advance!
Code:
data = read.csv("creditcard.csv")
scaled_data <- scale(data )
wss <- (nrow(scaled_data)-1)*sum(apply(scaled_data,2,var))
for (i in 2:100) wss[i] <- sum(kmeans(scaled_data, centers=i)$withiness)
plot(1:100, wss, type='b', xlab="Clusters", ylab="WSS")
Warning:
Warning messages:
1: Quick-TRANSfer stage steps exceeded maximum (= 14240350)
2: did not converge in 10 iterations
3: Quick-TRANSfer stage steps exceeded maximum (= 14240350)
4: did not converge in 10 iterations

You have several issues with your code. Let's go through it using an example data set available on R since you did not provide reproducible data:
data(iris)
scaled_iris <- scale(iris[, -5])
Since the data have been scaled, all of the variances are 1 so this is all you need to compute the total:
wss <- sum(colSums(scaled_iris^2))
wss
# [1] 596
Now the the clustering. I'll include the argument that #mhovd mentions with its default value (there is no argument for convergence). If you get the warning increase iter.max= to 15 or 20 or more. This does not guarantee that your results for any number of groups are optimal. To increase the chances of that you should use the nstart= argument and set a value of 5 or more:
for (i in 2:100) wss[i] <- kmeans(scaled_iris, centers=i, iter.max=10)$tot.withinss
head(wss);tail(wss)
# [1] 596.00000 220.87929 138.88836 113.97017 104.98669 81.03783
# [1] 3.188483 2.688470 2.716485 2.535701 2.497792 2.116150
plot(wss, type='b', xlab="Clusters", ylab="WSS")
Note you misspelled withinss and you did not realize that kmeans returns their sum as tot.withinss. It is always good to read the manual page ?kmeans. Note that you do not need 1:100 since the plot function will automatically supply consecutive integers if you provide only one vector.

Related

Louvain community detection in R using igraph - format of edges and vertices

I have a correlation matrix of scores that I would like to run community detection on using the Louvain method in igraph, in R. I converted the correlation matrix to a distance matrix using cor2dist, as below:
distancematrix <- cor2dist(correlationmatrix)
This gives a 400 x 400 matrix of distances from 0-2. I then made the list of edges (the distances) and vertices (each of the 400 individuals) using the below method from http://kateto.net/networks-r-igraph (section 3.1).
library(igraph)
test <- as.matrix(distancematrix)
mode(test) <- "numeric"
test2 <- graph.adjacency(test, mode = "undirected", weighted = TRUE, diag = TRUE)
E(test2)$weight
get.edgelist(test2)
From this I then wrote csv files of the 'from' and 'to' edge list, and corresponding weights:
edgeweights <-E(test2)$weight
write.csv(edgeweights, file = "edgeweights.csv")
fromtolist <- get.edgelist(test2)
write.csv(fromtolist, file = "fromtolist.csv")
From these two files I produced a .csv file called "nodes.csv" which simply had all the vertex IDs for the 400 individuals:
id
1
2
3
4
...
400
And a .csv file called "edges.csv", which detailed 'from' and 'to' between each node, and provided the weight (i.e. the distance measure) for each of these edges:
from to weight
1 2 0.99
1 3 1.20
1 4 1.48
...
399 400 0.70
I then tried to use this node and edge list to create an igraph object, and run louvain clustering in the following way:
nodes <- read.csv("nodes.csv", header = TRUE, as.is = TRUE)
edges <- read.csv("edges.csv", header = TRUE, as.is = TRUE)
clustergraph <- graph_from_data_frame(edges, directed = FALSE, vertices = nodes)
clusterlouvain <- cluster_louvain(clustergraph)
Unfortunately this did not do the louvain community detection correctly. I expected this to return around 2-4 different communities, which could be plotted similarly to here, but sizes(clusterlouvain) returned:
Community sizes
1
400
indicating that all individuals were sorted into the same community. The clustering also ran immediately (i.e. with almost no computation time), which also makes me think it was not working correctly.
My question is: Can anyone suggest why the cluster_louvain method did not work and identified just one community? I think I must be specifying the distance matrix or edges/nodes incorrectly, or in some other way not giving the correct input to the cluster_louvain method. I am relatively new to R so would be very grateful for any advice. I have successfully used other methods of community detection on the same distance matrix (i.e. k-means) which identified 2-3 communities, but would like to understand what I have done wrong here.
I'm aware there are multiple other queries about using igraph in R, but I have not found one which explicitly specifies the input format of the edges and nodes (from a correlation matrix) to get the louvain community detection working correctly.
Thank you for any advice! I can provide further information if helpful.
I believe that cluster_louvain did exactly what it should do with your data.
The problem is your graph.Your code included the line get.edgelist(test2). That must produce a lot of output. Instead try, this
vcount(test2)
ecount(test2)
Since you say that your correlation matrix is 400x400, I expect that you will
get that vcount gives 400 and ecount gives 79800 = 400 * 399 / 2. As you have
constructed it, every node is directly connected to all other nodes. Of course there is only one big community.
I suspect that what you are trying to do is group variables that are correlated.
If the correlation is near zero, the variables should be unconnected. What seems less clear is what to do with variables with correlation near -1. Do you want them to be connected or not? We can do it either way.
You do not provide any data, so I will illustrate with the Ionosphere data from
the mlbench package. I will try to mimic your code pretty closely, but will
change a few variable names. Also, for my purposes, it makes no sense to write
the edges to a file and then read them back again, so I will just directly
use the edges that are constructed.
First, assuming that you want variables with correlation near -1 to be connected.
library(igraph)
library(mlbench) # for Ionosphere data
library(psych) # for cor2dist
data(Ionosphere)
correlationmatrix = cor(Ionosphere[, which(sapply(Ionosphere, class) == 'numeric')])
distancematrix <- cor2dist(correlationmatrix)
DM1 <- as.matrix(distancematrix)
## Zero out connections where there is low (absolute) correlation
## Keeps connection for cor ~ -1
## You may wish to choose a different threshhold
DM1[abs(correlationmatrix) < 0.33] = 0
G1 <- graph.adjacency(DM1, mode = "undirected", weighted = TRUE, diag = TRUE)
vcount(G1)
[1] 32
ecount(G1)
[1] 140
Not a fully connected graph! Now let's find the communities.
clusterlouvain <- cluster_louvain(G1)
plot(G1, vertex.color=rainbow(3, alpha=0.6)[clusterlouvain$membership])
If instead, you do not want variables with negative correlation to be connected,
just get rid of the absolute value above. This should be much less connected
DM2 <- as.matrix(distancematrix)
## Zero out connections where there is low correlation
DM2[correlationmatrix < 0.33] = 0
G2 <- graph.adjacency(DM2, mode = "undirected", weighted = TRUE, diag = TRUE)
clusterlouvain <- cluster_louvain(G2)
plot(G2, vertex.color=rainbow(4, alpha=0.6)[clusterlouvain$membership])

K-means: Initial centers are not distinct

I am using the GA Package and my aim is to find the optimal initial centroids positions for k-means clustering algorithm. My data is a sparse-matrix of words in TF-IDF score and is downloadable here. Below are some of the stages I have implemented:
0. Libraries and dataset
library(clusterSim) ## for index.DB()
library(GA) ## for ga()
corpus <- read.csv("Corpus_EnglishMalay_tfidf.csv") ## a dataset of 5000 x 1168
1. Binary encoding and generate initial population.
k_min <- 15
initial_population <- function(object) {
## generate a population to turn-on 15 cluster bits
init <- t(replicate(object#popSize, sample(rep(c(1, 0), c(k_min, object#nBits - k_min))), TRUE))
return(init)
}
2. Fitness Function Minimizes Davies-Bouldin (DB) Index. Where I evaluate DBI for each solution generated from initial_population.
DBI2 <- function(x) {
## x is a vector of solution of nBits
## exclude first column of corpus
initial_centroid <- corpus[x==1, -1]
cl <- kmeans(corpus[-1], initial_centroid)
dbi <- index.DB(corpus[-1], cl=cl$cluster, centrotypes = "centroids")
score <- -dbi$DB
return(score)
}
3. Running GA. With these settings.
g2<- ga(type = "binary",
fitness = DBI2,
population = initial_population,
selection = ga_rwSelection,
crossover = gabin_spCrossover,
pcrossover = 0.8,
pmutation = 0.1,
popSize = 100,
nBits = nrow(corpus),
seed = 123)
4. The problem. Error in kmeans(corpus[-1], initial_centroid) : initial centers are not distinct`.
I found a similar problem here, where the user also had to used a parameter to dynamically pass in the number of clusters to use. It was solve by hard-coding the number of clusters. However for my case, I really need to dynamically pass in the number of clusters, since it is coming in from a randomly generated binary vector, where those 1's will represent the initial centroids.
Checking with the kmeans() code, I noticed that the error is caused by duplicated centers:
if(any(duplicated(centers)))
stop("initial centers are not distinct")
I edited the kmeans function with trace to print out the duplicated centers. The output:
[1] "206" "520" "564" "1803" "2059" "2163" "2652" "2702" "3195" "3206" "3254" "3362" "3375"
[14] "4063" "4186"
Which shows no duplication in the randomly selected initial_centroids and I have no idea why this error keeps occurring. Is there anything else that would lead to this error?
P/S: I do understand some may suggest GA + K-means is not a good idea. But I do hope to finish what I have started. It is better to view this problem as a K-means problem (well at least in solving the initial centers are not distinct error).
Genetic algorithms are not well suited for optimizing k-means by the nature of the problem - initialization seeds interact too much, ga will not be better than taking a random sample of all possible seeds.
So my main advise is to not use genetic algorithms at all here!
If you insist, what you would need to do is detect the bad parameters, then simply return a bad score for bad initialization so they don't "survive".
To answer your question just do:
any(corpus[520, -1] != corpus[564, -1])
Your 520 and 564 rows of corpus are the same, with the only difference in an attribute row.names, see:
identical(colnames(corpus[520, -1]), colnames(corpus[564, -1])) # just to be sure
rownames(corpus[520, -1])
rownames(corpus[564, -1])
Regarding the GA and k-means, see e.g.:
Bashar Al-Shboul, Myaeng Sung-Hyon, "Initializing K-Means using Genetic Algorithms", World Academy of Science, Engineering & Technology, Jun2009, Issue 30, p. 114, (especially section II B); or
BAIN KHUSUL KHOTIMAH, FIRLI IRHAMNI, AND TRI SUNDARWATI, "A GENETIC ALGORITHM FOR OPTIMIZED INITIAL CENTERS K-MEANS CLUSTERING IN SMEs", Journal of Theoretical and Applied Information Technology, 2016, Vol. 90, No. 1

Kappa Statistic Extremely Large/Sparse matrix

I have a large sparseMatrix (mat):
138493 x 17694 sparse Matrix of class "dgCMatrix", with 10000132 entries
I want to investigate Inter-rating agreement using kappa statistics but when I run Fleiss:
kappam.fleiss(mat)
I am shown the following error
Error in asMethod(object) :
Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
Is this due to my matrix being too large?
Is there any other methods I can use to calculate kappa statistics for IRR on a matrix this large?
The best answer that I can offer is that this is not really possible due to the extreme sparsity in your matrix. The problem: With 10,000,132 entries for a 138,493 * 17694 = 2,450,495,142 cell matrix, you have mostly (99.59%) missing values. The irr package allows for these but here you are placing some extreme demands on the system, by asking it to compare ratings for users whose films do not overlap.
This is compounded by the problem that the methods in the irr package a) require dense matrixes as input, and b) (at least in kripp.alpha() loop over columns making them very slow.
Here is an illustration constructing a matrix similar in nature to yours (but with no pattern - in reality your situation will be better because viewers tend to rate similar sets of movies).
Note that I used Krippendorff's alpha here, since it allows for ordinal or interval ratings (as your data suggests), and normally handles missing data fine.
require(Matrix)
require(irr)
seed <- 100
(sparseness <- 1 - 10000132 / (138493 * 17694))
## [1] 0.9959191
138493 / 17694 # multiple of movies to users
## [1] 7.827117
# nraters <- 17694
# nusers <- 138493
nmovies <- 100
nusers <- 783
raterMatrix <-
Matrix(sample(c(NA, seq(0, 5, by = .5)), nmovies * nusers, replace = TRUE,
prob = c(sparseness, rep((1-sparseness)/11, 11))),
nrow = nmovies, ncol = nusers)
kripp.alpha(t(as.matrix(raterMatrix)), method = "interval")
## Krippendorff's alpha
##
## Subjects = 100
## Raters = 783
## alpha = -0.0237
This worked for that size matrix, but fails if I increase it 100x (10x on each dimension), keeping the same proportions as in your reported dataset, then it fails to produce an answer after even 30 minutes, so I killed the process.
What to conclude: You are not really asking the right question of this data. It's not an issue of how many users agreed, but probably what sort of dimensions exist in this data in terms of clusters of viewing and clusters of preferences. You probably want to use association rules or some dimensional reduction methods that don't balk at the sparsity in your dataset.

Fractal Package in R - Does it work?

When I run the example code for a random walk using the fractal package:
library(fractal)
## create test series
set.seed(100)
x <- rnorm(1024)
walk <- cumsum(x)
## calculate the Hurst coefficient of a random
## walk series using various techniques
methods <- c("aggabs","aggvar","diffvar","higuchi")
z <- lapply(methods, function(method, walk){
hurstBlock(ifelse1(method=="higuchi",diff(walk),walk), method=method)
},walk=walk )
names(z) <- methods
## plot results
old.plt <- splitplot(2,2,1)
for (i in 1:4){
if (i > 1)
splitplot(2,2,i)
plot(z[[i]], key=FALSE)
mtext(paste(attr(z[[i]],"stat.name"), round(as.numeric(z[[i]]),3), sep=", H="),
line=0.5, adj=1)
}
par(old.plt)
I do not get hurst exponent of .5? I understand the there are different method of calculations but I believe AggVar is most like the R/S rescale method and for a brownian motion this should return .5 not .94 like it does.
SHouldn't it be .5?
Also if you change
walk <- cumsum(x) to
walk <- x Then this is white noise and I should get a Hurst Exponent of 0. But I dont.
Can you help clarify?
I have done extensive testing of the three most prominent Hurst exponent libraries since my answer above and posted a video on youtube, However, from fractal, the most common, Hurst estimator used is the hurstpec from fractal and the hurstexp from the pracma library. The other estimator from PerformanceAnalytics econometrics finance library does not seem to be right no matter what estimate you use. I left on the Authors Github maintenance site an error correction request or check, I have not checked back to see the response, I left the request 4 months ago.
iwm_rt <- dailyReturn(IWM, type ="log")
head(iwm_rt)
```{r}
library(pracma)
library(fractal)
x <- iwm_rt
hurstBlock(x, method="aggAbs", scale.min=8, scale.max=NULL,
scale.ratio=9, weight=function(x) rep(1,length(x)), fit=lm)
fit_aggab <- hurstBlock(x, method="aggAbs", scale.min=8, scale.max=NULL,
scale.ratio=9, weight=function(x) rep(1,length(x)), fit=lm)
plot(fit_aggab)
RoverS(x, n.block.min=22, scale.ratio=9, scale.min=8)
hurstSpec(x, method="standard", freq.max=0.25, dc=FALSE, n.block=NULL,
weight=function(x) rep(1,length(x)), fit=lm, sdf.method="direct")
hurstexp(x)
The aggabs method from hurstblock in fractal is an aggregate estimate from a large sample time series, the most crucial factor in fitting the aggabs estimator is the scale.ratio estimate I solve it by trial and error which is a bit tedious but I do not know of a more scientific method, the scale. ratio is the multiplier for each subsequent series of samples from the time series

Chi squared goodness of fit for a geometric distribution

As an assignment I had to develop and algorithm and generate a samples for a given geometric distribution with PMF
Using the inverse transform method, I came up with the following expression for generating the values:
Where U represents a value, or n values depending on the size of the sample, drawn from a Unif(0,1) distribution and p is 0.3 as stated in the PMF above.
I have the algorithm, the implementation in R and I already generated QQ Plots to visually assess the adjustment of the empirical values to the theoretical ones (generated with R), i.e., if the generated sample follows indeed the geometric distribution.
Now I wanted to submit the generated sample to a goodness of fit test, namely the Chi-square, yet I'm having trouble doing this in R.
[I think this was moved a little hastily, in spite of your response to whuber's question, since I think before solving the 'how do I write this algorithm in R' problem, it's probably more important to deal with the 'what you're doing is not the best approach to your problem' issue (which certainly belongs where you posted it). Since it's here, I will deal with the 'doing it in R' aspect, but I would urge to you go back an ask about the second question (as a new post).]
Firstly the chi-square test is a little different depending on whether you test
H0: the data come from a geometric distribution with parameter p
or
H0: the data come from a geometric distribution with parameter 0.3
If you want the second, it's quite straightforward. First, with the geometric, if you want to use the chi-square approximation to the distribution of the test statistic, you will need to group adjacent cells in the tail. The 'usual' rule - much too conservative - suggests that you need an expected count in every bin of at least 5.
I'll assume you have a nice large sample size. In that case, you'll have many bins with substantial expected counts and you don't need to worry so much about keeping it so high, but you will still need to choose how you will bin the tail (whether you just choose a single cut-off above which all values are grouped, for example).
I'll proceed as if n were say 1000 (though if you're testing your geometric random number generation, that's pretty low).
First, compute your expected counts:
dgeom(0:20,.3)*1000
[1] 300.0000000 210.0000000 147.0000000 102.9000000 72.0300000 50.4210000
[7] 35.2947000 24.7062900 17.2944030 12.1060821 8.4742575 5.9319802
[13] 4.1523862 2.9066703 2.0346692 1.4242685 0.9969879 0.6978915
[19] 0.4885241 0.3419669 0.2393768
Warning, dgeom and friends goes from x=0, not x=1; while you can shift the inputs and outputs to the R functions, it's much easier if you subtract 1 from all your geometric values and test that. I will proceed as if your sample has had 1 subtracted so that it goes from 0.
I'll cut that off at the 15th term (x=14), and group 15+ into its own group (a single group in this case). If you wanted to follow the 'greater than five' rule of thumb, you'd cut it off after the 12th term (x=11). In some cases (such as smaller p), you might want to split the tail across several bins rather than one.
> expec <- dgeom(0:14,.3)*1000
> expec <- c(expec, 1000-sum(expec))
> expec
[1] 300.000000 210.000000 147.000000 102.900000 72.030000 50.421000
[7] 35.294700 24.706290 17.294403 12.106082 8.474257 5.931980
[13] 4.152386 2.906670 2.034669 4.747562
The last cell is the "15+" category. We also need the probabilities.
Now we don't yet have a sample; I'll just generate one:
y <- rgeom(1000,0.3)
but now we want a table of observed counts:
(x <- table(factor(y,levels=0:14),exclude=NULL))
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 <NA>
292 203 150 96 79 59 47 25 16 10 6 7 0 2 5 3
Now you could compute the chi-square directly and then calculate the p-value:
> (chisqstat <- sum((x-expec)^2/expec))
[1] 17.76835
(pval <- pchisq(chisqstat,15,lower.tail=FALSE))
[1] 0.2750401
but you can also get R to do it:
> chisq.test(x,p=expec/1000)
Chi-squared test for given probabilities
data: x
X-squared = 17.7683, df = 15, p-value = 0.275
Warning message:
In chisq.test(x, p = expec/1000) :
Chi-squared approximation may be incorrect
Now the case for unspecified p is similar, but (to my knowledge) you can no longer get chisq.test to do it directly, you have to do it the first way, but you have to estimate the parameter from the data (by maximum likelihood or minimum chi-square), and then test as above but you have one fewer degree of freedom for estimating the parameter.
See the example of doing a chi-square for a Poisson with estimated parameter here; the geometric follows the much same approach as above, with the adjustments as at the link (dealing with the unknown parameter, including the loss of 1 degree of freedom).
Let us assume you've got your randomly-generated variates in a vector x. You can do the following:
x <- rgeom(1000,0.2)
x_tbl <- table(x)
x_val <- as.numeric(names(x_tbl))
x_df <- data.frame(count=as.numeric(x_tbl), value=x_val)
# Expand to fill in "gaps" in the values caused by 0 counts
all_x_val <- data.frame(value = 0:max(x_val))
x_df <- merge(all_x_val, x_df, by="value", all.x=TRUE)
x_df$count[is.na(x_df$count)] <- 0
# Get theoretical probabilities
x_df$eprob <- dgeom(x_df$val, 0.2)
# Chi-square test: once with asymptotic dist'n,
# once with bootstrap evaluation of chi-sq test statistic
chisq.test(x=x_df$count, p=x_df$eprob, rescale.p=TRUE)
chisq.test(x=x_df$count, p=x_df$eprob, rescale.p=TRUE,
simulate.p.value=TRUE, B=10000)
There's a "goodfit" function described as "Goodness-of-fit Tests for Discrete Data" in package "vcd".
G.fit <- goodfit(x, type = "nbinomial", par = list(size = 1))
I was going to use the code you had posted in an earlier question, but it now appears that you have deleted that code. I find that offensive. Are you using this forum to gather homework answers and then defacing it to remove the evidence? (Deleted questions can still be seen by those of us with sufficient rep, and the interface prevents deletion of question with upvoted answers so you should not be able to delete this one.)
Generate a QQ Plot for testing a geometrically distributed sample
--- question---
I have a sample of n elements generated in R with
sim.geometric <- function(nvals)
{
p <- 0.3
u <- runif(nvals)
ceiling(log(u)/log(1-p))
}
for which i want to test its distribution, specifically if it indeed follows a geometric distribution. I want to generate a QQ PLot but have no idea how to.
--------reposted answer----------
A QQ-plot should be a straight line when compared to a "true" sample drawn from a geometric distribution with the same probability parameter. One gives two vectors to the functions which essentially compares their inverse ECDF's at each quantile. (Your attempt is not particularly successful:)
sim.res <- sim.geometric(100)
sim.rgeom <- rgeom(100, 0.3)
qqplot(sim.res, sim.rgeom)
Here I follow the lead of the authors of qqplot's help page (which results in flipping that upper curve around the line of identity):
png("QQ.png")
qqplot(qgeom(ppoints(100),prob=0.3), sim.res,
main = expression("Q-Q plot for" ~~ {G}[n == 100]))
dev.off()
---image not included---
You can add a "line of good fit" by plotting a line through through the 25th and 75th percentile points for each distribution. (I added a jittering feature to this to get a better idea where the "probability mass" was located:)
sim.res <- sim.geometric(500)
qqplot(jitter(qgeom(ppoints(500),prob=0.3)), jitter(sim.res),
main = expression("Q-Q plot for" ~~ {G}[n == 100]), ylim=c(0,max( qgeom(ppoints(500),prob=0.3),sim.res )),
xlim=c(0,max( qgeom(ppoints(500),prob=0.3),sim.res )))
qqline(sim.res, distribution = function(p) qgeom(p, 0.3),
prob = c(0.25, 0.75), col = "red")

Resources