K-means clustering doesn't find all clusters in data

The set of data I am using is shown below. As one can see, you would think k-means cluster analysis would find the centers of these clusters easily.
However, when I run a k-means cluster analysis and plot the centers, I get this.
I am using just the basic kmeans code:
cluster <- kmeans(mydata,90)
cluster$centers

A little-known fact about kmeans is that, to get reliable results, you need to run the algorithm repeatedly with many random initializations. I typically use kmeans(..., nstart = 1000).
In theory, the kmeans++ algorithm does not suffer as much from the initialization problem, but I often find that kmeans with many random restarts performs better than kmeans++. Still, you might want to try kmeans++ using the flexclust R package.
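For reference, a minimal version of that call against the question's mydata (the 90 centers come from the question; iter.max = 100 is just a guess to avoid non-convergence warnings):
# Many random restarts: kmeans keeps the run with the lowest tot.withinss
fit <- kmeans(mydata, centers = 90, nstart = 1000, iter.max = 100)
fit$centers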

As I mentioned in the comment, using hclust() to find the centers might be a viable approach.
set.seed(1)
l <- 1e4
# Simulate a 10 x 13 grid of tight clusters
v1 <- sample(1:10, l, replace = TRUE) + rnorm(l, 0, 0.05)
v2 <- sample(1:13, l, replace = TRUE) + rnorm(l, 0, 0.05)
dtf <- data.frame(v1, v2)
par(mar = c(2, 2, 1, 1))
plot(dtf, pch = 16, cex = 0.2, col = "#00000044")
# Plain kmeans with random initialization may miss or merge some centers (red)
km <- kmeans(dtf, 10 * 13)
points(km$centers, cex = 2, lwd = 0.5, col = "red")
# Use hierarchical clustering to get initial centers, then refine with kmeans (blue)
hc <- hclust(dist(dtf))
hc <- cutree(hc, 10 * 13)
hcent <- aggregate(dtf, list(hc), mean)[, -1]
hckm <- kmeans(dtf, hcent)
points(hckm$centers, cex = 3, lwd = 0.5, col = "blue")

This data set would likely be clustered much better by DBSCAN.
Choose epsilon smaller than the distance between clusters (e.g., 10); MinPts should not matter much then, e.g., MinPts = 4.
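As a hedged sketch (not from the original answer), here is DBSCAN via the dbscan package, run on the synthetic dtf grid from the answer above; the eps value is tuned to that example's unit spacing, not to the original data:
library(dbscan)

# eps well below the grid spacing of 1; minPts = 4 as suggested above
db <- dbscan(as.matrix(dtf), eps = 0.3, minPts = 4)
table(db$cluster)  # cluster 0, if present, is noise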

Related

K-means iterated on the same data 10 times

I am new to R. I am trying to see whether I can improve a k-means result (in R) by calling the k-means routine repeatedly on the same data set with the same value of K (k = 3 in my case), say 10-15 times, and checking whether that gives better results. I see the clustering change at every call; the total sum of squares and withinss change as well, but I am not sure how to stop at the best solution.
Can anyone guide me?
code:
run_kmeans <- function(xtimes)
{
  for (x in 1:xtimes)
  {
    kmeans_results <- kmeans(filtered_data, 3)
    print(kmeans_results["totss"])
    print(kmeans_results["tot.withinss"])
  }
  return(kmeans_results)
}
kmeans_results <- run_kmeans(10)
Not sure I understood your question, because this is not the usual way of selecting the best partition (elbow method, silhouette method, etc.).
Let's say you want to find the kmeans partition that minimizes your within-cluster sum of squares.
Let's take the example from ?kmeans
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
You could write something like this to run kmeans repeatedly:
xtimes <- 10
km_list <- lapply(seq_len(xtimes), function(i) {
  kmeans(x, 3)
})
lapply is generally preferable to a for loop here; it returns a list. To extract tot.withinss and see which run is minimal:
perf <- sapply(km_list, function(d) d$tot.withinss)
which.min(perf)
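For completeness (not in the original answer): kmeans() can do this selection for you via nstart, which runs several random initializations and keeps the one with the lowest tot.withinss:
# Equivalent to the manual loop above: best of 10 random starts
best <- kmeans(x, centers = 3, nstart = 10)
best$tot.withinss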
However, unless I misunderstood your objective, this is a strange way to select the best-performing partition. Usually it is the number of clusters that is evaluated, not different partitions produced from the same sample data with the same number of clusters.
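For reference, a minimal sketch of that more usual workflow, the elbow heuristic over the number of clusters (the range 1:10 and nstart = 25 are arbitrary choices):
# Total within-cluster sum of squares as a function of k
ks  <- 1:10
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)
plot(ks, wss, type = "b", xlab = "k", ylab = "total within-cluster SS")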
Edit from your comment
OK, so you want to find the combination of columns that gives you the best performance. Below is an example where every pairwise combination of three variables is tested. You could generalize this a bit, but the number of possible combinations of 8 variables is very large, so you would need a routine to reduce the number of tested combinations.
x <- rbind(matrix(rnorm(150, sd = 0.3), ncol = 3),
           matrix(rnorm(150, mean = 1, sd = 0.3), ncol = 3))
colnames(x) <- c("x", "y", "z")
combinations <- combn(colnames(x), 2, simplify = FALSE)
km_list <- lapply(combinations, function(i) {
  kmeans(x[, i], 3)
})
perf <- sapply(km_list, function(d) d$tot.withinss)
which.min(perf)

Clustering using categorical and continuous data together

I am trying to create an unsupervised model with categorical and continuous data together. I think I have worked it out, but is this the correct way to do this?
Load Libraries
library(tidyr)
library(dummies)
library(fastDummies)
library(cluster)
library(dplyr)
Create sample data set
set.seed(3)
sampleData <- data.frame(id = 1:50,
                         gender = sample(c("Male", "Female"), 10, replace = TRUE),
                         age_bracket = sample(c("0-10", "11-30", "31-60", ">60"), 10, replace = TRUE),
                         income = rnorm(10, 40, 10),
                         volume = rnorm(50, 40, 100))
Create sparse matrix and scale
sd1 <- sampleData %>%
  dummy_cols(select_columns = c("gender", "age_bracket")) %>%
  mutate(id = factor(id)) %>%
  select(-c(gender, age_bracket)) %>%
  mutate_if(is.numeric, scale)
glimpse(sd1)
Generate a k-means model using the pam() function with a k = 3
sd2 <- pam(sd1, k =3)
Extract the vector of cluster assignments from the model
sd3 <- sd2$cluster
Build the segment_customers dataframe
sd4 <- mutate(sd1, cluster = sd3)
Calculate the size of each cluster
count(sd4, cluster)
Dummy coding of variables is fairly standard, but I am not a fan of it. In many cases this IMHO introduces a large bias and hinders interpretability.
In your case, you may additionally be applying standardization to the dummy variables, which makes that bias even worse.
Your text claims to use k-means, but uses PAM. These are not the same. PAM is IMHO a better choice here, because of interpretability, and the ability to use other metrics such as Manhattan. The resulting cluster "centers" are data points, not means.
I recommend going down to the mathematical level. PAM tries to minimize the sum of distances to the centers. Now put in the distance you use, e.g., Manhattan. Now substitute the standardization and dummy encoding in there, and you get the actual problem your approach tries to solve. Now have a critical look at this (probably quite large) term: is that helpful for your problem, or are you optimizing the wrong function?
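To make the metric point concrete, here is a minimal, self-contained sketch of PAM with the Manhattan metric (the toy matrix and k = 3 are placeholders, not the question's data):
library(cluster)

set.seed(1)
toy <- matrix(rnorm(100), ncol = 2)           # placeholder numeric data
pm  <- pam(toy, k = 3, metric = "manhattan")  # L1 distance instead of Euclidean
pm$medoids                                    # the "centers" are actual observations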

ROC curve based on means and variances of controls and cases

Does anyone know of an R package (or any other statistical freeware or just a piece of code) that lets you plot a smooth ROC curve knowing only the means and variances of the control and case groups? That is, one that doesn't require a dataset with specific classifier values and test outcomes. I found a couple of online graph plotters that do just that:
https://kennis-research.shinyapps.io/ROC-Curves/ ,
http://arogozhnikov.github.io/2015/10/05/roc-curve.html
Any help appreciated
I don't think you need any fancy package for this. You can just use simple probability functions in base R.
# Means and variances of the control (1) and case (2) groups
m1 <- 0
m2 <- 2
v1 <- 4
v2 <- 4
# Grid of decision thresholds
range <- seq(-10, 10, length.out = 200)
d1 <- pnorm(range, m1, sd = sqrt(v1))  # CDF of controls
d2 <- pnorm(range, m2, sd = sqrt(v2))  # CDF of cases
tpr <- 1 - d2  # true positive rate (sensitivity)
fpr <- 1 - d1  # false positive rate (1 - specificity)
plot(fpr, tpr, xlim = 0:1, ylim = 0:1, type = "l")
abline(0, 1, lty = 2)  # chance line
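Not part of the original answer, but as a quick sanity check under the same binormal assumptions, the area under this ROC curve has a closed form, AUC = pnorm((m2 - m1) / sqrt(v1 + v2)):
# Closed-form AUC for the binormal model: P(case score > control score)
pnorm((m2 - m1) / sqrt(v1 + v2))  # about 0.76 for these means and variances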

Finding Sample Size

I am attempting to use several methods (Wald, Wilson, Clopper-Pearson, Jeffreys, etc.) to calculate sample sizes for confidence intervals. I have been unable to find, in R, how to calculate these. Is there a better way to calculate these besides brute force? Does R have a package that will output all to compare?
I have been unsuccessful with the likes of n.clopper.pearson{GenBinomApps} and some of these require lots of by-hand computations. I have done this for the Wald method:
#Variables
z <- 1.95996
d <- .05
p <- 0.5
q <- 1 - p
#Wald
n_wald <- (z^2 * (p*q))/(d^2)
n_wald
But I have not been able to find a way, besides guess-and-check methods, to produce the others in R.
I was able to answer my own question with help from the comments:
n_wald <- ciss.wald(p, d, alpha = 0.05)
n_wilson <- ciss.wilson(p, d, alpha = 0.05)
n_agresticoull <- ciss.agresticoull(p, d, alpha = 0.05)
These were from the binomSamSize package. I am still struggling with an optimization for Clopper-Pearson and Jeffreys if anyone can provide direction there, but these commands calculated the sample sizes easily.
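In case it helps, here is a minimal brute-force sketch for the Clopper-Pearson case (not from the original answer; it assumes you want the smallest n whose exact interval, evaluated at x = round(n * p), has half-width at most d):
# Brute-force search for the Clopper-Pearson sample size.
# Assumption: the interval is evaluated at x = round(n * p) and we require
# a half-width of at most d at confidence level 1 - alpha.
ciss_clopper_pearson <- function(p, d, alpha = 0.05, n_max = 1e5) {
  for (n in 2:n_max) {
    x  <- round(n * p)
    lo <- if (x == 0) 0 else qbeta(alpha / 2, x, n - x + 1)
    hi <- if (x == n) 1 else qbeta(1 - alpha / 2, x + 1, n - x)
    if ((hi - lo) / 2 <= d) return(n)
  }
  NA_integer_
}
ciss_clopper_pearson(p = 0.5, d = 0.05, alpha = 0.05)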

R: Generate data from a probability density distribution

Say I have a simple array, with a corresponding probability distribution.
library(stats)
data <- c(0,0.08,0.15,0.28,0.90)
pdf_of_data <- density(data, from= 0, to=1, bw=0.1)
Is there a way I could generate another set of data from the same distribution? As the operation is probabilistic, the new data need not exactly match the initial distribution; it just needs to be generated from it.
I did have success finding a simple solution on my own. Thanks!
Your best bet is to generate the empirical cumulative distribution function, approximate its inverse, and then transform uniform draws through it.
The compound expression looks like this:
random.points <- approx(
  cumsum(pdf_of_data$y) / sum(pdf_of_data$y),  # empirical CDF on the density grid
  pdf_of_data$x,                               # corresponding x values
  runif(10000)                                 # uniform draws mapped through the inverse CDF
)$y
Yields
hist(random.points, 100)
From the examples in the documentation of ?density you (almost) get the answer.
So, something like this should do it:
library("stats")
data <- c(0,0.08,0.15,0.28,0.90)
pdf_of_data <- density(data, from= 0, to=1, bw=0.1)
# From the example.
N <- 1e6
x.new <- rnorm(N, sample(data, size = N, replace = TRUE), pdf_of_data$bw)
# Histogram of the draws with the distribution superimposed.
hist(x.new, freq = FALSE)
lines(pdf_of_data)
You can just reject the draws outside your interval as in rejection sampling.
Alternatively, you can use the algorithm described in the link.
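A minimal sketch of that rejection step, continuing from the x.new draws above (the [0, 1] bounds mirror the from/to values used in density()):
# Keep only draws inside the [0, 1] support used when estimating the density
x.new <- x.new[x.new >= 0 & x.new <= 1]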
To draw from the curve:
# Weighted sample of the density grid points, with weights given by the estimated density heights
sample(pdf_of_data$x, 1e6, TRUE, pdf_of_data$y)
