K-means: wrong size of clusters in R

I am running the k-means algorithm in R on the Heart Disease UCI dataset. I expect to get 2 clusters of sizes 138 and 165, matching the two classes in the dataset.
Steps:
Store dataset in a data frame:
df <- read.csv(".../heart.csv",fileEncoding = "UTF-8-BOM")
Extract the features:
features = subset(df, select = -target)
Normalize it:
normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
features = data.frame(sapply(features, normalize))
Run the algorithm:
set.seed(0)
cluster = kmeans(features, 2)
cluster$size
Output:
[1] 99 204
Why?

It seems like you're focusing on the size of the clusters rather than the accuracy of your predictions. You may well get two clusters of size (138, 165) but not necessarily the same clusters as the 'target' column in the data.
A better way of judging performance is the accuracy of your predictions. In your case, your model accuracy is 72%. You can see this by:
library(caret) # for confusionMatrix()
df$label <- cluster$cluster - 1
confusionMatrix(table(df$target, df$label))
#Confusion Matrix and Statistics
#
# 0 1
# 0 76 62
# 1 23 142
#
# Accuracy : 0.7195
# ...
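Also note that kmeans() labels its clusters 1 and 2 arbitrarily, so cluster 1 may line up with either value of target. A quick way to guard against an accidental label flip (a sketch, assuming target is coded 0/1 as in the UCI data) is to score both assignments and keep the better one:
acc1 <- mean((cluster$cluster - 1) == df$target) # cluster 1 -> 0, cluster 2 -> 1
acc2 <- mean((2 - cluster$cluster) == df$target) # cluster 1 -> 1, cluster 2 -> 0
max(acc1, acc2)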
I was able to get better accuracy by standardizing the data rather than normalizing it, possibly because standardization is less sensitive to outliers than min-max normalization.
I also dummy-coded the categorical-looking variables, which seems to have improved the accuracy. We now have 85% accuracy and the cluster sizes are closer to what we expect (143, 160). Although, as discussed, on its own the cluster size is meaningless.
library(dplyr)
library(fastDummies)
library(caret)
standardize <- function(x){
num <- x - mean(x, na.rm=T)
denom <- sd(x, na.rm=T)
num/denom
}
# dummy-code and standardize
features <- select(df, -target) %>%
dummy_cols(select_columns = c('cp','thal', 'ca'),
remove_selected_columns = T,remove_first_dummy = T) %>%
mutate_all(standardize)
set.seed(0)
cluster <- kmeans(features, centers = 2, nstart = 50)
cluster$size
# 143 160
# check predictions vs actual labels
df$label <- cluster$cluster -1
confusionMatrix(table(df$target, df$label))
#Confusion Matrix and Statistics
#
#
# 0 1
# 0 117 21
# 1 26 139
#
# Accuracy : 0.8449
Of course, there are other accuracy metrics worth considering too such as out-of-sample accuracy (split your data into training and test sets, and calculate accuracy of predictions on your test set), and f1-score.
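In recent versions of caret, the confusionMatrix() object already carries per-class precision, recall and F1 alongside accuracy, so one way to pull those out (a sketch reusing the objects created above) is:
cm <- confusionMatrix(table(df$target, df$label))
cm$byClass[c("Precision", "Recall", "F1")]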

Here is an example that should help you get things straightened out.
library(tidyverse) # data manipulation
library(cluster) # clustering algorithms
library(factoextra) # clustering algorithms & visualization
df <- USArrests
df <- na.omit(df)
df <- scale(df)
distance <- get_dist(df)
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
k2 <- kmeans(df, centers = 2, nstart = 25)
str(k2)
fviz_cluster(k2, data = df)
k3 <- kmeans(df, centers = 3, nstart = 25)
k4 <- kmeans(df, centers = 4, nstart = 25)
k5 <- kmeans(df, centers = 5, nstart = 25)
# plots to compare
p1 <- fviz_cluster(k2, geom = "point", data = df) + ggtitle("k = 2")
p2 <- fviz_cluster(k3, geom = "point", data = df) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point", data = df) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point", data = df) + ggtitle("k = 5")
library(gridExtra)
grid.arrange(p1, p2, p3, p4, nrow = 2)
set.seed(123)
# function to compute total within-cluster sum of square
wss <- function(k) {
kmeans(df, k, nstart = 10 )$tot.withinss
}
# Compute and plot wss for k = 1 to k = 15
k.values <- 1:15
# extract wss for k = 1 to 15
wss_values <- map_dbl(k.values, wss)
plot(k.values, wss_values,
type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K",
ylab="Total within-clusters sum of squares")
set.seed(123)
fviz_nbclust(df, kmeans, method = "wss")
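The average silhouette width is another common criterion for choosing k; factoextra exposes it through the same helper:
fviz_nbclust(df, kmeans, method = "silhouette")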
# Compute k-means clustering with k = 4
set.seed(123)
final <- kmeans(df, 4, nstart = 25)
print(final)
fviz_cluster(final, data = df)
https://uc-r.github.io/kmeans_clustering

Related

Plotting from a for loop

set.seed(666999)
PLOT <- function(n){
n_samples<-10
#Creating matrix to store generated data
data_matrix<-matrix(ncol =n_samples, nrow= n)
for(j in 1:n_samples){
#Generating 10 standard normal samples of size n
data_matrix[ , j]<-rnorm(n=n, mean = 0, sd = 1)
}
for(k in 1:n_samples){
sam_20<-data_matrix[ , k]
# extracting each of the kth samples
#Plotting
Q_Qplot<-qqnorm( sam_20 )
Q_QplotL<-qqline( sam_20 )
#Q_Qplot<-gg_qqplot(sam_20, ylab="Sample Quantiles",
#xlab = "Theoretical Quantiles",
#main= bquote("Q-Q plot for sample size of "
# ~ n == ~ .(n)))#quantile-quantile plot
}
# return(n)
#return(Q_Qplot)
#return(Q_QplotL)
}
#layout_matrix_1 <- matrix(1:10, ncol = 5) # Define position matrix
#layout(layout_matrix_1)
PLOT(100)
PLOT(50)
PLOT(20)
PLOT(10)
For each sample size, I am generating ten plots. I want the ten plots for each call to fit on one page with reasonable height and width, but for some reason I can't get the layout to work the way I want. I would also welcome help with ggplot2, which tends to produce nicer-looking output. Thanks
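In base graphics, one way to keep each call's ten Q-Q plots on a single page is to set a 2 x 5 panel layout before the plotting loop. A minimal sketch (PLOT_page() is a hypothetical wrapper, not from the question):
PLOT_page <- function(n, n_samples = 10) {
  op <- par(mfrow = c(2, 5), mar = c(4, 4, 2, 1)) # 2 x 5 grid of panels on one page
  on.exit(par(op)) # restore graphics settings when done
  for (k in seq_len(n_samples)) {
    sam <- rnorm(n) # one standard-normal sample of size n
    qqnorm(sam, main = paste0("n = ", n, ", sample ", k))
    qqline(sam)
  }
}
PLOT_page(20)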
An approach with tidyverse
library(tidyverse)
# sim_data() was not defined in the original answer; assuming it simply
# draws a sample of n standard-normal values:
sim_data <- function(n) rnorm(n)
#Values to simulate
sample_size <- c(10,20,50,100)
sample_num <- 1:10
#Create data.frame of every sample_size / replicate combination
expand_grid(sample_size,sample_num) %>%
#Map values to simulate scenarios
mutate(y = map(sample_size,sim_data)) %>%
unnest(cols = c(y)) %>%
ggplot(aes(sample = y))+
#qqplot
stat_qq()+
#qqline with color blue
stat_qq_line(col = "blue")+
#facet sample_size x sample_num
facet_grid(cols = vars(sample_size),rows = vars(sample_num))
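If the faceted figure then needs to fit on a page of a particular size, the dimensions can be set when saving, for example (file name is just an illustration):
ggsave("qq_grid.pdf", width = 8, height = 10) # saves the last ggplot at the given size (inches)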

How to plot vector of bootstrapped slopes in ggplot2?

I've been using ggplot2 to plot the results of bootstrapping various statistical outputs such as correlation coefficients. Most recently, I bootstrapped the slope of a linear regression model. Here's how that looks using the plot() function from the graphics package:
plot(main="Relationship Between Eruption Length and Wait Time at\nOld Faithful With Bootstrapped Regression Lines",
xlab = "Eruption Length (minutes)",
ylab = "Wait Time (minutes)",
waiting ~ eruptions,
data = faithful,
col = spot_color,
pch = 19)
index <- 1:nrow(faithful)
for (i in 1:10000) {
index_boot <- sample(index, replace = TRUE) # getting a bootstrap sample (of indices)
faithful_boot <- faithful[index_boot, ]
# Fitting the linear model to the bootstrapped data:
fit.boot <- lm(waiting ~ eruptions, data = faithful_boot)
abline(fit.boot, lwd = 0.1, col = rgb(0, 0.1, 0.25, alpha = 0.05)) # Add line to plot
}
fit <- lm(waiting ~ eruptions, data=faithful)
abline(fit, lwd = 2.5, col = "blue")
That works, but it depends on a workflow where we first create a plot, then add the lines in a loop. I'd rather create a list of slopes with a function and then plot all of them in ggplot2.
For example, the function might look something like this:
set.seed(777) # included so the following output is reproducible
n_resample <- 10000 # set the number of times to resample the data
# First argument is the data; second is the number of resampled datasets
bootstrap <- function(df, n_resample) {
slope_resample <- matrix(NA, nrow = n_resample) # initialize vector
index <- 1:nrow(df) # create an index for supplied table
for (i in 1:n_resample) {
index_boot <- sample(index, replace = TRUE) # sample row numbers, with replacement
df_boot <- df[index_boot, ] # create a bootstrap sample from original data
a <- lm(waiting ~ eruptions, data=df_boot) # compute linear model
slope_resample[i] <- slope <- a$coefficients[2] # take the slope
}
return(slope_resample) # Return a vector of bootstrapped slopes
}
bootstrapped_slopes <- bootstrap(faithful, 10000)
But how to get geom_line() or geom_smooth() to take the data from bootstrapped_slopes? Any assistance is much appreciated.
EDIT: More direct adaptation from the OP
For plotting, I presume you want both the slopes and the intercepts, so here's a modified bootstrap function:
bootstrap <- function(df, n_resample) {
# Note 2 columns here, one for the intercept and one for the slope
slope_resample <- matrix(NA, nrow = n_resample, ncol = 2) # initialize matrix
index <- 1:nrow(df) # create an index for supplied table
for (i in 1:n_resample) {
index_boot <- sample(index, replace = TRUE) # sample row numbers, with replacement
df_boot <- df[index_boot, ] # create a bootstrap sample from original data
a <- lm(waiting ~ eruptions, data=df_boot) # compute linear model
slope_resample[i, 1] <- a$coefficients[1] # take the intercept
slope_resample[i, 2] <- a$coefficients[2] # take the slope
}
# Return a data frame with all the slopes and intercepts
return(as.data.frame(slope_resample))
}
Then run it and plot the lines from that data frame:
bootstrapped_slopes <- bootstrap(faithful, 10000)
library(dplyr); library(ggplot2)
ggplot(faithful, aes(eruptions, waiting)) +
geom_abline(data = bootstrapped_slopes %>%
sample_n(1000), # 10k lines look about the same as 1k, just darker and slower
aes(slope = V2, intercept = V1), #, group = id),
alpha = 0.01) +
geom_point(shape = 19, color = "red")
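The same data frame also gives the marginal sampling distribution of the slope directly, if that is all you need (a sketch; V2 is the slope column created above):
ggplot(bootstrapped_slopes, aes(x = V2)) +
geom_histogram(bins = 50) +
labs(x = "Bootstrapped slope")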
Alternative solution
This could also be done using modelr and broom to simplify some of the bootstrapping. Based on the main help example for modelr::bootstrap, we can do the following:
library(purrr); library(modelr); library(broom); library(dplyr); library(tidyr) # tidyr provides spread()
set.seed(777)
# Creates bootstrap object with 10k extracts from faithful
boot <- modelr::bootstrap(faithful, 10000)
# Applies the linear regression to each
models <- map(boot$strap, ~ lm(waiting ~ eruptions, data = .))
# Extracts the model results into a tidy format
tidied <- map_df(models, broom::tidy, .id = "id")
# We just need the slope and intercept here
tidied_wide <- tidied %>% select(id, term, estimate) %>% spread(term, estimate)
ggplot(faithful, aes(eruptions, waiting)) +
geom_abline(data = tidied_wide %>%
sample_n(1000), # 10k lines look about the same as 1k, just darker and slower
aes(slope = eruptions, intercept = `(Intercept)`, group = id), alpha = 0.05) +
geom_point(shape = 19, color = "red") # spot_color wasn't provided in OP

LMS (Lambda-Mu-Sigma) method in R

I want to create percentile curves for my data using LMS (Lambda-Mu-Sigma) method. I have following example data. How can 10th, 50th and 90th percentile curves of yvar (on y-axis) vs age (on x-axis) be drawn using LMS?
age = sample(5:75, 500, replace=T)
yvar = rnorm(500, age, 20)
mydata = data.frame(age, yvar)
head(mydata)
age yvar
1 61 87.16011
2 58 49.73289
3 65 15.60212
4 71 83.32699
5 33 40.89592
6 18 25.04376
plot(age, yvar)
I came across the VGAM package (http://www.inside-r.org/packages/cran/VGAM/docs/lms.bcn). Is that the best method to use? I could not really understand its example code well enough to create simple percentile curves from the data above. Thanks for your help.
Simulate data (reproducibly):
set.seed(1001)
mydata <- data.frame(
age = sample(5:75, 500, replace=TRUE))
mydata <- transform(mydata,
yvar = rnorm(500, age, 20))
Since the LMS method is typically based on variants of the Box-Cox transformation, which requires positive values (and the simulated yvar includes negative values), a simpler approach is quantile regression.
library("quantreg")
library("ggplot2"); theme_set(theme_bw())
g0 <- ggplot(mydata,aes(x=age,y=yvar))+geom_point()
g0 + geom_smooth(method="rq",tau=c(0.1),se=FALSE,lty=2)+
geom_smooth(method="rq",tau=c(0.5),se=FALSE)+
geom_smooth(method="rq",tau=c(0.9),se=FALSE,lty=2)
rq() by itself has the capability to fit all three percentiles at the same time, but you need to use the strategy suggested in this blog post to draw them more conveniently:
model.rq <- rq(yvar ~ age, mydata, tau=c(0.1, 0.5, 0.9))
quantile.regressions <- data.frame(t(coef(model.rq)))
colnames(quantile.regressions) <- c("intercept", "slope")
quantile.regressions$quantile <- rownames(quantile.regressions)
g0 + geom_abline(aes(intercept=intercept, slope=slope,
colour=quantile), show.legend=TRUE, data=quantile.regressions)
Alternatively it is possible to do this within VGAM, but I'm not sure whether it's what you want/whether the results make sense or not. The Yeo-Johnson transformation, via lms.yjn, allows you to do this even when some data values are negative, but you might look at ?lms.bcg, ?lms.bcn for alternatives that work for non-negative data.
library("VGAM")
fit <- vgam(yvar ~ s(age, df = 4), lms.yjn, data=mydata,
control=vgam.control(maxit=100),
trace=FALSE)
We get a warning message:
## Warning message:
## In vgam.fit(x = x, y = y, w = w, mf = mf, Xm2 = Xm2, Ym2 = Ym2, :
## convergence not obtained in 100 iterations
This might be because a 4-df spline overfits these data.
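One thing worth trying (a sketch, not verified against this simulation; the plots below still use the original fit) is a smoother model with fewer degrees of freedom and a higher iteration limit:
fit2 <- vgam(yvar ~ s(age, df = 2), lms.yjn, data = mydata,
control = vgam.control(maxit = 200), trace = FALSE)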
Quantile plot (following example("lms.yjn"))
par(bty = "l", mar = c(5, 4, 4, 3) + 0.1, xpd = TRUE)
qtplot(fit, percentiles = c(10, 50, 90),
las = 1, ylab = "yvar", lwd = 2, lcol = 4)
This is a terrible hack, but if you want access to the raw values so you can plot the curves yourself:
pcurves <- qtplot.lmscreg(fit,show.plot=FALSE,
percentiles=c(10,50,90))
vals <- data.frame(age=mydata$age,pcurves$fitted.values)
vals <- vals[order(vals$age),]
matplot(vals$age,vals[,-1],type="l",lty=c(2,1,2),col=1,
xlab="age",ylab="")

R: Determine the threshold that maximally separates two groups based on a continuous variable?

Say I have 200 subjects, 100 in group A and 100 in group B, and for each I measure some continuous parameter.
require(ggplot2)
set.seed(100)
value <- c(rnorm(100, mean = 5, sd = 3), rnorm(100, mean = 10, sd = 3))
group <- c(rep('A', 100), rep('B', 100))
data <- data.frame(value, group)
ggplot(data = data, aes(x = value)) +
geom_bar(aes(color = group))
I would like to determine the value (Threshold? Breakpoint?) that maximizes separation and minimizes misclassification between the groups. Does such a function exist in R?
I've tried searching along the lines of "r breakpoint maximal separation between groups" and "r threshold minimize misclassification," but my google-fu seems to be off today.
EDIT:
Responding to @Thomas's comment, I have tried to fit the data using logistic regression and then solve for the threshold, but I haven't gotten very far.
lr <- glm(group~value)
coef(lr)
# (Intercept) value
# 1.1857435 -0.0911762
So b0 = 1.1857435 and b1 = -0.0911762.
From Wikipedia, F(x) = 1 / (1 + e^-(b0 + b1*x)), and solving for x:
x = (ln(F(x) / (1 - F(x))) - b0) / b1
But trying this in R, I get an obviously incorrect answer:
(log(0.5/(1 - 0.5)) - 1.1857435)/-0.0911762 # 13.00497
A simple approach is to write a function that calculates the accuracy given a threshold:
accuracy = Vectorize(function(th) mean(c("A", "B")[(value > th) + 1] == group))
Then find the maximum using optimize:
optimize(accuracy, c(min(value), max(value)), maximum=TRUE)
# $maximum
# [1] 8.050888
#
# $objective
# [1] 0.86
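As a sanity check, with two equal-sized groups drawn from normal distributions with the same standard deviation, the misclassification-minimizing cutoff is simply the midpoint of the two means, here (5 + 10) / 2 = 7.5, so an estimate in the 7.5 to 8 range is exactly what you would expect from a finite sample.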
I've gotten the answer I need thanks to help from @Thomas and @BenBolker.
Summary
The problem with my attempt at solving it through logistic regression was that I hadn't specified family = binomial.
The dose.p() function in MASS will do the work for me given a glm fit.
Code
# Include libraries
require(ggplot2)
require(MASS)
# Set seed
set.seed(100)
# Put together some dummy data
value <- c(rnorm(100, mean = 5, sd = 3), rnorm(100, mean = 10, sd = 3))
group <- c(rep(0, 100), rep(1, 100))
data <- data.frame(value, group)
# Plot the distribution -- visually
# The answer appears to be b/t 7 and 8
ggplot(data = data, aes(x = value)) +
geom_bar(aes(color = group))
# Fit a glm model, specifying the binomial distribution
my.glm <- glm(group~value, data = data, family = binomial)
b0 <- coef(my.glm)[[1]]
b1 <- coef(my.glm)[[2]]
# See what the probability function looks like
lr <- function(x, b0, b1) {
prob <- 1 / (1 + exp(-1*(b0 + b1*x)))
return(prob)
}
# The line appears to cross 0.5 just above 7.5
x <- 0:12
y <- lr(x, b0, b1)
lr.val <- data.frame(x, y)
ggplot(lr.val, aes(x = x, y = y)) +
geom_line()
# The inverse of this function computes the threshold for a given probability
inv.lr <- function(p, b0, b1) {
x <- (log(p / (1 - p)) - b0)/b1
return(x)
}
# With the betas from this function, we get 7.686814
inv.lr(0.5, b0, b1)
# Or, feeding the glm model into dose.p from MASS, we get the same answer
dose.p(my.glm, p = 0.5)
Thanks, everyone, for your help!

Extract prediction band from lme fit

I have following model
x <- rep(seq(0, 100, by=1), 10)
y <- 15 + 2*rnorm(1010, 10, 4)*x + rnorm(1010, 20, 100)
id <- NULL
for(i in 1:10){ id <- c(id, rep(i,101)) }
dtfr <- data.frame(x=x,y=y, id=id)
library(nlme)
with(dtfr, summary( lme(y~x, random=~1+x|id, na.action=na.omit)))
model.mx <- with(dtfr, (lme(y~x, random=~1+x|id, na.action=na.omit)))
pd <- predict( model.mx, newdata=data.frame(x=0:100), level=0)
with(dtfr, plot(x, y))
lines(0:100, predict(model.mx, newdata=data.frame(x=0:100), level=0), col="darkred", lwd=7)
With predict() and level=0 I can plot the mean population response. How can I extract and plot the 95% confidence intervals / prediction bands from the nlme object for the whole population?
Warning: Read this thread on r-sig-mixed models before doing this. Be very careful when you interpret the resulting prediction band.
From r-sig-mixed models FAQ adjusted to your example:
set.seed(42)
x <- rep(0:100,10)
y <- 15 + 2*rnorm(1010,10,4)*x + rnorm(1010,20,100)
id<-rep(1:10,each=101)
dtfr <- data.frame(x=x ,y=y, id=id)
library(nlme)
model.mx <- lme(y~x,random=~1+x|id,data=dtfr)
#create data.frame with new values for predictors
#more than one predictor is possible
new.dat <- data.frame(x=0:100)
#predict response
new.dat$pred <- predict(model.mx, newdata=new.dat,level=0)
#create design matrix
Designmat <- model.matrix(eval(eval(model.mx$call$fixed)[-2]), new.dat[-ncol(new.dat)])
#compute standard error for predictions
predvar <- diag(Designmat %*% model.mx$varFix %*% t(Designmat))
new.dat$SE <- sqrt(predvar)
new.dat$SE2 <- sqrt(predvar+model.mx$sigma^2)
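Here SE uses only the fixed-effects covariance matrix (model.mx$varFix), so pred ± 2*SE approximates a 95% confidence band for the population-mean response, while SE2 additionally includes the residual variance sigma^2 and therefore approximates a prediction band for individual observations; neither band accounts for uncertainty in the estimated variance components themselves.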
library(ggplot2)
p1 <- ggplot(new.dat,aes(x=x,y=pred)) +
geom_line() +
geom_ribbon(aes(ymin=pred-2*SE2,ymax=pred+2*SE2),alpha=0.2,fill="red") +
geom_ribbon(aes(ymin=pred-2*SE,ymax=pred+2*SE),alpha=0.2,fill="blue") +
geom_point(data=dtfr,aes(x=x,y=y)) +
scale_y_continuous("y")
p1
Sorry for coming back to such an old topic, but this might address a comment here:
it would be nice if some package could provide this functionality
This functionality is included in the ggeffects package when you use type = "re", which then includes the random-effect variances, not only the residual variance (in this particular example the result happens to be the same).
library(nlme)
library(ggeffects)
x <- rep(seq(0, 100, by = 1), 10)
y <- 15 + 2 * rnorm(1010, 10, 4) * x + rnorm(1010, 20, 100)
id <- NULL
for (i in 1:10) {
id <- c(id, rep(i, 101))
}
dtfr <- data.frame(x = x, y = y, id = id)
m <- lme(y ~ x,
random = ~ 1 + x | id,
data = dtfr,
na.action = na.omit)
ggpredict(m, "x") %>% plot(rawdata = T, dot.alpha = 0.2)
ggpredict(m, "x", type = "re") %>% plot(rawdata = T, dot.alpha = 0.2)
Created on 2019-06-18 by the reprex package (v0.3.0)

Resources