How to iteratively run k-means / random forest and compare accuracy in R

original data:
datausage mou revenue calldrop handset2g handset3g smartphone
1 896804.7 2854801 40830.404 27515 7930 19040 20810
2 155932.1 419109 5512.498 5247 2325 2856 3257
3 674983.3 2021183 25252.265 21068 6497 13056 14273
4 522787.2 1303221 14547.380 8865 4693 9439 10746
5 523465.7 1714641 24177.095 25441 8668 12605 14766
6 527062.3 1651303 20153.482 18219 6822 11067 12994
rechargecount rechargesum arpu subscribers
1 4461 235430 197704.10 105822
2 843 39820 34799.21 18210
3 2944 157099 133842.38 71351
4 2278 121697 104681.58 44975
5 2802 144262 133190.55 75860
6 2875 143333 119389.91 63740
I have the following PCA data (pcdffinal), on which I am running k-means clustering:
1 -9.204228 -2.73517110 2.7975063 0.6794614 -0.84627095 0.4455297
2 2.927245 0.05666389 0.5085896 0.1472800 0.18193152 0.1041490
3 -4.667932 -1.98176361 2.2751862 0.5347725 -0.43314927 0.3222719
4 -1.366505 -0.40858595 0.5005192 0.4507366 -0.54996933 0.5533013
5 -4.689454 -2.77185636 2.4323856 0.7387788 0.49237229 -0.4817083
6 -3.477046 -1.84904214 1.5539558 0.5463861 -0.03231143 0.2814843
set.seed(115)
# km: the fitted k-means object (name assumed; any name works)
km <- kmeans(pcdffinal, opt.cluster, nstart = 25)
km$cluster
[1] 2 1 2 2 2 2
barplot(table(km$cluster), col = "steelblue")
km$tot.withinss # total within-cluster sum of squares
[1] 13172.59
We can also use a plot to illustrate the groups that the data have been arranged into.
plot(pcdffinal[,1:2], col = (km$cluster + 1), main = paste('K-Means Clustering result with k = ', opt.cluster, sep = " "), pch = 20, cex = 2)
points(km$centers, pch = 15, cex = 2) # plot the cluster centres as black squares
fviz_cluster(km, data = pcdffinal, frame.type = "convex") + theme_minimal()
# km$cluster is the vector of cluster assignments from kmeans(); add it as a column to the original dataset
df.num_kmeans$cluster.kmeans <- km$cluster
# save this dataset & kmeans model for further use
saveRDS(km, "kmeans_model.RDS")
clusplot(df.num_kmeans, km$cluster, color = TRUE, shade = TRUE, labels = 2, lines = 0)
autoplot(km, data = pcdffinal, frame = TRUE, frame.type = 'norm')
Then I run a random forest model and compute its accuracy:
dfnum.kmeans <- read.csv("dfnum_kmeans.cluster.csv")
table(dfnum.kmeans$cluster.kmeans) # size of each cluster
# convert cluster var into a factor
dfnum.kmeans$cluster.kmeans <- as.factor(dfnum.kmeans$cluster.kmeans)
# create training and test sets (75:25 split) using 'caret' package
set.seed(128) # for reproducibility
inTrain_kmeans <- caret::createDataPartition(y = dfnum.kmeans$cluster.kmeans, p = 0.75, list = FALSE)
training_kmeans <- dfnum.kmeans[inTrain_kmeans, ]
testing_kmeans <- dfnum.kmeans[-inTrain_kmeans, ]
control <- trainControl(method = "repeatedcv", number = 10,allowParallel = TRUE)
modFit.rfcaret_kmeans <- caret::train(cluster.kmeans~ ., method = "rf",data = training_kmeans, trControl = control, number = 25)
pred.test_kmeans = predict(modFit.rfcaret_kmeans, testing_kmeans); confusionMatrix(pred.test_kmeans, testing_kmeans$cluster.kmeans )
confusionMatrix(pred.test_kmeans, testing_kmeans$cluster.kmeans )$overall[1]
I would like to run k-means and random forest for a range of k values, say k = 2:6. For each k, I want to make the plots, save the model, and save the data as a CSV, each in a separate file. Then, for random forest, I would like to import each saved CSV, create training and test sets, fit the model, predict on the test set, and compute the confusion matrix and accuracy.
I need help converting the code above into a loop with the counter i going from 2 to 6.


retrieve selected variables from caret recursive feature elimination (rfe) results

In my working project, I use rfe function from caret package to do recursive feature elimination. I use a toy example to illustrate my point.
rfFuncs$summary <- twoClassSummary
control <- rfeControl(functions=rfFuncs, method="cv", number=10)
results <- rfe(PimaIndiansDiabetes[,1:8], PimaIndiansDiabetes[,9], sizes=c(1:8), rfeControl=control, metric="ROC")
The optimal variables are those that give the highest AUROC during resampling, and they can be retrieved with results$optVariables.
However, what I want to do is use the 'one standard error' rule to select fewer features (code below). The number of variables identified is 4.
# auc that is 1-se from the highest auc
df.results = results$results %>% dplyr::mutate(ROCSE = ROCSD/sqrt(10-1))
idx = which.max(df.results$ROC)
ROC.1se = df.results$ROC[idx] - df.results$ROCSE[idx]
# plot ROC vs feature size
g = ggplot(df.results, aes(x=Variables, y=ROC)) +
geom_errorbar(aes(ymin=ROC-ROCSE, ymax=ROC+ROCSE),
width=.2, alpha=0.4, linetype=1) +
geom_line() +
geom_hline(yintercept = ROC.1se)+
labs(x ="Number of Variables", y = "AUROC")
The number of variables I identified is 4. Now I need to know which four variables. I tried the following:
results$variables %>% filter(Variables==4) %>% distinct(var)
It shows me 5 variables!
Does anyone know how I can retrieve those variables? More generally, I would like to retrieve the selected variables for any subset size.
Thanks a lot in advance!
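The selection step of the one-standard-error rule can be sketched as below. This is only an illustration: the ROC and ROCSD values are copied from the rfe table printed in the answer further down (a different resampling run, seed 1), the standard-error formula is the one from the question, and size.1se is a made-up name. For these particular numbers the rule picks 6 variables rather than the 4 found in the question's own run.

```r
# ROC and ROCSD copied from the printed rfe table in the answer (10-fold CV, seed 1)
df.results <- data.frame(
  Variables = 1:8,
  ROC   = c(0.7250, 0.7842, 0.8004, 0.8139, 0.8164, 0.8263, 0.8314, 0.8316),
  ROCSD = c(0.07300, 0.04690, 0.02823, 0.03210, 0.02850, 0.03310, 0.03075, 0.02359)
)

# Standard error, using the same formula as in the question
df.results$ROCSE <- df.results$ROCSD / sqrt(10 - 1)

# Threshold: best mean ROC minus its standard error
idx <- which.max(df.results$ROC)
ROC.1se <- df.results$ROC[idx] - df.results$ROCSE[idx]

# Smallest subset size whose mean ROC reaches the threshold (hypothetical name)
size.1se <- min(df.results$Variables[df.results$ROC >= ROC.1se])
size.1se
```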
One-line Answer
If you know you want only the best 4 variables from the rfe resampling, this will give you what you are looking for.
results$optVariables[1:4]
# [1] "glucose" "mass" "age" "pregnant"
dplyr Answer
results$variables %>%
  group_by(var) %>%
  summarize(Overall = mean(Overall)) %>%
  arrange(-Overall)
# A tibble: 8 x 2
# var Overall
# <chr> <dbl>
# 1 glucose 34.2
# 2 mass 15.8
# 3 age 12.7
# 4 pregnant 7.92
# 5 pedigree 5.09
# 6 insulin 4.87
# 7 triceps 3.25
# 8 pressure 1.95
Why your attempt gives more than 4 variables
You are filtering 40 rows: 10 folds times the best 4 variables in each fold. The best 4 variables are not always the same in every fold, so distinct() can return more than 4. To get the top 4 variables across the resamples, you need to average their importance across the folds, as the code above does. Even simpler, the variables in optVariables are already sorted in this order, so you can just grab the first 4 (as in my one-line answer). The proof that this is the case takes a bit of digging into the source code (shown below).
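A small base R illustration of that point, using invented importance values for 3 folds (not real rfe output): one fold swaps pregnant for pedigree, so the distinct variable count is 5, while averaging per variable and sorting still gives a clean top 4. The names vars, avg and top4 are hypothetical.

```r
# Toy stand-in for results$variables filtered to Variables == 4 (made-up numbers)
vars <- data.frame(
  var = c("glucose", "mass", "age", "pregnant",
          "glucose", "mass", "age", "pedigree",
          "glucose", "mass", "age", "pregnant"),
  Overall  = c(30, 15, 12, 8,  33, 16, 11, 7,  35, 14, 13, 9),
  Resample = rep(c("Fold1", "Fold2", "Fold3"), each = 4)
)

# Distinct variables across folds: more than 4
length(unique(vars$var))

# Average importance per variable, then sort in decreasing order
avg <- aggregate(Overall ~ var, data = vars, FUN = mean)
avg <- avg[order(avg$Overall, decreasing = TRUE), ]

# Keep the first 4 names
top4 <- as.character(avg$var[1:4])
top4
```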
Details: Digging into the source code
A good first thing to do with objects returned from functions like rfe is to try functions like print, summary, or plot. Often custom methods will exist that will give you very helpful information. For example...
# Run rfe with a random seed
library(dplyr)
library(mlbench)
library(caret)
data(PimaIndiansDiabetes)
rfFuncs$summary <- twoClassSummary
control <- rfeControl(functions=rfFuncs, method="cv", number=10)
set.seed(1)
results <- rfe(PimaIndiansDiabetes[,1:8], PimaIndiansDiabetes[,9], sizes=c(1:8),
               rfeControl=control, metric="ROC")
# The next two lines are identical...
results
print(results)
# Recursive feature selection
# Outer resampling method: Cross-Validated (10 fold)
# Resampling performance over subset size:
# Variables ROC Sens Spec ROCSD SensSD SpecSD Selected
# 1 0.7250 0.870 0.4071 0.07300 0.07134 0.10322
# 2 0.7842 0.840 0.5677 0.04690 0.04989 0.05177
# 3 0.8004 0.824 0.5789 0.02823 0.04695 0.10456
# 4 0.8139 0.842 0.6269 0.03210 0.03458 0.05727
# 5 0.8164 0.844 0.5969 0.02850 0.02951 0.07288
# 6 0.8263 0.836 0.6078 0.03310 0.03978 0.07959
# 7 0.8314 0.844 0.5966 0.03075 0.04502 0.07232
# 8 0.8316 0.860 0.6081 0.02359 0.04522 0.07316 *
# The top 5 variables (out of 8):
# glucose, mass, age, pregnant, pedigree
Hmm, that gives 5 variables, but you said you wanted 4. We can pretty quickly dig into the source code to explore how it is calculating and returning those 5 variables as the top 5 variables.
# Only a snippet code shown below...
# cat("The top ", min(top, x$bestSubset), " variables (out of ",
# x$bestSubset, "):\n ", paste(x$optVariables[1:min(top,
# x$bestSubset)], collapse = ", "), "\n\n", sep = "")
So, basically it is pulling the top 5 variables directly from results$optVariables. How is that getting populated?
# print(caret:::rfe.default)
# Snippet 1 of code...
# bestVar <- rfeControl$functions$selectVar(selectedVars,
# Snippet 2 of code...
# bestSubset = bestSubset, fit = fit, optVariables = bestVar,
Ok, optVariables is getting populated by rfeControl$functions$selectVar.
# Snippet of code...
# list(functions = if (is.null(functions)) caretFuncs else functions,
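From above, we see that the selectVar element of the functions list is being used: caretFuncs$selectVar by default, or rfFuncs$selectVar here, because the control was built with functions=rfFuncs...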
Details: Source code that is populating optVariables
# function (y, size)
# {
# finalImp <- ddply(y[, c("Overall", "var")], .(var), function(x) mean(x$Overall,
# na.rm = TRUE))
# names(finalImp)[2] <- "Overall"
# finalImp <- finalImp[order(finalImp$Overall, decreasing = TRUE),
# ]
# as.character(finalImp$var[1:size])
# }

How can I use SOM algorithm for classification prediction

I would like to see if the SOM algorithm can be used for classification prediction.
I used the code below, but the classification results are far from right. For example, in the test dataset I get many more distinct values than the 3 classes in the training target variable. How can I create a prediction model that is aligned with the training target variable?
training <- sample(nrow(wine), 120)
Xtraining <- scale(wine[training, ])
Xtest <- scale(wine[-training, ],
center = attr(Xtraining, "scaled:center"),
scale = attr(Xtraining, "scaled:scale"))
som_model <- som(Xtraining, grid = somgrid(5, 5, "hexagonal"))
som.prediction$pred <- predict(som_model, newdata = Xtest,
trainX = Xtraining,
trainY = factor(Xtraining$class))
And the result:
[1] 7 7 1 7 1 11 6 2 2 7 7 12 11 11 12 2 7 7 7 1 2 7 2 16 20 24 25 16 13 17 23 22
[33] 24 18 8 22 17 16 22 18 22 22 18 23 22 18 18 13 10 14 15 4 4 14 14 15 15 4
This might help:
SOM is an unsupervised algorithm, so you shouldn't train it on a dataset that contains the class label (if you do, it will need that information to work and will be useless on unlabelled datasets).
The idea is that it "converts" an input numeric vector into a network unit number (try running your code again with a 1 x 3 grid and you'll get the output you expected).
You then need to convert those unit numbers back into the categories you are looking for (that is the key part missing from your code).
The reproducible example below outputs a classical classification accuracy (the share of correct predictions). It includes one implementation option for the "convert back" part missing from your original post.
Note that, for this particular dataset, the model overfits pretty quickly: 3 units give the best results.
library(kohonen) # som(), somgrid()
#Set and scale a training set (-1 to drop the classes)
training <- sample(nrow(wine), 120)
Xtraining <- scale(wine[training, -1])
#Scale a test set (-1 to drop the classes)
Xtest <- scale(wine[-training, -1],
center = attr(Xtraining, "scaled:center"),
scale = attr(Xtraining, "scaled:scale"))
#Set 2D grid resolution
#WARNING: it overfits pretty quickly
#Accuracy is 36% for 1 unit, 63% for 2, 93% for 3, 89% for 4
som_grid <- somgrid(xdim = 1, ydim=3, topo="hexagonal")
#Create a trained model
som_model <- som(Xtraining, som_grid)
#Make a prediction on test data
som.prediction <- predict(som_model, newdata = Xtest)
#Put together original classes and SOM classifications
error.df <- data.frame(real = wine[-training, 1],
predicted = som.prediction$unit.classif)
#Return the category number that has the strongest association with the unit
#number (0 stands for ambiguous)
switch <- sapply(unique(som_model$unit.classif), function(x, df){
  cat <- as.numeric(names(which.max(table(df$real[df$predicted == x]))))
  if (length(cat) < 1) cat <- 0
  return(c(x, cat))
}, df = data.frame(real = wine[training, 1], predicted = som_model$unit.classif))
#Translate unit numbers into classes
error.df$corrected <- apply(error.df, MARGIN = 1, function(x, switch){
  cat <- switch[2, which(switch[1,] == x["predicted"])]
  if (length(cat) < 1) cat <- 0
  return(cat)
}, switch = switch)
#Compute the classification accuracy (share of correct predictions)
sum(error.df$corrected == error.df$real)/length(error.df$real)
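The "convert back" step is essentially a majority vote per SOM unit. A dependency-free sketch on invented unit and class vectors (no kohonen call involved; all names are hypothetical) might look like this, with 0 again standing for a unit that was never seen in training:

```r
# Made-up training data: the SOM unit each sample fell into, and its true class
train_unit  <- c(1, 1, 1, 2, 2, 3, 3, 3)
train_class <- c(1, 1, 2, 2, 2, 3, 3, 1)

# For each unit, keep the most frequent class among its training samples
unit_to_class <- vapply(
  split(train_class, train_unit),
  function(cl) as.numeric(names(which.max(table(cl)))),
  numeric(1)
)

# Made-up test data: unit 4 never occurred in training
test_unit <- c(3, 1, 2, 4)

# Look up the class of each test unit; unseen units become 0
pred <- unname(unit_to_class[as.character(test_unit)])
pred[is.na(pred)] <- 0
pred
```

A named lookup vector avoids the row-wise apply() and makes the "unknown unit" case explicit.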

Assuming that your original dataframe is df.num, the following could save all the files (for different k values) in your working directory:
for (k in 2:6) {
  set.seed(115)
  # km: the fitted k-means object (name assumed; any name works)
  km <- kmeans(pcdffinal, k, nstart = 25)
  print(paste(k, km$tot.withinss)) # total within-cluster sum of squares
  png(paste0('kmeans_proj_', k, '.png'))
  plot(pcdffinal[,1:2], col = (km$cluster + 1), main = paste('K-Means Clustering result with k = ', k, sep = " "), pch = 20, cex = 2)
  points(km$centers, pch = 15, cex = 2) # plot the cluster centres as black squares
  dev.off() # close the device so the PNG is written
  png(paste0('kmeans_fviz_', k, '.png'))
  print(fviz_cluster(km, data = pcdffinal, frame.type = "convex") + theme_minimal())
  dev.off()
  df.num_kmeans <- df.num # start from the original data for each k
  df.num_kmeans$cluster.kmeans <- km$cluster # cluster assignments from kmeans(), added as a column
  saveRDS(km, paste0("kmeans_model_", k, ".RDS"))
  write.csv(df.num_kmeans, paste0("dfnum_kmeans_", k, ".cluster.csv"))
  png(paste0('clusplot_', k, '.png'))
  clusplot(df.num_kmeans, km$cluster, color = TRUE, shade = TRUE, labels = 2, lines = 0)
  dev.off()
  png(paste0('autoplot_', k, '.png'))
  print(autoplot(km, data = pcdffinal, frame = TRUE, frame.type = 'norm'))
  dev.off()
}
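The random forest half of the question could be looped in the same way. The sketch below is only one possible arrangement and makes several assumptions: the numeric iris columns stand in for the real data, files go to tempdir(), 5-fold CV is used to keep it quick, and the randomForest package is installed for method = "rf". Note that write.csv() adds a row-name column by default, which read.csv() would bring back as an extra predictor, so the sketch writes with row.names = FALSE.

```r
library(caret)  # train(), createDataPartition(), confusionMatrix()

dat <- iris[, 1:4]      # stand-in for the real data (assumption)
out_dir <- tempdir()    # hypothetical output location
acc <- numeric(0)       # accuracy per k

for (k in 2:6) {
  # Cluster and save the labelled data, one CSV per k
  set.seed(115)
  km <- kmeans(dat, centers = k, nstart = 25)
  d <- cbind(dat, cluster.kmeans = km$cluster)
  f <- file.path(out_dir, paste0("dfnum_kmeans_", k, ".cluster.csv"))
  write.csv(d, f, row.names = FALSE)

  # Re-import, then split 75:25 stratified on the cluster label
  d <- read.csv(f)
  d$cluster.kmeans <- as.factor(d$cluster.kmeans)
  set.seed(128)
  in_train <- createDataPartition(d$cluster.kmeans, p = 0.75, list = FALSE)

  # Fit a random forest on the training part and score the test part
  fit <- train(cluster.kmeans ~ ., data = d[in_train, ], method = "rf",
               trControl = trainControl(method = "cv", number = 5))
  pred <- predict(fit, d[-in_train, ])
  cm <- confusionMatrix(pred, d$cluster.kmeans[-in_train])
  acc[paste0("k=", k)] <- cm$overall["Accuracy"]
}
print(acc)
```

Collecting the accuracies in one named vector makes it easy to compare the k values side by side afterwards.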

How to read the indexes from the prediction output of predict.ranger, R

Using the ranger package I run the following script:
rf <- ranger(Surv(time, Y) ~ ., data = train_frame[1:50000, ], write.forest = TRUE, num.trees = 100)
test_frame <- train_frame[50001:100000, ]
preds <- predict(rf, test_frame)
chfs <- preds$chf
plot(chfs[1, ])
The cumulative hazard function has indexes 1 - 36 on the X-axis. Obviously this corresponds with time, but I'm not sure how: my time of observation variable ranges from a minimum of 0 to a maximum of 399. What is the mapping between the original data and the predicted output from predict.ranger, and how can I operationalize this to quantify degree of risk for a given subject after a given length of time?
Here's a sample of what my time/event data looks like:
Y time
<int> <dbl>
1 1 358
2 0 90
3 0 162
4 0 35
5 0 307
6 0 69
7 0 184
8 0 24
9 0 366
10 0 33
And here's what the CHF of the first subject looks like:
Can anyone help me connect the dots? There are no row or columns names on the "matrix" object that is preds$chf.
The prediction object contains a vector called unique.death.times, which holds the time points at which the CHF and survival estimates are computed. The chf matrix has observations in the rows and these time points in the columns; the same holds for survival.
Reproducible example:
library(survival) # Surv() and the veteran data
library(ranger)
## Split the data
n <- nrow(veteran)
idx <- sample(n, 2/3*n)
train <- veteran[idx, ]
test <- veteran[-idx, ]
## Grow RF and predict
rf <- ranger(Surv(time, status) ~ ., train, write.forest = TRUE)
preds <- predict(rf, test)
## Example CHF plot
plot(preds$unique.death.times, preds$chf[1, ])
## Example survival plot
plot(preds$unique.death.times, preds$survival[1, ])
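To read off the cumulative hazard of each subject at an arbitrary follow-up time, one option is to treat each row of the matrix as a step function over the event times and take the last column whose time does not exceed the requested one. The sketch below uses invented stand-ins for preds$unique.death.times and preds$chf, and chf_at is a hypothetical helper, not part of ranger.

```r
# Made-up stand-ins for preds$unique.death.times and preds$chf (2 subjects)
death_times <- c(3, 10, 25, 80, 200, 399)
chf <- rbind(c(0.01, 0.05, 0.12, 0.30, 0.65, 0.90),
             c(0.02, 0.09, 0.20, 0.45, 0.80, 1.10))

# Hypothetical helper: CHF of every subject at time t, read as a step function
chf_at <- function(chf, times, t) {
  j <- findInterval(t, times)  # index of the last event time <= t, 0 if none
  if (j == 0) return(rep(0, nrow(chf)))  # before the first event time
  chf[, j]
}

chf_at(chf, death_times, 100)  # uses the column for time 80
chf_at(chf, death_times, 1)    # before any event time
```

Comparing these values across subjects at a fixed t gives a simple ranking of risk at that follow-up length.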

Simulating an interaction effect in a lmer() model in R

Is there an R package with a function that can:
(1) simulate the different values of an interaction variable,
(2) plot a graph that demonstrates the effect of the interaction on Y for different values of the terms in interaction, and
(3) works well with the models fitted with the lmer() function of the lme4 package?
I have looked in arm, ez, coefplot2, and fanovaGraph packages, but could not find what I was looking for.
I'm not sure about a package, but you can simulate data while varying the terms in the interaction, and then graph it. Here is an example for a treatment-by-wave (i.e. longitudinal) interaction and the syntax to plot. I think the story behind the example is a treatment to improve oral reading fluency in school-age children. The size of the interaction effect is changed through the bX argument.
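library(lme4)
sim1 <- function (b0=50, bGrowth=4.672, bX=15, b01=.770413, b11=.005, Vint=771, Vslope=2.24, Verror=40.34) {
  #observation ID
  oID <- 1:231
  #participant ID
  ID <- rep(1:77, each=3)
  #treatment indicator per participant (assumed: random 50/50 assignment)
  ITT <- sample(0:1, 77, replace=TRUE)[ID]
  #longitudinal wave: for example 0, 4, and 7 months after treatment
  wave <- rep(c(0,4,7), 77)
  #participant-level covariate
  bvaset <- rnorm(77, 0, 11.58)
  bva <- bvaset[ID]
  #random effect intercept (name b.int assumed)
  b.int <- rnorm(77, 0, sqrt(Vint))
  #random effect for slope (name b.slope assumed)
  b.slope <- rnorm(77, 0, sqrt(Vslope))
  #observation level error
  eps <- rnorm(3*77, 0, sqrt(Verror))
  #Create Outcome as product of specified model
  ORFset <- b0 + b01*bva + bGrowth*wave + bX*ITT*wave + b.int[ID] + b.slope[ID]*wave + eps[oID]
  #if else statement to eliminate ORF values below 0
  ORF <- ifelse(ORFset < 0, 0, ORFset)
  #Put into a data frame
  mydata <- data.frame(oID, ID, ITT, wave, ORF, bva, b.int[ID], b.slope[ID], eps)
  #run the model (formula assumed from the data-generating equation)
  fit1 <- lmer(ORF ~ 1 + wave + ITT + wave:ITT + (1 + wave | ID), data=mydata)
  #grab variance components
  vc <- VarCorr(fit1)
  #Select Tau and Sigma to select in the out object
  varcomps <- c(unlist(lapply(vc, diag)), attr(vc, "sc")^2)
  #Produce object to output
  out <- c(coef(summary(fit1))[4,"t value"], coef(summary(fit1))[4,"Estimate"], as.numeric(varcomps[2]), varcomps[3])
  #out holds T Value, Estimate of Effect, Tau, Sigma Squared
  #return the simulated data (assumed, since the result is assigned to mydata below)
  mydata
}
mydata <- sim1(b0=50, bGrowth=4.672, bX=1.25, b01=.770413, b11=.005, Vint=771, Vslope=2.24, Verror=40.34)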
Try plotLMER.fnc() from the languageR package, or the effects package.
The merTools package has some functionality to make this easier, though it only applies to working with lmer and glmer objects. Here's how you might do it:
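library(lme4)     # lmer() and the InstEval data
library(merTools) # draw(), wiggle(), predictInterval()
library(ggplot2)
# fit an interaction model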
m1 <- lmer(y ~ studage * service + (1|d) + (1|s), data = InstEval)
# select an average observation from the model frame
examp <- draw(m1, "average")
# create a modified data.frame by changing one value
simCase <- wiggle(examp, var = "service", values = c(0, 1))
# modify again for the studage variable
simCase <- wiggle(simCase, var = "studage", values = c(2, 4, 6, 8))
After this, we have our simulated data which looks like:
y studage service d s
1 3.205745 2 0 761 564
2 3.205745 2 1 761 564
3 3.205745 4 0 761 564
4 3.205745 4 1 761 564
5 3.205745 6 0 761 564
6 3.205745 6 1 761 564
7 3.205745 8 0 761 564
8 3.205745 8 1 761 564
Next, we need to generate prediction intervals, which we can do with merTools::predictInterval (or, without intervals, you could call predict() on the lme4 model)
preds <- predictInterval(m1, level = 0.9, newdata = simCase)
Now we get a preds object, which is a 3 column data.frame:
fit lwr upr
1 3.312390 1.2948130 5.251558
2 3.263301 1.1996693 5.362962
3 3.412936 1.3096006 5.244776
4 3.027135 1.1138965 4.972449
5 3.263416 0.6324732 5.257844
6 3.370330 0.9802323 5.073362
7 3.410260 1.3721760 5.280458
8 2.947482 1.3958538 5.136692
We can then put it all together to plot:
plotdf <- cbind(simCase, preds)
ggplot(plotdf, aes(x = service, y = fit, ymin = lwr, ymax = upr)) +
geom_pointrange() + facet_wrap(~studage) + theme_bw()
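If merTools is not available, a similar picture can be built with lme4 alone by predicting over a grid of the interacting variables with the random effects switched off. The sketch below is an illustration on a small simulated data set, not on InstEval; the variable names (id, wave, treat, y) and all parameter values are invented.

```r
library(lme4)

# Small simulated longitudinal data set: 30 participants, 3 waves each
set.seed(1)
d <- data.frame(id = factor(rep(1:30, each = 3)),
                wave = rep(c(0, 4, 7), 30))
d$treat <- rep(rbinom(30, 1, 0.5), each = 3)  # treatment per participant
u <- rnorm(30, 0, 5)                          # random intercepts
d$y <- 50 + 4 * d$wave + 1.5 * d$treat * d$wave +
  u[as.integer(d$id)] + rnorm(90, 0, 3)

# Fit a model with a wave-by-treatment interaction
fit <- lmer(y ~ wave * treat + (1 | id), data = d)

# Predict over every combination of the two interacting variables,
# using fixed effects only (re.form = NA)
grid <- expand.grid(wave = c(0, 4, 7), treat = c(0, 1))
grid$fit <- predict(fit, newdata = grid, re.form = NA)
grid

# One line per treatment group shows how the slope over wave differs
interaction.plot(grid$wave, grid$treat, grid$fit,
                 xlab = "wave", ylab = "predicted y", trace.label = "treat")
```

This gives point predictions only; intervals would still need something like predictInterval or a bootstrap.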
