Retrieve selected variables from caret recursive feature elimination (rfe) results in R

In my project, I use the rfe function from the caret package to do recursive feature elimination. I use a toy example below to illustrate my point.
library(mlbench)
library(caret)
data(PimaIndiansDiabetes)
rfFuncs$summary <- twoClassSummary
control <- rfeControl(functions=rfFuncs, method="cv", number=10)
results <- rfe(PimaIndiansDiabetes[,1:8], PimaIndiansDiabetes[,9], sizes=c(1:8), rfeControl=control, metric="ROC")
The optimal subset of variables is the one that gives the highest AUROC during the process, and it can be retrieved with results$optVariables.
However, what I want to do is use the '1 standard error rule' to select fewer features (code below). The number of variables identified this way is 4.
# auc that is 1 se below the highest auc
library(dplyr)
library(ggplot2)
df.results = results$results %>% dplyr::mutate(ROCSE = ROCSD/sqrt(10-1))
idx = which.max(df.results$ROC)
ROC.1se = df.results$ROC[idx] - df.results$ROCSE[idx]
# plot ROC vs feature size
g = ggplot(df.results, aes(x=Variables, y=ROC)) +
  geom_errorbar(aes(ymin=ROC-ROCSE, ymax=ROC+ROCSE),
                width=.2, alpha=0.4, linetype=1) +
  geom_line() +
  geom_point() +
  scale_color_brewer(palette="Paired") +
  geom_hline(yintercept = ROC.1se) +
  labs(x ="Number of Variables", y = "AUROC")
print(g)
The number of variables identified is 4. Now I need to know which four variables they are. I tried the following:
results$variables %>% filter(Variables==4) %>% distinct(var)
It shows me 5 variables!
Does anyone know how I can retrieve those variables? Ideally the approach should work for any number of selected variables, not just 4.
Thanks a lot in advance!

One-line Answer
If you know you want only the best 4 variables from the rfe resampling, this will give you what you are looking for.
results$optVariables[1:4]
# [1] "glucose" "mass" "age" "pregnant"
dplyr Answer
results$variables %>%
  group_by(var) %>%
  summarize(Overall = mean(Overall)) %>%
  arrange(-Overall)
# A tibble: 8 x 2
#   var      Overall
#   <chr>      <dbl>
# 1 glucose    34.2
# 2 mass       15.8
# 3 age        12.7
# 4 pregnant    7.92
# 5 pedigree    5.09
# 6 insulin     4.87
# 7 triceps     3.25
# 8 pressure    1.95
Why your attempt gives more than 4 variables
You are filtering 40 rows: 10 folds times the best 4 variables per fold. The best 4 variables are not always the same within each fold, which is why distinct(var) returns 5. Hence, to get the top 4 variables across the resamples you need to average their importance across the folds, as the code above does. Even simpler, the variables within optVariables are already sorted in this order, so you can just grab the first 4 (as in my one-line answer). Proving that this is the case takes a bit of digging into the source code (shown below).
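As a quick check, here is a hedged sketch (not part of the original answer) that demonstrates both points on the results object above: the filter really does return 10 folds x 4 variables = 40 rows, and averaging importance within that subset recovers four variables.
library(dplyr)
# Row count: one block of 4 variables per resample (exact count depends on your rfeControl)
results$variables %>% filter(Variables == 4) %>% nrow()
# Average importance across folds for the 4-variable subset, then keep the top 4;
# this should normally agree with results$optVariables[1:4]
results$variables %>%
  filter(Variables == 4) %>%
  group_by(var) %>%
  summarize(Overall = mean(Overall)) %>%
  arrange(desc(Overall)) %>%
  slice(1:4) %>%
  pull(var)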
Details: Digging into the source code
A good first thing to do with objects returned from functions like rfe is to try functions like print, summary, or plot. Often custom methods will exist that will give you very helpful information. For example...
# Run rfe with a random seed
# library(dplyr)
# library(mlbench)
# library(caret)
# data(PimaIndiansDiabetes)
# rfFuncs$summary <- twoClassSummary
# control <- rfeControl(functions=rfFuncs, method="cv", number=10)
# set.seed(1)
# results <- rfe(PimaIndiansDiabetes[,1:8], PimaIndiansDiabetes[,9], sizes=c(1:8),
# rfeControl=control, metric="ROC")
#
# The next two lines are identical...
results
print(results)
# Recursive feature selection
#
# Outer resampling method: Cross-Validated (10 fold)
#
# Resampling performance over subset size:
#
# Variables ROC Sens Spec ROCSD SensSD SpecSD Selected
# 1 0.7250 0.870 0.4071 0.07300 0.07134 0.10322
# 2 0.7842 0.840 0.5677 0.04690 0.04989 0.05177
# 3 0.8004 0.824 0.5789 0.02823 0.04695 0.10456
# 4 0.8139 0.842 0.6269 0.03210 0.03458 0.05727
# 5 0.8164 0.844 0.5969 0.02850 0.02951 0.07288
# 6 0.8263 0.836 0.6078 0.03310 0.03978 0.07959
# 7 0.8314 0.844 0.5966 0.03075 0.04502 0.07232
# 8 0.8316 0.860 0.6081 0.02359 0.04522 0.07316 *
#
# The top 5 variables (out of 8):
# glucose, mass, age, pregnant, pedigree
Hmm, that gives 5 variables, but you said you wanted 4. We can pretty quickly dig into the source code to explore how it is calculating and returning those 5 variables as the top 5 variables.
print(caret:::print.rfe)
#
# Only a snippet of the code is shown below...
# cat("The top ", min(top, x$bestSubset), " variables (out of ",
# x$bestSubset, "):\n ", paste(x$optVariables[1:min(top,
# x$bestSubset)], collapse = ", "), "\n\n", sep = "")
So, basically it is pulling the top 5 variables directly from results$optVariables. How is that getting populated?
# print(caret:::rfe.default)
#
# Snippet 1 of code...
# bestVar <- rfeControl$functions$selectVar(selectedVars, bestSubset)
#
# Snippet 2 of code...
# bestSubset = bestSubset, fit = fit, optVariables = bestVar,
Ok, optVariables is getting populated by rfeControl$functions$selectVar.
print(rfeControl)
#
# Snippet of code...
# list(functions = if (is.null(functions)) caretFuncs else functions,
From above, we see that caretFuncs$selectVar is being used...
Details: Source code that is populating optVariables
print(caretFuncs$selectVar)
# function (y, size)
# {
# finalImp <- ddply(y[, c("Overall", "var")], .(var), function(x) mean(x$Overall,
# na.rm = TRUE))
# names(finalImp)[2] <- "Overall"
# finalImp <- finalImp[order(finalImp$Overall, decreasing = TRUE),
# ]
# as.character(finalImp$var[1:size])
# }
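Because the body of selectVar shown above only needs a data frame with Overall and var columns plus a size, you can also call it yourself on the resampled importances. This is a sketch of mine, not something from the original post, and it may not reproduce exactly which rows rfe passed to selectVar internally; it simply mirrors the averaging in the dplyr answer.
# Average importance over all resamples/subsets and return the top 4 variable names
caretFuncs$selectVar(results$variables, 4)
# Compare with the one-line answer
results$optVariables[1:4]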

Related

Extracting statistical values from a list with multiple lists of statistical test results

I did a Ljung-Box test for independence in R with 36 lags and stored the results in a list.
box.test.list <- list()
for (lag in c(1:36)){
  box.test.list[[lag]] <- Box.test(btcr, type = "Ljung", lag)
}
I want to extract the p-values as well as the test statistic (X-squared) and print them out to look something like:
X-squared = 100, p-value = 0.0001
I also want to pull out the p-values individually, but rather than just spitting out numbers, I want something like:
[1] p-value = 0.001
[2] p-value = 0.0001
and so on. Can this be done?
With the test data
set.seed(7)
btcr <- rnorm(100)
you can perform all your tests with
box.test.list <- lapply(1:36, function(i) Box.test(btcr, type = "Ljung", i))
and then put all the results in a data.frame with
results <- data.frame(
lag = 1:36,
xsquared = sapply(box.test.list, "[[", "statistic"),
pvalue = sapply(box.test.list, "[[", "p.value")
)
Then you can do what you like with the results
head(results)
# lag xsquared pvalue
# 1 1 3.659102 0.05576369
# 2 2 7.868083 0.01956444
# 3 3 8.822760 0.03174261
# 4 4 9.654935 0.04665920
# 5 5 11.190969 0.04772238
# 6 6 12.607454 0.04971085
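If you also want the formatted lines from the question, here is a small sketch (not part of the original answer) that uses sprintf on the results data frame built above:
# One formatted line per lag, e.g. "Lag  1: X-squared = 3.659, p-value = 0.05576"
cat(sprintf("Lag %2d: X-squared = %.3f, p-value = %.5f",
            results$lag, results$xsquared, results$pvalue),
    sep = "\n")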

Create many ROC curves in R?

I have 150 columns of scores against 1 column of label (1/0).
My goal is to create 150 AUC scores.
Here is a manual example:
auc(roc(df$label, df$col1)),
auc(roc(df$label, df$col2)),
...
I can use Map/sapply/lapply here, but is there any other method or function?
This is a bit of an XY question. What you actually want to achieve is speed up your calculation. gfgm's answer answers it with parallelization, but that's only one way to go.
If, as I assume, you are using library(pROC)'s roc/auc functions, you can gain even more speed by selecting the appropriate algorithm for your dataset.
pROC comes with essentially two algorithms that scale very differently depending on the characteristics of your data set. You can benchmark which one is the fastest by passing algorithm=0 to roc:
# generate some toy data
label <- rbinom(600000, 1, 0.5)
score <- rpois(600000, 10)
library(pROC)
roc(label, score, algorithm=0)
Starting benchmark of algorithms 2 and 3, 10 iterations...
expr min lq mean median uq max neval
2 2 4805.58762 5827.75410 5910.40251 6036.52975 6085.8416 6620.733 10
3 3 98.46237 99.05378 99.52434 99.12077 100.0773 101.363 10
Selecting algorithm 3.
Here we select algorithm 3, which shines when the number of thresholds remains low. But if 600000 data points take 5 minutes to compute I strongly suspect that your data is very continuous (no measurements with identical values) and that you have about as many thresholds as data points (600000). In this case you can skip directly to algorithm 2 which scales much better as the number of thresholds in the ROC curve increases.
You can then run:
auc(roc(df$label, df$col1, algorithm=2)),
auc(roc(df$label, df$col2, algorithm=2)),
On my machine each call to roc now takes about 5 seconds, pretty independently of the number of thresholds. This way you should be done in under 15 minutes total. Unless you have 50 cores or more this is going to be faster than just parallelizing. But of course you can do both...
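To avoid writing one call per column, here is a hedged sketch (the column layout, a label column plus 150 score columns, is assumed from the question) that applies the same roc call with algorithm=2 across the whole data frame:
# Compute all AUCs with algorithm = 2, one value per score column
score_cols <- setdiff(names(df), "label")
aucs <- sapply(score_cols, function(cn) auc(roc(df$label, df[[cn]], algorithm = 2)))
head(aucs)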
If you want to parallelize the computations you could do it like this:
# generate some toy data
label <- rbinom(1000, 1, .5)
scores <- matrix(runif(1000*150), ncol = 150)
df <- data.frame(label, scores)
library(pROC)
library(parallel)
auc(roc(df$label, df$X1))
#> Area under the curve: 0.5103
auc_res <- mclapply(df[, 2:ncol(df)], function(col) auc(roc(df$label, col)))
head(auc_res)
#> $X1
#> Area under the curve: 0.5103
#>
#> $X2
#> Area under the curve: 0.5235
#>
#> $X3
#> Area under the curve: 0.5181
#>
#> $X4
#> Area under the curve: 0.5119
#>
#> $X5
#> Area under the curve: 0.5083
#>
#> $X6
#> Area under the curve: 0.5159
Since most of the computational time seems to be the call to auc(roc(...)) this should speed things up if you have a multi-core machine.
There's a function for doing that in the cutpointr package. It also calculates cutpoints and other metrics, but you can discard those. By default it will try all columns except the response column as predictors. Additionally, the direction of the ROC curve (whether larger values imply the positive class or the other way around) is determined automatically if you leave out direction, or you can set it manually.
dat <- iris[1:100, ]
library(tidyverse)
library(cutpointr)
mc <- multi_cutpointr(data = dat, class = "Species", pos_class = "versicolor",
silent = FALSE)
mc %>% select(variable, direction, AUC)
# A tibble: 4 x 3
variable direction AUC
<chr> <chr> <dbl>
1 Sepal.Length >= 0.933
2 Sepal.Width <= 0.925
3 Petal.Length >= 1.00
4 Petal.Width >= 1.00
By the way, the runtime shouldn't be a problem here because calculating the ROC-curve (even including a cutpoint) takes less than a second for one variable and one million observations using cutpointr or ROCR, so your task runs in about one or two minutes.
If memory is the limiting factor, parallelization will probably make that problem worse. If the above solution takes up too much memory, because it returns ROC-curves for all variables before dropping those columns, you can try selecting the columns of interest right away in a call to map:
# 600,000 observations for 150 variables and a binary outcome
predictors <- matrix(data = rnorm(150 * 6e5), ncol = 150)
dat <- as.data.frame(cbind(y = sample(0:1, size = 6e5, replace = T), predictors))
library(cutpointr)
library(tidyverse)
vars <- colnames(dat)[colnames(dat) != "y"]
result <- map_df(vars, function(coln) {
cutpointr_(dat, x = coln, class = "y", silent = TRUE, pos_class = 1) %>%
select(direction, AUC) %>%
mutate(variable = coln)
})
result
# A tibble: 150 x 3
direction AUC variable
<chr> <dbl> <chr>
1 >= 0.500 V2
2 <= 0.501 V3
3 >= 0.501 V4
4 >= 0.501 V5
5 <= 0.501 V6
6 <= 0.500 V7
7 <= 0.500 V8
8 >= 0.502 V9
9 >= 0.501 V10
10 <= 0.500 V11
# ... with 140 more rows

Carrying out a PBIB.test

I have a data set from an incomplete lattice design study that I imported into R from Excel, and I would like to run PBIB.test. However, when I run the function as shown below, the output says object 'Area' not found, even after repeated attempts.
library("agricolae", lib.loc = "~/R/win-library/3.3")
Rdata2 <- PBIB.test("BlockNo", "AccNo", "Rep", Area, k = 9, c("REML"), console = TRUE)
Error in data.frame(v1 = 1, y) : object 'Area' not found
What is the problem?
The error most likely means that Area is not an object R can find: PBIB.test expects the block, treatment, replication and response as vectors (there is no data argument), so the response has to be reachable, for example as yourdata$Area or via with(yourdata, ...). See below for a sample application of PBIB.test, based on the agricolae tutorial, which passes the columns that way.
First, create some sample data.
# Construct the alpha design with 30 treatments, 2 repetitions, and block size = 3
Genotype <- c(paste("gen0", 1:9, sep= ""), paste("gen", 10:30, sep= ""));
r <- 2;
k <- 3;
s <- 10;
b <- s * r;
book <- design.alpha(Genotype, k, r,seed = 5);
# Source dataframe
df <- book$book;
Create a vector of response values.
# Response variable
response <- c(
5,2,7,6,4,9,7,6,7,9,6,2,1,1,3,2,4,6,7,9,8,7,6,4,3,2,2,1,1,2,
1,1,2,4,5,6,7,8,6,5,4,3,1,1,2,5,4,2,7,6,6,5,6,4,5,7,6,5,5,4);
Run PBIB.test
model <- with(df, PBIB.test(block, Genotype, replication, response, k = 3, method="REML"))
head(model);
#$ANOVA
#Analysis of Variance Table
#
#Response: yield
# Df Sum Sq Mean Sq F value Pr(>F)
#Genotype 29 72.006 2.4830 1.2396 0.3668
#Residuals 11 22.034 2.0031
#
#$method
#[1] "Residual (restricted) maximum likelihood"
#
#$parameters
# test name.t treatments blockSize blocks r alpha
# PBIB-lsd Genotype 30 3 10 2 0.05
#
#$statistics
# Efficiency Mean CV
# 0.6170213 4.533333 31.22004
#
#$model
#Linear mixed-effects model fit by REML
# Data: NULL
# Log-restricted-likelihood: -73.82968
# Fixed: y ~ trt.adj
# (Intercept) trt.adjgen02 trt.adjgen03 trt.adjgen04 trt.adjgen05 trt.adjgen06
# 6.5047533 -3.6252940 -0.7701618 -2.5264354 -3.1633495 -1.9413054
#trt.adjgen07 trt.adjgen08 trt.adjgen09 trt.adjgen10 trt.adjgen11 trt.adjgen12
# -3.0096514 -4.0648738 -3.5051139 -2.8765561 -1.7111335 -1.6308755
#trt.adjgen13 trt.adjgen14 trt.adjgen15 trt.adjgen16 trt.adjgen17 trt.adjgen18
# -2.2187974 -2.3393290 -2.0807215 -0.3122845 -3.4526453 -1.0320169
#trt.adjgen19 trt.adjgen20 trt.adjgen21 trt.adjgen22 trt.adjgen23 trt.adjgen24
# -3.1257616 0.2101325 -1.7632411 -1.9177848 -1.0500345 -2.5612960
#trt.adjgen25 trt.adjgen26 trt.adjgen27 trt.adjgen28 trt.adjgen29 trt.adjgen30
# -4.3184716 -2.3071359 1.2239927 -1.3643068 -1.4354599 -0.4726870
#
#Random effects:
# Formula: ~1 | replication
# (Intercept)
#StdDev: 8.969587e-05
#
# Formula: ~1 | block.adj %in% replication
# (Intercept) Residual
#StdDev: 1.683459 1.415308
#
#Number of Observations: 60
#Number of Groups:
# replication block.adj %in% replication
# 2 20
#
#$Fstat
# Fit Statistics
#AIC 213.65937
#BIC 259.89888
#-2 Res Log Likelihood -73.82968
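Applied back to the original call, here is a hedged sketch of the fix; the data frame name mydata is an assumption, standing in for whatever the Excel import was called:
# 'mydata' is assumed to be the data frame imported from Excel containing
# BlockNo, AccNo, Rep and Area; with() makes those columns visible to PBIB.test
Rdata2 <- with(mydata, PBIB.test(BlockNo, AccNo, Rep, Area, k = 9, method = "REML", console = TRUE))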

Performing lm() and segmented() on multiple columns in R

I am trying to perform lm() and segmented() in R using the same independent variable (x) and multiple dependent response variables (Curve1, Curve2, etc.) one by one. I wish to extract the estimated break point and model coefficients for each response variable. I include an example of my data below.
x Curve1 Curve2 Curve3
1 -0.236422 98.8169 95.6828 101.7910
2 -0.198083 98.3260 95.4185 101.5170
3 -0.121406 97.3442 94.8899 100.9690
4 0.875399 84.5815 88.0176 93.8424
5 0.913738 84.1139 87.7533 93.5683
6 1.795530 73.3582 78.1278 82.9956
7 1.833870 72.8905 77.7093 82.7039
8 1.872200 72.4229 77.3505 82.4123
9 2.907350 59.2070 67.6652 74.5374
10 3.865810 46.4807 58.5158 65.0220
11 3.904150 45.9716 58.1498 64.7121
12 3.942490 45.4626 57.8099 64.4022
13 4.939300 33.3040 48.9742 56.3451
14 4.977640 32.9641 48.6344 56.0352
15 5.936100 24.4682 36.4758 47.0485
16 5.936100 24.4682 36.4758 47.0485
17 6.012780 23.7885 35.9667 46.5002
18 6.971250 20.7387 29.6035 39.6476
19 7.009580 20.6167 29.3490 39.3930
20 8.006390 18.7209 22.7313 32.7753
21 8.121410 18.5022 22.3914 32.1292
22 9.041530 16.4722 19.6728 26.9604
23 9.079870 16.3877 19.5595 26.7450
I am able to do this one curve at a time using the code below. However, my full data set has over 1000 curves, so I would like to repeat this over every column. I have not been successful looping it over the columns, so if anyone could show me how to do something like that and create a summary data frame similar to the one generated by the code below, but with every column included, I would be extremely grateful. Thanks!
model <- lm(Curve1~x, dat) # Linear model
seg_model <- segmented(model, seg.Z = ~x) # Segmented model
breakpoint <- as.matrix(seg_model$psi.history[[5]]) # Extract breakpoint
coefficients <- as.matrix(seg_model$coefficients) # Extract coefficients
summary_curve1 <- as.data.frame(rbind(breakpoint, coefficients)) # combine breakpoint and coefficients
colnames(summary_curve1) <- "Curve_1" # header name
summary_curve1 # display summary
Here's an approach using tidyverse and broom to return a data frame containing the results for each Curve column:
library(broom)
library(segmented)
library(tidyverse)
model.results = setNames(names(dat[,-1]), names(dat[,-1])) %>%
  map(~ lm(paste0(.x, " ~ x"), data=dat) %>%
        segmented(seg.Z=~x) %>%
        list(model=tidy(.),
             psi=data.frame(term="breakpoint", estimate=.[["psi.history"]][[5]]))) %>%
  map_df(~.[2:3] %>% bind_rows, .id="Curve")
model.results
Curve term estimate std.error statistic p.value
1 Curve1 (Intercept) 95.866127 0.14972382 640.286416 1.212599e-42
2 Curve1 x -12.691455 0.05220412 -243.112130 1.184191e-34
3 Curve1 U1.x 10.185816 0.11080880 91.922447 1.233602e-26
4 Curve1 psi1.x 0.000000 0.02821843 0.000000 1.000000e+00
5 Curve1 breakpoint 5.595706 NA NA NA
6 Curve2 (Intercept) 94.826309 0.45750667 207.267599 2.450058e-33
7 Curve2 x -9.489342 0.11156425 -85.057193 5.372730e-26
8 Curve2 U1.x 6.532312 1.17332640 5.567344 2.275438e-05
9 Curve2 psi1.x 0.000000 0.23845241 0.000000 1.000000e+00
10 Curve2 breakpoint 7.412087 NA NA NA
11 Curve3 (Intercept) 100.027990 0.29453941 339.608175 2.069087e-37
12 Curve3 x -8.931163 0.08154534 -109.523900 4.447569e-28
13 Curve3 U1.x 2.807215 0.36046013 7.787865 2.492325e-07
14 Curve3 psi1.x 0.000000 0.26319757 0.000000 1.000000e+00
15 Curve3 breakpoint 6.362132 NA NA NA
You can wrap the whole thing in a function, taking as the arguments the column name and the data, and use lapply on the column names, like this:
library(segmented)
run_mod <- function(varname, data){
  data$Y <- data[,varname]
  model <- lm(Y ~ x, data)                            # Linear model
  seg_model <- segmented(model, seg.Z = ~x)           # Segmented model
  breakpoint <- as.matrix(seg_model$psi.history[[5]]) # Extract breakpoint
  coefficients <- as.matrix(seg_model$coefficients)   # Extract coefficients
  summary_curve1 <- as.data.frame(rbind(breakpoint, coefficients))
  colnames(summary_curve1) <- varname
  return(summary_curve1)
}
lapply(names(dat)[2:ncol(dat)], function(x)run_mod(x, dat))
Which gives the summary for each fitted curve (not sure which output you actually want).
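If you would rather have one combined data frame than a list, here is a short hedged follow-up (not part of the original answer), assuming each curve yields the same rows (one breakpoint plus the model coefficients):
summaries <- lapply(names(dat)[2:ncol(dat)], function(x) run_mod(x, dat))
summary_df <- do.call(cbind, summaries)  # one column per curve
summary_df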
I had the same issue and am trying to adapt the suggested answer, but I get the following error:
Error in model.frame.default(formula = Y ~ Prof, data = data, drop.unused.levels = TRUE) :
invalid type (list) for variable 'Y'
I run this code:
run_mod <- function(varname, data){
  data$Y <- data[,varname]
  model <- lm(Y ~ Prof, data)                         # Linear model
  seg_model <- segmented(model, seg.Z = ~ Prof)       # Segmented model
  breakpoint <- as.matrix(seg_model$psi.history[[5]]) # Extract breakpoint
  coefficients <- as.matrix(seg_model$coefficients)   # Extract coefficients
  summary_curve1 <- as.data.frame(rbind(breakpoint, coefficients))
  colnames(summary_curve1) <- varname
  return(summary_curve1)
}
lapply(names(DATApiv)[3:ncol(DATApiv)], function(Prof)run_mod(Prof, DATApiv))
NOTE: Prof is the column in my data frame that corresponds to the independent variable (like the x column in this example). DATApiv is my data frame.
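A likely cause, though this is an assumption since DATApiv is not shown: if DATApiv is a tibble, data[, varname] returns a one-column tibble (a list) rather than a vector, which is exactly what the 'invalid type (list)' error complains about. Extracting the column with [[ gives a plain vector:
run_mod <- function(varname, data){
  data$Y <- data[[varname]]                           # [[ returns a vector even for tibbles
  model <- lm(Y ~ Prof, data)                         # Linear model
  seg_model <- segmented(model, seg.Z = ~ Prof)       # Segmented model
  breakpoint <- as.matrix(seg_model$psi.history[[5]]) # Extract breakpoint
  coefficients <- as.matrix(seg_model$coefficients)   # Extract coefficients
  summary_curve <- as.data.frame(rbind(breakpoint, coefficients))
  colnames(summary_curve) <- varname
  return(summary_curve)
}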

Simple approach to assigning clusters for new data after k-means clustering

I'm running k-means clustering on a data frame df1, and I'm looking for a simple approach to computing the closest cluster center for each observation in a new data frame df2 (with the same variable names). Think of df1 as the training set and df2 as the testing set; I want to cluster on the training set and assign each test point to the correct cluster.
I know how to do this with the apply function and a few simple user-defined functions (previous posts on the topic have usually proposed something similar):
df1 <- data.frame(x=runif(100), y=runif(100))
df2 <- data.frame(x=runif(100), y=runif(100))
km <- kmeans(df1, centers=3)
closest.cluster <- function(x) {
  cluster.dist <- apply(km$centers, 1, function(y) sqrt(sum((x-y)^2)))
  return(which.min(cluster.dist)[1])
}
clusters2 <- apply(df2, 1, closest.cluster)
However, I'm preparing this clustering example for a course in which students will be unfamiliar with the apply function, so I would much prefer if I could assign the clusters to df2 with a built-in function. Are there any convenient built-in functions to find the closest cluster?
You could use the flexclust package, which has an implemented predict method for k-means:
library("flexclust")
data("Nclus")
set.seed(1)
dat <- as.data.frame(Nclus)
ind <- sample(nrow(dat), 50)
dat[["train"]] <- TRUE
dat[["train"]][ind] <- FALSE
cl1 = kcca(dat[dat[["train"]]==TRUE, 1:2], k=4, kccaFamily("kmeans"))
cl1
#
# call:
# kcca(x = dat[dat[["train"]] == TRUE, 1:2], k = 4)
#
# cluster sizes:
#
# 1 2 3 4
#130 181 98 91
pred_train <- predict(cl1)
pred_test <- predict(cl1, newdata=dat[dat[["train"]]==FALSE, 1:2])
image(cl1)
points(dat[dat[["train"]]==TRUE, 1:2], col=pred_train, pch=19, cex=0.3)
points(dat[dat[["train"]]==FALSE, 1:2], col=pred_test, pch=22, bg="orange")
There are also conversion methods to convert the results from cluster functions like stats::kmeans or cluster::pam to objects of class kcca and vice versa:
as.kcca(cl, data=x)
# kcca object of family ‘kmeans’
#
# call:
# as.kcca(object = cl, data = x)
#
# cluster sizes:
#
# 1 2
# 50 50
Something I noticed about both the approach in the question and the flexclust approach is that they are rather slow (benchmarked here on a training and a testing set with 1 million observations and 2 features each).
Fitting the original model is reasonably fast:
set.seed(144)
df1 <- data.frame(x=runif(1e6), y=runif(1e6))
df2 <- data.frame(x=runif(1e6), y=runif(1e6))
system.time(km <- kmeans(df1, centers=3))
# user system elapsed
# 1.204 0.077 1.295
The solution I posted in the question is slow at calculating the test-set cluster assignments, since it separately calls closest.cluster for each test-set point:
system.time(pred.test <- apply(df2, 1, closest.cluster))
# user system elapsed
# 42.064 0.251 42.586
Meanwhile, the flexclust package seems to add a lot of overhead regardless of whether we convert the fitted model with as.kcca or fit a new one ourselves with kcca (though the prediction at the end is much faster)
# APPROACH #1: Convert from the kmeans() output
system.time(km.flexclust <- as.kcca(km, data=df1))
# user system elapsed
# 87.562 1.216 89.495
system.time(pred.flexclust <- predict(km.flexclust, newdata=df2))
# user system elapsed
# 0.182 0.065 0.250
# Approach #2: Fit the k-means clustering model in the flexclust package
system.time(km.flexclust2 <- kcca(df1, k=3, kccaFamily("kmeans")))
# user system elapsed
# 125.193 7.182 133.519
system.time(pred.flexclust2 <- predict(km.flexclust2, newdata=df2))
# user system elapsed
# 0.198 0.084 0.302
It seems that there is another sensible approach here: using a fast k-nearest neighbors solution like a k-d tree to find the nearest neighbor of each test-set observation within the set of cluster centroids. This can be written compactly and is relatively speedy:
library(FNN)
system.time(pred.knn <- get.knnx(km$centers, df2, 1)$nn.index[,1])
# user system elapsed
# 0.315 0.013 0.345
all(pred.test == pred.knn)
# [1] TRUE
You can use the ClusterR::KMeans_rcpp() function, which uses RcppArmadillo. It allows multiple initializations (which can be parallelized if OpenMP is available). Besides the optimal_init, quantile_init, random and kmeans++ initializations, one can specify the centroids using the CENTROIDS parameter. The running time and convergence of the algorithm can be adjusted using the num_init, max_iters and tol parameters.
library(scorecard)
library(ClusterR)
library(dplyr)
library(ggplot2)
## Generate data
set.seed(2019)
x = c(rnorm(200000, 0,1), rnorm(150000, 5,1), rnorm(150000,-5,1))
y = c(rnorm(200000,-1,1), rnorm(150000, 6,1), rnorm(150000, 6,1))
df <- split_df(data.frame(x,y), ratio = 0.5, seed = 123)
system.time(
kmrcpp <- KMeans_rcpp(df$train, clusters = 3, num_init = 4, max_iters = 100, initializer = 'kmeans++'))
# user system elapsed
# 0.64 0.05 0.82
system.time(pr <- predict_KMeans(df$test, kmrcpp$centroids))
# user system elapsed
# 0.01 0.00 0.02
p1 <- df$train %>% mutate(cluster = as.factor(kmrcpp$clusters)) %>%
ggplot(., aes(x,y,color = cluster)) + geom_point() +
ggtitle("train data")
p2 <- df$test %>% mutate(cluster = as.factor(pr)) %>%
ggplot(., aes(x,y,color = cluster)) + geom_point() +
ggtitle("test data")
gridExtra::grid.arrange(p1,p2,ncol = 2)
