speed up model fitting in a foreach loop in R

I have a very large dataframe with more than 5 million observations and 7 variables.
I have to do thousands of bootstrap resamples, fit a model to each sample, make predictions, and save the results in a dataframe where each column holds the predictions of one of the models.
Here is my code:
library(caret)
library(randomForest)
library(foreach)
library(doParallel)
# very small example dataframe
db <- data.frame(Y  = as.factor(sample(c(0, 1), 1000, replace = TRUE)),
                 X1 = sample(c(100:2000), 1000, replace = TRUE),
                 X2 = sample(c(100:2000), 1000, replace = TRUE),
                 X3 = sample(c(100:2000), 1000, replace = TRUE),
                 X4 = sample(c(100:2000), 1000, replace = TRUE))
# bootstrap 100 times
times <- 100
set.seed(5)
# create 100 sets of resampling indices
res <- createResample(db$Y, times = times)
# list of resampled dataframes
db_list <- list()
for (i in 1:times) {
  db_list[[i]] <- db[res[[i]], ]
}
# create 100 models from db_list and make predictions
model_list <- list()
pred <- data.frame(Y = rep(NA, nrow(db)))
# on my machine this loop takes 28.10 sec elapsed
system.time(for (l in seq_len(times)) {
  set.seed(5)
  model_list[[l]] <- randomForest(Y ~ ., data = db_list[[l]])
  pred[, l] <- predict(model_list[[l]], newdata = db[, -1]) # predict from the original Xn variables
  db_pred <- db
  db_pred <- cbind(db_pred, pred)
})
# make a cluster with the 11 cores on my machine
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)
# on my machine this loop takes 23.25 sec elapsed
# note: assignments made inside %dopar% happen on worker copies and are lost,
# so each iteration has to return its predictions for foreach to collect
system.time({
  pred_list <- foreach(l = seq_len(times), .packages = "randomForest") %dopar% {
    set.seed(5)
    model <- randomForest(Y ~ ., data = db_list[[l]])
    predict(model, newdata = db[, -1]) # predict from the original Xn variables
  }
  names(pred_list) <- paste0("pred", seq_len(times))
  db_pred <- cbind(db, as.data.frame(pred_list))
})
stopCluster(cl)
The code works well with a small dataframe but takes forever with my real data, despite the foreach loop.
I've tried caret::train(), but I have not found a way to store the predictions of all the models in a dataframe.
My questions are:
How can I speed up the process?
Are there classifiers faster than random forest that could be used with this data?
Thanks in advance for any help.

Related

speed up replication of rows using model

I would like to create replicate predictions for one integer independent variable (iv1), given some model and a data frame called training. This is my current approach. I appreciate that this is not self-contained, but hopefully it is self-explanatory:
number_of_samples <- 10
results <- NULL
for (row in 1:nrow(training)) {
  fake_iv1_values <- sample(1:100, number_of_samples)
  case <- training[row, ]
  for (iv1 in fake_iv1_values) {
    case$iv1 <- iv1
    case$prediction <- predict(some_model, newdata = case)
    results <- rbind(results, case)
  }
}
Using loops is very slow. I wonder if this could be sped up? Thanks!
Try this.
Reproducible fake data and model:
# create fake data
n_row <- 100
n_xs <- 100
training <- data.frame(y = rnorm(n_row), iv1 = rnorm(n_row))
training[, paste0("x", 1:n_xs)] <- replicate(n_xs, list(rnorm(n_row)))
# example model
some_model <- lm(y ~ ., training)
Rewritten code:
number_of_samples <- 10
# one long vector: number_of_samples fake iv1 values for each training row
fake_iv1_values <- as.numeric(replicate(nrow(training), sample(1:100, number_of_samples)))
# replicate each row of the original dataframe
results <- training[rep(seq_len(nrow(training)), each = number_of_samples), ]
# add the fake values to the replicated dataframe
results$iv1 <- fake_iv1_values
# get predictions
results$prediction <- predict(some_model, newdata = results)
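As a quick sanity check (a sketch, assuming the fake data above), each original row should now appear number_of_samples times, in the original row order:
# 10 predictions per original training row
stopifnot(nrow(results) == nrow(training) * number_of_samples)
head(results[, c("iv1", "prediction")])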

R list containing training set and test set objects

I am trying to create 10 folds of my data. What I want is a data structure of length 10 (the number of folds) in which each element contains an object with two elements: the training set and the test set for that fold.
I wanted to access, for example, the training set at fold 8 with View(data_pairs[[8]]$training_set), but it did not work. Any help would be appreciated :) This is my R code:
k <- 10 # number of folds
i <- 1:k
folds <- sample(i, nrow(data), replace = TRUE)
data_pairs <- list()
for (j in i) {
  test_ind <- which(folds == j, arr.ind = TRUE)
  test <- data[test_ind, ]
  train <- data[-test_ind, ]
  data_pair <- list(training_set = list(train), test_set = list(test))
  data_pairs <- append(x = data_pairs, values = data_pair)
}
You were very close: you just needed to wrap values in a list call (and not wrap the individual train and test data frames in list).
k <- 10 # number of folds
i <- 1:k
folds <- sample(i, nrow(mtcars), replace = TRUE)
data_pairs <- list()
for (j in i) {
  test_ind <- which(folds == j, arr.ind = TRUE)
  test <- mtcars[test_ind, ]
  train <- mtcars[-test_ind, ]
  data_pair <- list(training_set = train, test_set = test)
  data_pairs <- append(x = data_pairs, values = list(data_pair))
  # data_pairs <- c(data_pairs, list(data_pair))
}
If your data is big, I would suggest you read these two posts on more efficient ways to grow a list.
Append an object to a list in R in amortized constant time, O(1)?
Here we go again: append an element to a list in R
I would also like to point out that you are not creating "folds" of your data. You are attempting a 10-fold cross-validation, which means your data should be separated into 10 roughly equal-sized chunks; you then create 10 train/test data sets, using each fold once as the test data and the rest for training.
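For reference, a minimal sketch of building roughly equal-sized folds in base R (sticking with mtcars, as above):
k <- 10
# shuffle the row indices, then cut them into k roughly equal groups
shuffled <- sample(seq_len(nrow(mtcars)))
fold_ids <- cut(seq_along(shuffled), breaks = k, labels = FALSE)
folds <- split(shuffled, fold_ids)
data_pairs <- lapply(folds, function(test_ind) {
  list(training_set = mtcars[-test_ind, ], test_set = mtcars[test_ind, ])
})
# View(data_pairs[[8]]$training_set) now works as intended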
It seems like the package modelr could help you here.
In particular I would point you to:
https://modelr.tidyverse.org/reference/resample_partition.html
library(modelr)
ex <- resample_partition(mtcars, c(test = 0.3, train = 0.7))
mod <- lm(mpg ~ wt, data = ex$train)
rmse(mod, ex$test)
#> [1] 3.229756
rmse(mod, ex$train)
#> [1] 2.88216
Alternatively, producing a dataset of these partitions can be done with:
crossv_mc(data, n, test = 0.2, id = ".id")
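For example, a minimal sketch along the lines of the modelr documentation (mtcars and the model choice are just illustrative):
library(modelr)
cv <- crossv_mc(mtcars, n = 10) # 10 random train/test splits
models <- lapply(cv$train, function(tr) lm(mpg ~ wt, data = as.data.frame(tr)))
mapply(rmse, models, cv$test)   # test RMSE per split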

Plotting critical differences in R with imported data

A critical difference (CD) plot for comparing classifiers over multiple data sets (Demšar 2006) can be generated with the mlr package like this:
# THIS WORKS
library(mlr)
lrns = list(makeLearner("classif.knn"), makeLearner("classif.svm"))
tasks = list(iris.task, sonar.task)
rdesc = makeResampleDesc("CV", iters = 2L)
meas = list(acc)
bmr = benchmark(lrns, tasks, rdesc, measures = meas)
cd = generateCritDifferencesData(bmr)
plotCritDifferences(cd)
This requires the evaluation results to reside in a rather complex BenchmarkResult object, although the data is basically a matrix (where M[i, j] holds the score of classifier i for data set j).
I have previously generated such data in a Python workflow and imported it into an R data.frame (as there seems to be no Python package for such plots).
How can I generate a CD plot from this data?
I thought about creating a BenchmarkResult from the data.frame, but didn't know where to start:
# THIS DOES NOT WORK
library(mlr)
# Here I would import results from my experiments instead of using random data
# e.g. scores for 5 classifiers and 30 data sets, each
results = data.frame(replicate(5, runif(30, 0, 1)))
# This is the functionality I'm looking for
bmr = benchmarkResultFromDataFrame(results)
cd = generateCritDifferencesData(bmr)
plotCritDifferences(cd)
I finally managed to create the plot. It is necessary to set only a handful of the BenchmarkResult's attributes:
learners, with id and short.name for each classifier
measures
results, with aggr for each dataset/classifier combination
The code may then look like this (smaller example of 5 datasets):
library(mlr)
# Here I would import results from my experiments instead of using random data
# e.g. scores for 5 classifiers and 30 data sets, each
results <- data.frame(replicate(5, runif(30, 0, 1)))
clf <- c('clf1', 'clf2', 'clf3', 'clf4', 'clf5')
clf.short.name <- c('c1', 'c2', 'c3', 'c4', 'c5')
dataset <- c('dataset1', 'dataset2', 'dataset3', 'dataset4', 'dataset5')
score <- list(acc)
# Setting up the learners: id, short.name
bmr <- list()
for (i in 1:5) {
  bmr$learners[[clf[i]]]$id <- clf[i]
  bmr$learners[[clf[i]]]$short.name <- clf.short.name[i]
}
# Setting up the measures
bmr$measures <- list(acc)
# Setting up the results
for (i in 1:5) {
  bmr$results$`dataset1`[[clf[i]]]$aggr <- list('acc.test.mean' = results[1, i])
}
for (i in 1:5) {
  bmr$results$`dataset2`[[clf[i]]]$aggr <- list('acc.test.mean' = results[2, i])
}
for (i in 1:5) {
  bmr$results$`dataset3`[[clf[i]]]$aggr <- list('acc.test.mean' = results[3, i])
}
for (i in 1:5) {
  bmr$results$`dataset4`[[clf[i]]]$aggr <- list('acc.test.mean' = results[4, i])
}
for (i in 1:5) {
  bmr$results$`dataset5`[[clf[i]]]$aggr <- list('acc.test.mean' = results[5, i])
}
# Set BenchmarkResult class
class(bmr) <- "BenchmarkResult"
# Statistics and plot
cd = generateCritDifferencesData(bmr)
plotCritDifferences(cd)
Anyone who could teach me better R to avoid these for loops and code duplication would still be very welcome!
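One way to collapse the five duplicated result loops into a single nested loop (a sketch that builds the same bmr$results as above):
# same assignments as the five loops above, without the duplication
for (d in seq_along(dataset)) {
  for (i in seq_along(clf)) {
    bmr$results[[dataset[d]]][[clf[i]]]$aggr <- list('acc.test.mean' = results[d, i])
  }
}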

Parallel Computing in R: how to use the cores

I am currently trying parallel computing in R.
I am trying to train a logistic ridge model, and I currently have 4 cores on my computer. I would like to split my data set equally into 4 pieces, use each core to train a model on its part of the training data, and save the results of each core into a single vector. The problem is that I have no clue how to do it. Right now I have tried to parallelize with the foreach package, but the problem is that each core sees the same training data. Here is the code with the foreach package (which doesn't split the data):
library(ridge)
library(parallel)
library(foreach)
library(iterators) # provides icount()
num_of_cores <- detectCores()
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
data_per_core <- floor(nrow(mydata) / num_of_cores)
result <- data.frame()
r <- foreach(icount(4), .combine = cbind) %dopar% {
  result <- logisticRidge(admit ~ gre + gpa + rank, data = mydata)
  coefficients(result)
}
Any idea how to simultaneously split the data into x chunks and train the models in parallel?
How about something like this? It uses snowfall instead of the foreach library, but should give the same results.
library(snowfall)
library(ridge)
# for reproducibility
set.seed(123)
num_of_cores <- parallel::detectCores()
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
data_per_core <- floor(nrow(mydata) / num_of_cores)
# assign each row to a random chunk via sampleid
mydata$sampleid <- sample(1:num_of_cores, nrow(mydata), replace = TRUE)
# create a small function that calculates the coefficients
regfun <- function(dat) {
  # this has to be inside the function, otherwise snowfall doesn't know logisticRidge
  library(ridge)
  result <- logisticRidge(admit ~ gre + gpa + rank, data = dat)
  coefs <- as.numeric(coefficients(result))
  return(coefs)
}
# prepare the data: one chunk per core
datlist <- lapply(1:num_of_cores, function(i) {
  mydata[mydata$sampleid == i, ]
})
# initiate the cluster
sfInit(parallel = TRUE, cpus = num_of_cores)
# export the function to the cluster (the data is passed as an argument below)
sfExport("regfun")
# calculate (sfClusterApply is very similar to sapply)
res <- sfClusterApply(datlist, function(datlist.element) {
  regfun(dat = datlist.element)
})
# stop the cluster
sfStop()
# convert the list to a data.frame; data.table::rbindlist(list(res)) does the same job
res <- data.frame(t(matrix(unlist(res), ncol = num_of_cores)))
names(res) <- c("intercept", "gre", "gpa", "rank")
res
# res
#   intercept          gre           gpa         rank
# 1 -3.002592 1.558363e-03  0.7048146997 -0.382462408
# 2 -4.142939 1.060692e-03  0.9978841880 -0.314589628
# 3 -2.967130 2.315487e-03  0.6797382218 -0.464219036
# 4 -1.176943 4.786894e-05 -0.0004576679 -0.007618317
The itertools package provides a number of functions for iterating over various data structures with foreach loops. In this case, you could use the isplitRows function to split the data frame row-wise into one chunk per worker:
library(ridge)
library(doParallel)
library(itertools)
num_of_cores <- detectCores()
cl <- makePSOCKcluster(num_of_cores)
registerDoParallel(cl)
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
r <- foreach(d = isplitRows(mydata, chunks = num_of_cores),
             .combine = cbind, .packages = "ridge") %dopar% {
  result <- logisticRidge(admit ~ gre + gpa + rank, data = d)
  coefficients(result)
}
isplitRows also takes a chunkSize argument if you want to control the maximum size of each chunk.
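For instance (a sketch reusing the setup above):
# cap each chunk at 100 rows instead of fixing the number of chunks
r <- foreach(d = isplitRows(mydata, chunkSize = 100),
             .combine = cbind, .packages = "ridge") %dopar% {
  coefficients(logisticRidge(admit ~ gre + gpa + rank, data = d))
}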
Note that using this technique, each worker only receives an appropriate fraction of mydata. This is particularly important for larger data frames with a PSOCK cluster.

R: Row resampling loop speed improvement

I'm subsampling rows from a dataframe with columns c("x", "y", "density") at a variety of c("s_size", "reps") combinations, where reps = number of replicates and s_size = number of rows subsampled from the whole dataframe.
> head(data_xyz)
x y density
1 6 1 0
2 7 1 17600
3 8 1 11200
4 12 1 14400
5 13 1 0
6 14 1 8000
# Subsampling ###################
subsample_loop <- function(s_size, reps, int) {
  tm1 <- system.time({ # start timer
    subsample_bound <- data.frame()
    # perform subsampling of the general dataframe
    for (s_size in seq(1, s_size, int)) {
      for (reps in 1:reps) {
        subsample <- sample.df.rows(s_size, data_xyz)
        assign(paste("sample", "_", "n", s_size, "_", "r", reps, sep = ""), subsample)
        subsample_replicate <- subsample[, ] # temporary variable
        subsample_replicate <- cbind(subsample,
                                     rep(s_size, length(subsample_replicate[, 1])),
                                     rep(reps, length(subsample_replicate[, 1])))
        subsample_bound <- rbind(subsample_bound, subsample_replicate)
      }
    }
  }) # end timer
  colnames(subsample_bound) <- c("x", "y", "density", "s_size", "reps")
  subsample_bound
} # end function
Here's the function call:
source("R/functions.R")
subsample_data <- subsample_loop(s_size=206, reps=5, int=10)
Here's the row subsample function:
# Samples N rows from a dataframe, returning a dataframe with the same columns
#   df: data frame
#   N:  number of rows to sample
sample.df.rows <- function(N, df, ...) {
  df[sample(nrow(df), N, replace = FALSE, ...), ]
}
It's way too slow; I've tried a few times with apply functions and had no luck. I'll be doing somewhere around 1,000-10,000 replicates for each s_size from 1:250.
Let me know what you think! Thanks in advance.
=========================================================================
UPDATE EDIT: Sample data from which to sample:
https://www.dropbox.com/s/47mpo36xh7lck0t/density.csv
Joran's code in a function (in a sourced function.R file):
foo <- function(i, j, data) {
  res <- data[sample(nrow(data), i, replace = FALSE), ]
  res$s_size <- i
  res$reps <- rep(j, i)
  res
}
resampling_custom <- function(dat, s_size, int, reps) {
  ss <- rep(seq(1, s_size, by = int), each = reps)
  id <- rep(seq_len(reps), times = s_size/int)
  out <- do.call(rbind, mapply(foo, i = ss, j = id, MoreArgs = list(data = dat), SIMPLIFY = FALSE))
}
Calling the function:
set.seed(2)
out <- resampling_custom(dat=retinal_xyz, s_size=206, int=5, reps=10)
This outputs data, unfortunately with the following warning message:
Warning message:
In mapply(foo, i = ss, j = id, MoreArgs = list(data = dat), SIMPLIFY = FALSE) :
  longer argument not a multiple of length of shorter
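The warning seems to come from s_size/int not being a whole number here: ss has length(seq(1, s_size, by = int)) * reps = 42 * 10 = 420 elements, while rep(seq_len(reps), times = s_size/int) truncates 41.2 to 41 and yields only 410, so mapply recycles id. A sketch of a fix inside resampling_custom:
# make id exactly as long as ss by counting the actual steps of seq()
id <- rep(seq_len(reps), times = length(seq(1, s_size, by = int)))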
I put very little thought into actually optimizing this; I was just concentrating on doing something that's at least reasonable while matching your procedure.
Your big problem is that you are growing objects via rbind and cbind. Basically, any time you see someone write data.frame() or c() and then expand that object using rbind, cbind or c, you can be very sure that the resulting code will be essentially the slowest possible way of doing whatever task is being attempted.
This version is around 12-13 times faster, and I'm sure you could squeeze some more out of it if you put some real thought into it:
s_size <- 200
int <- 10
reps <- 30
ss <- rep(seq(1, s_size, by = int), each = reps)
id <- rep(seq_len(reps), times = s_size/int)
foo <- function(i, j, data) {
  res <- data[sample(nrow(data), i, replace = FALSE), ]
  res$s_size <- i
  res$reps <- rep(j, i)
  res
}
out <- do.call(rbind, mapply(foo, i = ss, j = id, MoreArgs = list(data = dat), SIMPLIFY = FALSE))
The best part about R is that not only is this way, way faster, it's also way less code.
