Can I make this R foreach loop faster? - r

Thanks in advance for your help.
The short of this is that I have huge foreach loops that are running much slower than I'm used to, and I'm curious as to whether I can speed them up -- it's taking hours (maybe even days).
So, I've been given two large pieces of data ( by friend's who needs help). The first is a very large matrix (728396 rows by 276 columns) of genetic data for 276 participants (I'll call this M1). The second is a dataset (276 rows and 34 columns) of other miscellaneous data about the participants (I'll call this DF1). We're running a multilevel logistic regression model utilizing both sets of data.
I'm using a Windows PC with 8 virtual cores running at 4.7ghz and 36gb of ram.
Here's a portion of the code I've written/modified:
library(pacman)
p_load(car, svMisc, doParallel, foreach, tcltk, lme4, lmerTest, nlme)
load("M1.RDATA")
load("DF1.RDATA")
clust = makeCluster(detectCores() - 3, outfile="")
#I have 4 physical cores, 8 virtual. I've been using 5 because my cpu sits at about 89% like this.
registerDoParallel(clust)
getDoParWorkers() #5 cores
n = 728396
res_function = function (i){
x = as.vector(M1[i,])
#Taking one row of genetic data to be used in the regression
fit1 = glmer(r ~ x + m + a + e + n + (1 | famid), data = DF1, family = binomial(link = "logit"))
#Running the model
c(coef(summary(fit1))[2,1:4], coef(summary(fit1))[3:6,1], coef(summary(fit1))[3:6,4], length(fit1#optinfo[["conv"]][["lme4"]][["messages"]]))
#Collecting data, including whether there are any convergence error messages
}
start_time = Sys.time()
model1 = foreach(i = 1:n, .packages = c("tcltk", "lme4"), .combine = rbind) %dopar% {
if(!exists("pb")) pb <- tkProgressBar("Parallel task", min=1, max=n)
setTkprogressBar(pb, i)
#This is some code I found here to keep track of my progress
res_function(i)
}
end_time = Sys.time()
end_time - start_time
stopCluster(clust)
showConnections()
I've run nearly identical code in the past and it took me only about 13 minutes. However, I suspect that this model is taking up more memory than usual on each core (likely due to the second level) and slowing things down. I've read that BiocParallel, Future, or even Microsoft R Open might work better, but I haven't had much success using any of them (likely due to my own lack of know how). I've also read a bit about the package "bigmemory" to more efficiently use the large matrix across cores, but I ran into several errors when I tried to use it (failed workers and such). I'm also curious about the potential of using my GPU (a Titan X Pascal) for some additional umph if anyone knows more about this.
Any advice would be very appreciated!

Related

Consensus clustering with diceR package

I am supposed to perform a combined K-means + Gaussian mixture Models to determine a set of consensus clusters for a fixes number of clusters (k = 4). My data is composed of 231 cells from 4 different types of tumor which have a total of 19'177 variables (genes in this case).
I have never tried to perform this and I tried to follow the instructions from this R package : https://search.r-project.org/CRAN/refmans/diceR/html/consensus_cluster.html
However I must have done something wrong since when I try to run the code:
cc <- consensus_cluster(data, nk = 4, algorithms =c("gmm", "km"), progress = F )
it takes way too much time and ends up saying this error:
Error: cannot allocate vector of size 11.0 Gb
So clearly my generated vector is too heavy and I must have understood things wrong in the tutorial.
Is someone familiar with diceR package and could explain to me if there is a way to make it work?
The consensus_cluster during it's execution "eats up" memory of R session. You have so many variables that their handling cannot be allocated in the memory.
So you have two choices: increase physical memory or use not full data, but its partial sample. Let's assume that physical memory increase is not feasible. Then you should use prep.data = "sample" option. However you'll need to wait. I model data and for GMM it was 8 hours to wait.
Please see below:
library(diceR)
observ = 23
variables = 19177
dat <- matrix(rnorm(observ * variables), ncol = variables)
cc <- consensus_cluster(dat, nk = 4, algorithms =c("gmm", "km"), progress = TRUE,
prep.data = "sample")
Output (was not so patient to wait):
Clustering Algorithm 1 of 2: GMM (k = 4) [---------------------------------] 1% eta: 8h

How long for forecast::auto.arima to work?

I am working through some demo code that accompanied a medium post on high frequency time series forecasting using the forecast::auto.arima function. Whether in this application or when I have tried other datasets, I have never been able to get a result from this function - it does seem to stop calculating once I have executed it. Others have, obviously; so I'm asking how long you need to wait to get a result from the function.
Example
Data on hourly energy use can be downloaded from Kaggle.
Using that data:
library(tidyverse)
library(forecast)
duq_pc <- read.csv(file.choose(),stringsAsFactors = F)
duq_new <- duq_pc[duq_pc$Datetime >= '2013-01-01 00:00:00' & duq_pc$Datetime <= '2017-09-30 00:00:00',]
duq_train <- duq_new[duq_new$Datetime <= '2016-12-31',]
duq_test <- duq_new[duq_new$Datetime >= '2017-01-01',]
msts_power <- msts(duq_train$DUQ_MW, seasonal.periods = c(24,24*7,24*365.25), start = decimal_date(as.POSIXct("2013-01-01 00:00:00")))
#Dynamic Harmonic regression with Auto Arima
fourier_power <- auto.arima(msts_power, seasonal=FALSE, lambda=0,
xreg=fourier(msts_power, K=c(10,10,10)))
Very interested to hear whether this is something that I would need to leave running overnight or whether other people are getting results in minutes.
In my case, running your code and measuring the times in between, it took about 40 minutes to finish. For what it's worth, I launched the script on a computer with an AMD Ryzen 2700 Eight-Core Processor 3.20 GHZ, 16 GB of RAM.
It really depends on the size of your dataset and your computer specs. You can use the tictoc library for an easy way to record time, and set trace = TRUE in your auto.arima call to output its progress to console.
library(tictoc)
tic()
fourier_power <- auto.arima(msts_power, seasonal=FALSE, lambda=0, trace = TRUE
xreg=fourier(msts_power, K=c(10,10,10)))
toc()

Accelerate for-loop containing coxph function call

Okay. So this is a similar question to one I posted before for which I still have no satisfactory solution.
As you will see below, I have used some example data to build a Cox PH model which is then passed to a custom function containing a for-loop and a function, predictSurvProb from pec, which ultimately populates a pre-allocated empty vector prediction.
I have tried using cmpfun from compiler. However, there is no improvement in performance. Am I destined to just live with this slow processing speed or are there any ways that I can speed up the processing? I cannot code in C++ so Rcpp is not an option I'd imagine.
Thanks.
require(dplyr, survival, pec)
cox_model <- coxph(Surv(time, status) ~ sex, data = lung)
surv_preds <- function(model, query) {
prediction <- vector(mode = "numeric", length = nrow(query))
time <- 30
for(i in 1:nrow(query)) {
prediction[i] <- predictSurvProb(model, newdata = query[i, ], times = query[i, "time"] + time)
}
prediction
}
surv_preds(cox_model, lung)
After exhausting myself with attempts at C++ and parallel computing efforts, I have found that simply converting factor variables to integers has significantly improved my application. The improvement reduces the processing time to about 60 hours from almost a week!
I suspect that this is still not the most efficient solution but it will have to do for now.

makeCluster with parallelSVM in R takes up all Memory and swap

I'm trying to train a SVM model on a large dataset(~110k training points). This is a sample of the code where I am using the parallelSVM package to parallelize the training step on a subset of the training data on my 4 core Linux machine.
numcore = 4
train.time = c()
for(i in 1:5)
{
cl = makeCluster(4)
registerDoParallel(cores=numCore)
getDoParWorkers()
dummy = train_train[1:10000*i,]
begin = Sys.time()
model.svm = parallelSVM(as.factor(target) ~ .,data =dummy,
numberCores=detectCores(),probability = T)
end = Sys.time() - begin
train.time = c(train.time,end)
stopCluster(cl)
registerDoSEQ()
}
The idea of this snippet of code is to estimate the time it'll take to train the model on the entire dataset by gradually increasing the size of the dummy training set. After running the code above for 10,000 and 20,000 training samples, this is the memory and swap history usage statistic from the System Monitor.After 4 runs of the for loop,both the memory and swap usage is about 95%,and I get the following error :
Error in summary.connection(connection) : invalid connection
Any ideas on how to manage this problem? Is there a way to deallocate the memory used by a cluster after using the stopCluster() function ?
Please take into consideration the fact that I am an absolute beginner in this field. A short explanation of the proposed solutions will be greatly appreciated. Thank you.
Your line
registerDoParallel(cores=numCore)
creates a new cluster with number of nodes equal to numCore (which you haven't stated). This cluster is never destroyed, so with each iteration of the loop you're starting more new R processes. Since you're already creating a cluster with cl = makeCluster(4), you should use
registerDoParallel(cl)
instead.
(And move the makeCluster, registerDoParallel, stopCluster and registerDoSEQ calls outside the loop.)

Get randomForest regression faster in R

I have to make a regression with randomforest in R. My problem is that my dataframe is huge: I have 12 variables and more than 400k entries. When I try - the code is written in the bottom - to get a randomForest regression the system takes many hours to process the data: after 5, 6 hours of calculation, I am obliged to stop the operation without any output. Someone can suggests me how I can get it faster?
Thanks
library(caret)
library(randomForest)
dataset <- read.csv("/home/anonimo/Modelli/total_merge.csv", header=TRUE)
dati <- data.frame(dataset)
attach(dati)
trainSet <- dati[2:107570,]
testSet <- dati[107570:480343,]
output.forest <- randomForest(dati$Clip_pm25 ~ dati$e_1 + dati$Clipped_so + dati$Clip_no2 + dati$t2m_1 + dati$tp_1 + dati$Clipped_nh + dati$Clipped_co + dati$Clipped_o3 + dati$ssrd_1 + dati$Clipped_no + dati$Clip_pm10 + dati$sp_1, data=trainSet, ntree=250)
I don't think to parallelize on a single PC (2-4 cores) is the answer. There are plenty of lower hanging fruits to pick.
1) RF models increase in complexity with number of training samples. The average tree depth would be something like log(480,000/5)/log(2) = 16.5 intermediary nodes. In the vast majority of examples 2000-10000 samples per tree is fine. If you competing to win on kaggle, a small extra performance really matters, as winner takes all. In practice, you probably don't need that.
2) Don't clone you data set in your R code and try to only keep one copy of your data set (pass by reference is of course fine). It's not a big problem for this data set, as the dataset is not that big (~38Mb) even for R.
3) Don't use formula interface with randomForest algorithm for large datasets. It will make an extra copy of the data set. But again memory is not that much of a problem.
4) Use a faster RF algorithm: extraTrees, ranger or Rborist are available for R. extraTrees is not exactly a RF algorithm but pretty close.
5) avoid categorical features with more than 10 categories. RF can handle up to 32, but becomes super slow as any 2^32 possible split has to be evaluated. extraTrees and Rborist handle more categories by only testing some random selected splits (which works fine). Another solution as in the python-sklearn every category are assigned a unique integer, and the feature is handled as numeric. You can convert your categorical features with as.numeric and before runing randomForest to do the same trick.
6) For much bigger data. Split the data set in random blocks and train a few(~10) trees on each. Combine forests or save forests separate. This will slightly increase the tree correlation. There are some nice cluster implementation to train like these. But won't be necessary for datasets below 1-100Gb, depending on tree complexity etc.
#below I use solution 1-3) and get a run time of some minutes
library(randomForest)
#simulate data
dataset <- data.frame(replicate(12,rnorm(400000)))
dataset$Clip_pm25 = dataset[,1]+dataset[,2]^2+dataset[,4]*dataset[,3]
#dati <- data.frame(dataset) #no need to keep the data set, an extra time in memory
#attach(dati) #if you attach dati you don't need to write data$Clip_pm25, just Clip_pm25
#but avoid formula interface for randomForest for large data sets because it cost extra memory and time
#split data in X and y manually
y = dataset$Clip_pm25
X = dataset[,names(dataset) != "Clip_pm25"]
rm(dataset);gc()
object.size(X) #38Mb, no problemo
#if you were using formula interface
#output.forest <- randomForest(dati$Clip_pm25 ~ dati$e_1 + dati$Clipped_so + dati$Clip_no2 + dati$t2m_1 + dati$tp_1 + dati$Clipped_nh + dati$Clipped_co + dati$Clipped_o3 + dati$ssrd_1 + dati$Clipped_no + dati$Clip_pm10 + dati$sp_1, data=trainSet, ntree=250)
#output.forest <- randomForest(dati$Clip_pm25 ~ ., ntree=250) # use dot to indicate all variables
#start small, and scale up slowly
rf = randomForest(X,y,sampsize=1000,ntree=5) #runtime ~15 seconds
print(rf) #~67% explained var
#you probably really don't need to exeed 5000-10000 samples per tree, you could grow 2000 trees to sample most of training set
rf = randomForest(X,y,sampsize=5000,ntree=500) # runtime ~5 minutes
print(rf) #~87% explained var
#regarding parallel
#here you could implement some parallel looping
#.... but is it really worth for a 2-4 x speedup?
#coding parallel on single PC is fun but rarely worth the effort
#If you work at some company or university with a descent computer cluster,
#then you can spawn the process across 20-80-200 nodes and get a ~10-60-150 x speedup
#I can recommend the BatchJobs package
Since you are using caret, you could use the method = "parRF". This is an implementation of parallel randomforest.
For example:
library(caret)
library(randomForest)
library(doParallel)
cores <- 3
cl <- makePSOCKcluster(cores)
registerDoParallel(cl)
dataset <- read.csv("/home/anonimo/Modelli/total_merge.csv", header=TRUE)
dati <- data.frame(dataset)
attach(dati)
trainSet <- dati[2:107570,]
testSet <- dati[107570:480343,]
# 3 times cross validation.
my_control <- trainControl(method = "cv", number = 3 )
my_forest <- train(Clip_pm25 ~ e_1 + Clipped_so + Clip_no2 + t2m_1 + tp_1 + Clipped_nh + Clipped_co + Clipped_o3 + ssrd_1 + Clipped_no + Clip_pm10 + sp_1, ,
data = trainSet,
method = "parRF",
ntree = 250,
trControl=my_control)
Here is a foreach implementation as well:
foreach_forest <- foreach(ntree=rep(250, cores),
.combine=combine,
.multicombine=TRUE,
.packages="randomForest") %dopar%
randomForest(Clip_pm25 ~ e_1 + Clipped_so + Clip_no2 + t2m_1 + tp_1 + Clipped_nh + Clipped_co + Clipped_o3 + ssrd_1 + Clipped_no + Clip_pm10 + sp_1,
data = trainSet, ntree=ntree)
# don't forget to stop the cluster
stopCluster(cl)
Remember I didn't set any seeds. You might want to consider this as well. And here is a link to a randomforest package that also runs in parallel. But I have not tested this.
The other two answers are good. Another option is to actually use more recent packages that are purpose-built for highly dimensional / high volume data sets. They run their code using lower-level languages (C++ and/or Java) and in certain cases use parallelization.
I'd recommend taking a look into these three:
ranger (uses C++ compiler)
randomForestSRC (uses C++ compiler)
h2o (Java compiler - needs Java version 8 or higher)
Also, some additional reading here to give you more to go off on which package to choose: https://arxiv.org/pdf/1508.04409.pdf
Page 8 shows benchmarks showing the performance improvement of ranger against randomForest against growing data size - ranger is WAY faster due to linear growth in runtime rather than non-linear for randomForest for rising tree/sample/split/feature sizes.
Good Luck!

Resources