Accelerate for-loop containing coxph function call - r

Okay. So this is a similar question to one I posted before for which I still have no satisfactory solution.
As you will see below, I have used some example data to build a Cox PH model which is then passed to a custom function containing a for-loop and a function, predictSurvProb from pec, which ultimately populates a pre-allocated empty vector prediction.
I have tried using cmpfun from compiler. However, there is no improvement in performance. Am I destined to just live with this slow processing speed or are there any ways that I can speed up the processing? I cannot code in C++ so Rcpp is not an option I'd imagine.
Thanks.
require(dplyr, survival, pec)
cox_model <- coxph(Surv(time, status) ~ sex, data = lung)
surv_preds <- function(model, query) {
prediction <- vector(mode = "numeric", length = nrow(query))
time <- 30
for(i in 1:nrow(query)) {
prediction[i] <- predictSurvProb(model, newdata = query[i, ], times = query[i, "time"] + time)
}
prediction
}
surv_preds(cox_model, lung)

After exhausting myself with attempts at C++ and parallel computing efforts, I have found that simply converting factor variables to integers has significantly improved my application. The improvement reduces the processing time to about 60 hours from almost a week!
I suspect that this is still not the most efficient solution but it will have to do for now.

Related

Can I make this R foreach loop faster?

Thanks in advance for your help.
The short of this is that I have huge foreach loops that are running much slower than I'm used to, and I'm curious as to whether I can speed them up -- it's taking hours (maybe even days).
So, I've been given two large pieces of data ( by friend's who needs help). The first is a very large matrix (728396 rows by 276 columns) of genetic data for 276 participants (I'll call this M1). The second is a dataset (276 rows and 34 columns) of other miscellaneous data about the participants (I'll call this DF1). We're running a multilevel logistic regression model utilizing both sets of data.
I'm using a Windows PC with 8 virtual cores running at 4.7ghz and 36gb of ram.
Here's a portion of the code I've written/modified:
library(pacman)
p_load(car, svMisc, doParallel, foreach, tcltk, lme4, lmerTest, nlme)
load("M1.RDATA")
load("DF1.RDATA")
clust = makeCluster(detectCores() - 3, outfile="")
#I have 4 physical cores, 8 virtual. I've been using 5 because my cpu sits at about 89% like this.
registerDoParallel(clust)
getDoParWorkers() #5 cores
n = 728396
res_function = function (i){
x = as.vector(M1[i,])
#Taking one row of genetic data to be used in the regression
fit1 = glmer(r ~ x + m + a + e + n + (1 | famid), data = DF1, family = binomial(link = "logit"))
#Running the model
c(coef(summary(fit1))[2,1:4], coef(summary(fit1))[3:6,1], coef(summary(fit1))[3:6,4], length(fit1#optinfo[["conv"]][["lme4"]][["messages"]]))
#Collecting data, including whether there are any convergence error messages
}
start_time = Sys.time()
model1 = foreach(i = 1:n, .packages = c("tcltk", "lme4"), .combine = rbind) %dopar% {
if(!exists("pb")) pb <- tkProgressBar("Parallel task", min=1, max=n)
setTkprogressBar(pb, i)
#This is some code I found here to keep track of my progress
res_function(i)
}
end_time = Sys.time()
end_time - start_time
stopCluster(clust)
showConnections()
I've run nearly identical code in the past and it took me only about 13 minutes. However, I suspect that this model is taking up more memory than usual on each core (likely due to the second level) and slowing things down. I've read that BiocParallel, Future, or even Microsoft R Open might work better, but I haven't had much success using any of them (likely due to my own lack of know how). I've also read a bit about the package "bigmemory" to more efficiently use the large matrix across cores, but I ran into several errors when I tried to use it (failed workers and such). I'm also curious about the potential of using my GPU (a Titan X Pascal) for some additional umph if anyone knows more about this.
Any advice would be very appreciated!

'Invalid parent values' error when running JAGS from R

I am running a simple generalized linear model, calling JAGS from R. The model is negatively binomially distributed. The model is being fitted to data on counts of fish, with the majority of individual counts ('C' in the data set below) being zeros.
I initially ran the model with one covariate, temperature ('Temp'). About half of the time the model ran and the other half of the time the model gave me the error, 'Error in node C[###] Invalid parent values.' The value for C[###] in the error message changes with each successive attempt to run the model.
Since my success at running the model was inconsistent, I tried adding another covariate, salinity ('Salt'). Then the model would not run at all, with the same error message as above.
Any ideas or suggestions on the source of the error are greatly appreciated.
I am suspecting that the initial values for the dispersion parameter, r, may be the issue. Ideally I add several more covariates into model fitting if this error can be addressed.
The data set and code are immediately below. For sake of getting the data to load properly on this website, I have omitted 662 of the 672 total values; even with the reduced data set (n = 10 instead of n = 672) the problem remains.
Thank you.
setwd("C:/Users/John/Desktop")
library('coda')
library('rjags')
library('R2jags')
set.seed(1000000000)
#data
n=10
C=c(0,0,0,0,0,1,0,0,0,1)
Temp=c(0,29.3,25.3,28.7,28.7,24.4,25.1,25.1,24.2,23.3)
Salt=c(6,6,0,6,6,0,12,12,6,12)
sink("My Model.txt")
cat("
model {
r~dunif(0,10)
beta0~dunif (-20,20)
beta1~dunif (-20,20)
beta2~dunif (-20,20)
for (i in 1:n) {
C[i] ~ dnegbin(p[i], r)
p[i] <- r/(r+lambda[i])
log(lambda[i]) <- mu[i]
mu[i] <- beta0 + beta1*Temp[i] + beta2*Salt[i]
}
}
", fill=TRUE)
sink()
n=n
C=C
Temp=Temp
Salt=Salt
#bundle data
bugs.data = list(
"n",
"C",
"Temp",
"Salt")
#parameters to monitor
params<-c(
"r",
"beta0",
"beta1",
"beta2")
#initial values
inits <- function(){list(
r=floor(runif(1,0,5)),
beta0=runif(1,-5,5),
beta1=runif(1,-5,5),
beta2=runif(1,-5,5))}
model.file <- 'My Model.txt'
jagsfit <- jags(data=bugs.data, inits=inits, params, n.iter=1000, n.thin=10, n.burnin=100, model.file)
print(jagsfit, digits=5)
This works fine for me most of the time, but it would fail with the error you describe if the inits function samples a value of r of 0 - which you have made more likely by using floor() in the inits function (not sure why you did that - r is not restricted to integers but is strictly positive). Also, every time you run the model you will get different initial values (unless setting a random seed in R) which is making your life more complicated that it needs to be. I generally recommend picking fixed (and probably over dispersed) initial values, such as r=0.01 and r=10 for the two chains in your example.
However, JAGS picks usable initial values for this model as you can see by not providing your own inits e.g.:
library('runjags')
listdata <- lapply(bugs.data, get)
names(listdata) <- unlist(bugs.data)
run.jags(model.file, params, listdata)
I would also have a think about the prior you are using for r - it could well be that this will have a bigger effect on your posterior than intended. Another (not necessarily better) option is something like a gamma prior.
Matt

e1071 Package: naiveBayes prediction is slow

I am trying to run the naiveBayes classifier from the R package e1071. I am running into an issue where the time it takes to predict takes longer than the time it takes to train, by a factor of ~300.
I was wondering if anyone else has observed this behavior and, if so, if you have any suggestions on how to improve it.
This issue appears only in some instances. Below, I have code that trains and predicts the NB classifier on the Iris dataset. Here the training and prediction times match up quite closely (prediction takes 10x longer instead of 300x longer). The only other trace of this issue that I could find online is here. In that instance, the answer was to make sure that categorical variables are formatted as factors. I have done this, but still don't see any improvement.
I have played around with the sample size N and the problem seems to be lessened as N decreases. Perhaps this is intended behavior of the algorithm? Decreasing N by a factor of 10 causes the prediction to be only 150x slower, but increasing by a factor of 10 yields a similar slowdown of 300x. These numbers seem crazy to me, especially because I've used this algorithm in the past on datasets with ~300,000 examples and found it to be quite fast. Something seems fishy but I can't figure out what.
I'm using R version 3.3.1 on Linux. The e1071 package is up-to-date (2015 release).
The code below should be reproducible on any machine. FYI my machine timed the Iris classification at 0.003s, the Iris prediction at 0.032s, the simulated data classification at 0.045s, and the resulting prediction at 15.205s. If you get different numbers than these, please let me know as it could be some issue on my local machine.
# Remove everything from the environment and clear out memory
rm(list = ls())
gc()
# Load required packages and datasets
require(e1071)
data(iris)
# Custom function: tic/toc function to time the execution
tic <- function(gcFirst = TRUE, type=c("elapsed", "user.self", "sys.self"))
{
type <- match.arg(type)
assign(".type", type, envir=baseenv())
if(gcFirst) gc(FALSE)
tic <- proc.time()[type]
assign(".tic", tic, envir=baseenv())
invisible(tic)
}
toc <- function()
{
type <- get(".type", envir=baseenv())
toc <- proc.time()[type]
tic <- get(".tic", envir=baseenv())
print(toc - tic)
invisible(toc)
}
# set seed for reproducibility
set.seed(12345)
#---------------------------------
# 1. Naive Bayes on Iris data
#---------------------------------
tic()
model.nb.iris <- naiveBayes(Species~Sepal.Length+Sepal.Width+Petal.Length+Petal.Width,data=iris)
toc()
tic()
pred.nb.iris <- predict(model.nb.iris, iris, type="raw")
toc()
#---------------------------------
# 2. Simulate data and reproduce NB error
#---------------------------------
# Hyperparameters
L <- 5 # no. of locations
N <- 1e4*L
# Data
married <- 1*(runif(N,0.0,1.0)>.45)
kids <- 1*(runif(N,0.0,1.0)<.22)
birthloc <- sample(1:L,N,TRUE)
major <- 1*(runif(N,0.0,1.0)>.4)
exper <- 15+4*rnorm(N)
exper[exper<0] <- 0
migShifter <- 2*runif(N,0.0,1.0)-1
occShifter <- 2*runif(N,0.0,1.0)-1
X <- data.frame(rep.int(1,N),birthloc,migShifter,occShifter,major,married,kids,exper,exper^2,exper^3)
colnames(X)[1] <- "constant"
rm(married)
rm(kids)
rm(birthloc)
rm(major)
rm(exper)
rm(occShifter)
# Parameters and errors
Gamma <- 15*matrix(runif(7*L), nrow=7, ncol=L)
eps <- matrix(rnorm(N*L, 0, 1), nrow=N, ncol=L)
# Deterministic portion of probabilities
u <- matrix(rep.int(0,N*L), nrow=N, ncol=L)
for (l in 1:L) {
u[ ,l] = (X$birthloc==l)*Gamma[1,l] +
X$major*Gamma[2,l] + X$married*Gamma[3,l]
X$kids*Gamma[4,l] + X$exper*Gamma[5,l]
X$occShifter*Gamma[6,l] + X$migShifter*X$married*Gamma[7,l]
eps[ ,l]
}
choice <- apply(u, 1, which.max)
# Add choice to data frame
dat <- cbind(choice,X)
# factorize categorical variables for estimation
dat$major <- as.factor(dat$major)
dat$married <- as.factor(dat$married)
dat$kids <- as.factor(dat$kids)
dat$birthloc <- as.factor(dat$birthloc)
dat$choice <- as.factor(dat$choice)
tic()
model.nb <- naiveBayes(choice~birthloc+major+married+kids+exper+occShifter+migShifter,data=dat,laplace=3)
toc()
tic()
pred.nb <- predict(model.nb, dat, type="raw")
toc()
I ran into the same problem. I needed to run naive bayes and predict a lot of times (1000's of times) on some big matrices (10000 rows, 1000-2000 cols). Since I had some time, I decided to implement my own implementation of naive bayes to make it a little faster:
https://cran.r-project.org/web/packages/fastNaiveBayes/index.html
I made some work out of this and created a package out of it: https://cran.r-project.org/web/packages/fastNaiveBayes/index.html. It is now around 330 times faster using a Bernoulli event model. Moreover, it implements a multinomial event model (even a bit faster) and a Gaussian model (slightly faster). Finally, a mixed model where it's possible to use different event models for different columns and combine them!
The reason e1071 is so slow in the predict function, is cause they use essentially a double for loop. There was already a pull request open from around beginning 2017 that at least vectorized one of these, but was not accepted yet.

Get randomForest regression faster in R

I have to make a regression with randomforest in R. My problem is that my dataframe is huge: I have 12 variables and more than 400k entries. When I try - the code is written in the bottom - to get a randomForest regression the system takes many hours to process the data: after 5, 6 hours of calculation, I am obliged to stop the operation without any output. Someone can suggests me how I can get it faster?
Thanks
library(caret)
library(randomForest)
dataset <- read.csv("/home/anonimo/Modelli/total_merge.csv", header=TRUE)
dati <- data.frame(dataset)
attach(dati)
trainSet <- dati[2:107570,]
testSet <- dati[107570:480343,]
output.forest <- randomForest(dati$Clip_pm25 ~ dati$e_1 + dati$Clipped_so + dati$Clip_no2 + dati$t2m_1 + dati$tp_1 + dati$Clipped_nh + dati$Clipped_co + dati$Clipped_o3 + dati$ssrd_1 + dati$Clipped_no + dati$Clip_pm10 + dati$sp_1, data=trainSet, ntree=250)
I don't think to parallelize on a single PC (2-4 cores) is the answer. There are plenty of lower hanging fruits to pick.
1) RF models increase in complexity with number of training samples. The average tree depth would be something like log(480,000/5)/log(2) = 16.5 intermediary nodes. In the vast majority of examples 2000-10000 samples per tree is fine. If you competing to win on kaggle, a small extra performance really matters, as winner takes all. In practice, you probably don't need that.
2) Don't clone you data set in your R code and try to only keep one copy of your data set (pass by reference is of course fine). It's not a big problem for this data set, as the dataset is not that big (~38Mb) even for R.
3) Don't use formula interface with randomForest algorithm for large datasets. It will make an extra copy of the data set. But again memory is not that much of a problem.
4) Use a faster RF algorithm: extraTrees, ranger or Rborist are available for R. extraTrees is not exactly a RF algorithm but pretty close.
5) avoid categorical features with more than 10 categories. RF can handle up to 32, but becomes super slow as any 2^32 possible split has to be evaluated. extraTrees and Rborist handle more categories by only testing some random selected splits (which works fine). Another solution as in the python-sklearn every category are assigned a unique integer, and the feature is handled as numeric. You can convert your categorical features with as.numeric and before runing randomForest to do the same trick.
6) For much bigger data. Split the data set in random blocks and train a few(~10) trees on each. Combine forests or save forests separate. This will slightly increase the tree correlation. There are some nice cluster implementation to train like these. But won't be necessary for datasets below 1-100Gb, depending on tree complexity etc.
#below I use solution 1-3) and get a run time of some minutes
library(randomForest)
#simulate data
dataset <- data.frame(replicate(12,rnorm(400000)))
dataset$Clip_pm25 = dataset[,1]+dataset[,2]^2+dataset[,4]*dataset[,3]
#dati <- data.frame(dataset) #no need to keep the data set, an extra time in memory
#attach(dati) #if you attach dati you don't need to write data$Clip_pm25, just Clip_pm25
#but avoid formula interface for randomForest for large data sets because it cost extra memory and time
#split data in X and y manually
y = dataset$Clip_pm25
X = dataset[,names(dataset) != "Clip_pm25"]
rm(dataset);gc()
object.size(X) #38Mb, no problemo
#if you were using formula interface
#output.forest <- randomForest(dati$Clip_pm25 ~ dati$e_1 + dati$Clipped_so + dati$Clip_no2 + dati$t2m_1 + dati$tp_1 + dati$Clipped_nh + dati$Clipped_co + dati$Clipped_o3 + dati$ssrd_1 + dati$Clipped_no + dati$Clip_pm10 + dati$sp_1, data=trainSet, ntree=250)
#output.forest <- randomForest(dati$Clip_pm25 ~ ., ntree=250) # use dot to indicate all variables
#start small, and scale up slowly
rf = randomForest(X,y,sampsize=1000,ntree=5) #runtime ~15 seconds
print(rf) #~67% explained var
#you probably really don't need to exeed 5000-10000 samples per tree, you could grow 2000 trees to sample most of training set
rf = randomForest(X,y,sampsize=5000,ntree=500) # runtime ~5 minutes
print(rf) #~87% explained var
#regarding parallel
#here you could implement some parallel looping
#.... but is it really worth for a 2-4 x speedup?
#coding parallel on single PC is fun but rarely worth the effort
#If you work at some company or university with a descent computer cluster,
#then you can spawn the process across 20-80-200 nodes and get a ~10-60-150 x speedup
#I can recommend the BatchJobs package
Since you are using caret, you could use the method = "parRF". This is an implementation of parallel randomforest.
For example:
library(caret)
library(randomForest)
library(doParallel)
cores <- 3
cl <- makePSOCKcluster(cores)
registerDoParallel(cl)
dataset <- read.csv("/home/anonimo/Modelli/total_merge.csv", header=TRUE)
dati <- data.frame(dataset)
attach(dati)
trainSet <- dati[2:107570,]
testSet <- dati[107570:480343,]
# 3 times cross validation.
my_control <- trainControl(method = "cv", number = 3 )
my_forest <- train(Clip_pm25 ~ e_1 + Clipped_so + Clip_no2 + t2m_1 + tp_1 + Clipped_nh + Clipped_co + Clipped_o3 + ssrd_1 + Clipped_no + Clip_pm10 + sp_1, ,
data = trainSet,
method = "parRF",
ntree = 250,
trControl=my_control)
Here is a foreach implementation as well:
foreach_forest <- foreach(ntree=rep(250, cores),
.combine=combine,
.multicombine=TRUE,
.packages="randomForest") %dopar%
randomForest(Clip_pm25 ~ e_1 + Clipped_so + Clip_no2 + t2m_1 + tp_1 + Clipped_nh + Clipped_co + Clipped_o3 + ssrd_1 + Clipped_no + Clip_pm10 + sp_1,
data = trainSet, ntree=ntree)
# don't forget to stop the cluster
stopCluster(cl)
Remember I didn't set any seeds. You might want to consider this as well. And here is a link to a randomforest package that also runs in parallel. But I have not tested this.
The other two answers are good. Another option is to actually use more recent packages that are purpose-built for highly dimensional / high volume data sets. They run their code using lower-level languages (C++ and/or Java) and in certain cases use parallelization.
I'd recommend taking a look into these three:
ranger (uses C++ compiler)
randomForestSRC (uses C++ compiler)
h2o (Java compiler - needs Java version 8 or higher)
Also, some additional reading here to give you more to go off on which package to choose: https://arxiv.org/pdf/1508.04409.pdf
Page 8 shows benchmarks showing the performance improvement of ranger against randomForest against growing data size - ranger is WAY faster due to linear growth in runtime rather than non-linear for randomForest for rising tree/sample/split/feature sizes.
Good Luck!

Running PLSR predictions parallel in R using foreach

Users,
I am looking for a solution to "parallelize" my PLSR predictions in order to save pprocessing time. I was trying to use the "foreach" construct with "doPar" (cf. 2nd part of code below), but I was unable to allocate the predicted values as well as the model performance parameters (RMSEP) to the output variable.
The code:
set.seed(10000) # generate some data...
mat <- replicate(100, rnorm(100))
y <- as.matrix(mat[,1], drop=F)
x <- mat[,2:100]
eD <- dist(x, method = "euclidean") # distance matrix to find close samples
eDm <- as.matrix(eD)
kns <- matrix(NA,nrow(x),10) # empty matrix to allocate 10 closest samples
for (i in 1:nrow(eDm)) { # identify closest samples in a loop and allocate to kns
kns[i,] <- head(order(eDm[,i]), 11)[-1]
}
So far I consider the code as "safe", but the next part is challenging me, since I never used the "foreach" construct before:
library(pls)
library(foreach)
library(doParallel)
cl <- makeCluster(2)
registerDoParallel(cl)
out <- foreach(j = 1:nrow(mat), .combine="rbind", .packages="pls") %dopar% {
pls <- plsr(y ~ x, ncomp=5, validation="CV", , subset=kns[j,])
predict(pls, ncomp=5, newdata=x[j,,drop=F])
RMSEP(pls, estimate="CV")$val[1,1,5]
}
stopCluster(cl)
As I understand, the code line starting with "RMSEP(pls,..." is simply overwriting the previously written data from the "predict" code line. Somehow I was assuming the .combine option would take care of this?
Many thanks for your help!
Best, Chega
If you want to return two objects from the body of a foreach loop, you need to put them into an object such as a list:
out <- foreach(j = 1:nrow(mat), .packages="pls") %dopar% {
pls <- plsr(y ~ x, ncomp=5, validation="CV", , subset=kns[j,])
list(p=predict(pls, ncomp=5, newdata=x[j,,drop=F]),
r=RMSEP(pls, estimate="CV")$val[1,1,5])
}
Only the "final value" of the loop body is returned to the master and then processed by the .combine function.
Note that I removed the .combine argument so that the result will be a list of lists of length 2. It's not clear to me that rbind is the appropriate function to use to process the results.
Since this question was originally answered, the pls package has been modified to allow the cross-validation to be run in parallel. The implementation is trivially easy--simply a matter of defining either a persistent cluster, or the number of cores to use in a transient cluster, in pls.options.
If transient clusters are used, implementation literally requires only two lines of code:
library(parallel)
pls.options(parallel=NumberOfCoresToUse)
No changes to the output variables are needed.
I haven't checked whether parallelizing at the calibration level, as in the question, would be more efficient. I suspect it would be, particularly when the number of calibration iterations is much larger than the number of cross-validation steps (especially when the number of CVs isn't a multiple of the number of cores used), but this approach is so straightforward that the extra coding effort may not be worth it.

Resources