mclapply long vectors not supported yet - r

I'm trying to run some R code and it is crashing because of a memory issue. The error I get is:
Error in sendMaster(try(lapply(X = S, FUN = FUN, ...), silent = TRUE)) :
long vectors not supported yet: memory.c:3100
The function that causes the trouble is the following:
StationUserX <- function(userNDX){
  lat1 = deg2rad(geolocation$latitude[userNDX])
  long1 = deg2rad(geolocation$longitude[userNDX])
  session_user_id = as.character(geolocation$session_user_id[userNDX])
  # Find the closest station
  Distance2Stations <- unlist(lapply(stationNDXs, Distance2StationX, lat1, long1))
  # Return index for closest station and distance to closest station
  stations_userX = data.frame(session_user_id = session_user_id,
                              station = ghcndstations$ID[stationNDXs],
                              Distance2Station = Distance2Stations)
  stations_userX = stations_userX[with(stations_userX, order(Distance2Station)), ]
  stations_userX = stations_userX[1:100, ] # only the 100 closest stations...
  row.names(stations_userX) <- NULL
  return(stations_userX)
}
I run this function via mclapply 50k times, and each call to StationUserX calls Distance2StationX 90k times.
Is there an obvious way to optimize the function StationUserX?
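One possible direction (a minimal sketch, not from the thread): if ghcndstations also stores station coordinates (latitude/longitude columns are assumed here, they are not shown in the question), the per-station lapply can be replaced by one vectorized haversine computation over all stations.
StationUserX_vectorized <- function(userNDX) {
  lat1  <- deg2rad(geolocation$latitude[userNDX])
  long1 <- deg2rad(geolocation$longitude[userNDX])
  lat2  <- deg2rad(ghcndstations$latitude[stationNDXs])   # assumed column
  long2 <- deg2rad(ghcndstations$longitude[stationNDXs])  # assumed column
  # Haversine distance in km, computed for all stations at once
  R <- 6371
  d <- 2 * R * asin(sqrt(sin((lat2 - lat1) / 2)^2 +
                         cos(lat1) * cos(lat2) * sin((long2 - long1) / 2)^2))
  keep <- order(d)[1:100]  # indices of the 100 closest stations
  data.frame(session_user_id = as.character(geolocation$session_user_id[userNDX]),
             station = ghcndstations$ID[stationNDXs][keep],
             Distance2Station = d[keep])
}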

mclapply is having trouble sending all the data back from the worker processes to the main process. That's because of prescheduling: each worker runs a large batch of iterations and then sends all of its results back at once. That's nice and fast, but it means a single worker tries to send back more than 2 GB of serialized data, which it can't do.
Run mclapply with mc.preschedule = FALSE to turn off prescheduling. Each iteration then runs in its own forked process and returns its own (much smaller) result. It won't go quite as fast, but it gets around the problem.
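For example (a minimal sketch; userNDXs is an assumed name for the vector of 50k user indices, which the question doesn't show):
library(parallel)
# One fork per user index; each fork returns only its own 100-row data frame.
stations <- mclapply(userNDXs, StationUserX,
                     mc.cores = detectCores(),
                     mc.preschedule = FALSE)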

Try using nextElem() from the iterators package. It acts like a "generator" in Python, so you don't have to load the entire list into memory.
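A hedged sketch of that idea (again assuming userNDXs holds the user indices); nextElem() signals exhaustion by throwing a "StopIteration" error, which is caught here to end the loop:
library(iterators)
it <- iter(userNDXs)
results <- list()
repeat {
  ndx <- tryCatch(nextElem(it), error = function(e) NULL)  # NULL when exhausted
  if (is.null(ndx)) break
  results[[length(results) + 1]] <- StationUserX(ndx)
}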

Related

Parallel recursive function in R?

I've been wracking my brain around this problem all week and could really use an outside perspective. Basically, I've built a recursive tree function where the output of each node in one layer is used as the input for a node in the subsequent layer. I've put together a toy example here in which each call generates a large matrix, splits it into submatrices, and then passes those submatrices to subsequent calls. The key difference from similar questions on Stack Overflow is that each call of tree_search doesn't actually return anything; it just appends results to a CSV file.
Now I'd like to parallelize this function. However, when I run it with mclapply and mc.cores=2, the runtime increases! The same happens when I run it on a multicore cluster with mc.cores=12. What’s going on here? Are the parent nodes waiting for the child nodes to return some output? Does this have something to do with fork/socket parallelization?
For background, this is part of an algorithm that models gene activation in white blood cells in response to viral infection. I’m a biologist and self-taught programmer so I’m a little out of my depth here - any help or leads would be really appreciated!
# Load libraries.
library(data.table)
library(parallel)
# Recursive tree search function.
tree_search <- function(submx = NA, loop = 0) {
  # Terminate on the fifth loop.
  message(paste("Started loop", loop))
  if (loop == 5) {return(TRUE)}
  # Create a large matrix and do some operation.
  bigmx <- matrix(rnorm(10), 50000, 250)
  bigmx <- sin(bigmx^2)
  # Aggregate the matrix and save the output.
  agg <- colMeans(bigmx)
  append <- file.exists("output.csv")
  fwrite(t(agg), file = "output.csv", append = append, row.names = FALSE)
  # Split the matrix into submatrices with 100 columns each.
  ind <- ceiling(seq_len(ncol(bigmx)) / 100)
  lapply(unique(ind), function(i) {
    submx <- bigmx[, ind == i]
    # Pass each submatrix to the subsequent call.
    loop <- loop + 1
    tree_search(submx, loop) # submatrix is used to generate the big matrix in the subsequent call (not shown)
  })
}
# Initiate the tree search.
tree_search()
After a lot more brain wracking and experimentation, I ended up answering my own question. I’m not going to refer to the original example since I've changed up my approach quite a bit. Instead I’ll share some general observations that might help people in similar situations.
1.) For loops are more memory efficient than lapply and recursive functions
When you use lapply, each call of the anonymous function runs in its own environment, so assignments inside it modify a local copy rather than the variable in your calling environment. That's why you can do this:
x <- 5
lapply(1:10, function(i) {
  x <- x + 1
  x == 6 # TRUE
})
x == 5 # ALSO TRUE
At the end, x is still 5, which means each lapply call was manipulating its own local copy of x. That's not good if, say, x were actually a large data frame with 10,000 variables. A for loop, on the other hand, lets you overwrite the variable on each iteration.
x <- 5
for(i in 1:10) {x <- x + 1}
x == 5 # FALSE
2.) Parallelize once
Distributing tasks to different nodes takes a lot of computational overhead and can cancel out any gains you make from parallelizing your script. Therefore, you should use mclapply with discretion. In my case, that meant NOT putting mclapply inside a recursive function where it was getting called tens to hundreds of times. Instead, I split the starting point into 16 parts and ran 16 different tree searches on separate nodes.
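A hedged sketch of that "parallelize once" layout, expressed for a single multicore machine; start_points is a hypothetical list holding the 16 starting submatrices (not part of the original post):
library(parallel)
# One independent tree search per starting submatrix, parallelized exactly once.
results <- mclapply(start_points,
                    function(p) tree_search(p, loop = 1),
                    mc.cores = 16)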
3.) You can use mclapply to throttle memory usage
If you split a job into 10 parts and process them with mclapply and mc.preschedule = FALSE, each core will only process 10% of your job at a time. If mc.cores were set to two, for example, the other 8 parts would wait until one of the running parts finished before starting. This is useful if you are running into memory issues and want to prevent each loop from taking on more than it can handle.
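A minimal sketch of that throttling pattern, with parts and process_part as hypothetical placeholders (a list of 10 job chunks and the function that processes one chunk):
library(parallel)
out <- mclapply(parts, process_part,
                mc.cores = 2,            # only two parts are in flight at any time
                mc.preschedule = FALSE)  # each part runs in its own forked process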
Final Note
This is one of the more interesting problems I’ve worked on so far. However, recursive tree functions are complicated. Draw out the algorithm and force yourself to spend a few days away from your code so that you can come back with a fresh perspective.

Timing R Chunk in R MarkDown/Notebook (Equivalent of %%timeit in python)

I am looking for an elegant way of timing the execution of an R chunk, preferably running the chunk multiple times automatically in the background. (The magic function %%timeit in a Python notebook does exactly that.)
I know there are several ways of timing an R function or a chunk of R code, and there are a few SO questions on that as well. All the methods are described in this article too.
However, most of them do not replicate the R code, and the ones that can replicate (like system.time or microbenchmark) are geared toward a function rather than a chunk of code (or maybe I do not understand them right).
tictoc works pretty well for me, except that it gives the run time for only a single execution and has no option to run the code, say, 1000 times and average the run times (again, what %%timeit does).
With tictoc it is possible to record the timing in a loop and average the results.
library(tictoc)
tic.clearlog()
for (x in 1:10) {
  # Passing x to tic() makes its value become a label
  # at the time of the matching toc() call.
  tic(x)
  Sys.sleep(1)
  # When log = TRUE, toc() pushes the measured timing to a list;
  # quiet = TRUE prevents it from printing the timing.
  toc(log = TRUE, quiet = TRUE)
}
# Fetch the results of toc() as formatted text for printing.
log.txt <- tic.log(format = TRUE)
# Extract the list containing the measurements in raw format.
log.lst <- tic.log(format = FALSE)
# Since the data has been extracted, clear the tictoc log.
tic.clearlog()
# Convert the list elements to timings.
# Each element of the list has a start (tic) and an end (toc) timestamp.
timings <- unlist(lapply(log.lst, function(x) x$toc - x$tic))
# Compute the average loop time.
mean(timings)
# [1] 1.001
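Alternatively (my addition, not part of the original answer), microbenchmark, which the question mentions, will happily take a whole braced expression rather than just a single function call, which gets closer to %%timeit behaviour. A minimal sketch with a toy chunk body:
library(microbenchmark)
# Time the body of the chunk 100 times and summarise the runs.
res <- microbenchmark({
  x <- rnorm(1e5)
  sum(sin(x^2))
}, times = 100)
summary(res)  # min / median / mean / max across the 100 runs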

r: combining foreach results

I am running a parallel process in a foreach loop that returns a 7x30 matrix at the end of each iteration. When I run the loop using this command, it finishes in 11.5 minutes:
myData<-foreach(i=1:270000, .packages='quadprog')%dopar%{
Unfortunately, myData is a list and I want to plot the last two columns of every matrix within that list. So, I use this command to convert it to a data frame for ggplot2:
myData<-Reduce(rbind.data.frame, myData[1:length(myData)])
This command works well for a small myData but myData is 270,000 matrices long. It is either hanging up or taking a really long time to convert.
So, I try to run the loop using this command so that the output is a data frame in the first place:
myData<-foreach(i=1:270000, .combine=rbind.data.frame, .packages='quadprog')%dopar%{
This has been running for the last two hours (way longer than 11 minutes).
Is there a way to efficiently get the output from these loops and put it into a format where I can graph it?
Interestingly, when I look at the Windows Task Manager, the first call to the loop immediately sends all of my CPU core usage to 100%. The second one is closer to 10%, even though I set up the same number of clusters under doSnow.
Use rbindlist() from data.table instead of rbind.data.frame to combine your results, and add .multicombine = TRUE to speed it up:
library(data.table)
comb <- function(...) rbindlist(list(...))
myData <- foreach(i = 1:270000, .combine = comb, .multicombine = TRUE,
                  .packages = 'quadprog') %dopar% {
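An alternative worth trying (my addition, not the original answer): keep the default list output from foreach, which was the fast 11.5-minute run, and do a single rbindlist() at the end instead of hundreds of thousands of incremental rbinds:
library(data.table)
myData <- foreach(i = 1:270000, .packages = 'quadprog') %dopar% {
  # ... same loop body returning a 7x30 matrix ...
}
plotData <- rbindlist(lapply(myData, as.data.frame))  # one combine at the end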

foreach (foreach package) for parallel processing in R

I am calculating permutation test statistic using a for loop. I wish to speed this up using parallel processing (in particular, foreach in foreach package). I am following the instructions from:
https://beckmw.wordpress.com/2014/01/21/a-brief-foray-into-parallel-processing-with-r/
My original code:
library(foreach)
library(doParallel)
set.seed(10)
x = rnorm(1000)
y = rnorm(1000)
n = length(x)
nexp = 10000
perm.stat1 = numeric(nexp)
ptm = proc.time()
for (i in 1:nexp) {
  y = sample(y)
  perm.stat1[i] = cor(x, y, method = "pearson")
}
proc.time()-ptm
# 1.321 seconds
However, when I used a foreach loop, the result came back much more slowly:
cl<-makeCluster(8)
registerDoParallel(cl)
perm.stat2 = numeric(n)
ptm = proc.time()
perm.stat2 = foreach(icount(nexp), .combine = c) %dopar% {
  y = sample(y)
  cor(x, y, method = "pearson")
}
proc.time()-ptm
stopCluster(cl)
#3.884 seconds
Why is this happening? What did I do wrong?
Thanks
You're getting bad performance because you're splitting up a small problem into 10,000 tasks, each of which takes about an eighth of a millisecond to execute. It's alright to simply turn a for loop into a foreach loop when the body of the loop takes a significant period of time (I used to say at least 10 seconds, but I've dropped that to at least a second nowadays), but that simple strategy doesn't work when the tasks are very small (in this case, extremely small). When the tasks are small you spend most of your time sending the tasks and receiving the results from workers. In other words, the communication overhead is greater than the computation time. Frankly, I'm amazed that you didn't get much worse performance.
To me, it doesn't really seem worthwhile to parallelize a problem that takes less than two seconds to execute, but you can actually get a speed up using foreach by chunking. That is, you split the problem into smaller chunks, usually giving one chunk to each worker. Here's an example:
nw <- getDoParWorkers()
perm.stat1 <-
  foreach(xnexp = idiv(nexp, chunks = nw), .combine = c) %dopar% {
    p = numeric(xnexp)
    for (i in 1:xnexp) {
      y = sample(y)
      p[i] = cor(x, y, method = "pearson")
    }
    p
  }
As you can see, the foreach loop is splitting the problem into chunks, and the body of that loop contains a modified version of the original sequential code, now operating on a fraction of the entire problem.
On my four core Mac laptop, this executes in 0.447 seconds, compared to 1.245 seconds for the sequential version. That seems like a very respectable speed up to me.
There's a lot more computational overhead in the foreach loop. This returns a list containing each execution of the loop's body that is then combined into a vector via the .combine=c argument. The for loop does not return anything, instead assigning a value to perm.stat1 as a side effect, so does not need any extra overhead.
Have a look at Why is foreach() %do% sometimes slower than for? for a more in-depth explanation of why foreach is slower than for in many cases. Where foreach comes into its own is when the operations inside the loop are computationally intensive, making the time penalty associated with returning each value in a list insignificant by comparison. For example, the combination of rnorm and summary used in the Wordpress article above.

How do I time out a lapply when a list item fails or takes too long?

For several efforts I'm involved in at the moment, I am running large datasets with numerous parameter combinations through a series of functions. The functions have a wrapper (so I can mclapply) for ease of operation on a cluster. However, I run into two major challenges.
a) My parameter combinations are large (think 20k to 100k). Sometimes particular combinations will fail (e.g., as a hypothetical scenario, survival is too high and mortality is too low, so the model never converges). It's difficult for me to suss out ahead of time exactly which combinations will fail (life would be easier if I could do that). But for now I have this type of setup:
failsafe <- failwith(NULL, my_wrapper_function)
# This is what I run
# Note that input_variables contains a list of variables in each list item
results <- mclapply(input_variables, failsafe, mc.cores = 72)
# On my local dual core mac, I can't do this so the equivalent would be:
results <- llply(input_variables, failsafe, .progress = 'text')
The skeleton for my wrapper function looks like this:
my_wrapper_function <- function(tlist) {
  run <- tryCatch(my_model(tlist$a, tlist$b, tlist$sA, tlist$Fec, m = NULL),
                  error = function(e) NULL)
  ...
  return(run)
}
Is this the most efficient approach? If for some reason a particular combination of variables crashes the model, I need it to return a NULL and carry on with the rest. However, I still find that this fails less than gracefully.
b) Sometimes a certain combination of inputs does not crash the model but takes too long to converge. I set a limit on the computation time on my cluster (say 6 hours) so I don't waste my resources on something that is stuck. How can I include a timeout such that if a function call takes more than x time on a single list item, it should move on? Calculating the time spent is trivial, but a function mid-simulation can't be interrupted to check the time, right?
Any ideas, solutions or tricks are appreciated!
You may well be able to manage graceful-exits-upon-timeout using a combination of tryCatch() and evalWithTimeout() from the R.utils package.
See also this post, which presents similar code and unpacks it in a bit more detail.
require(R.utils)
myFun <- function(x) {Sys.sleep(x); x^2}
## evalWithTimeout() times out evaluation after 3.1 seconds, and then
## tryCatch() handles the resulting error (of class "TimeoutException") with
## grace and aplomb.
myWrapperFunction <- function(i) {
  tryCatch(expr = evalWithTimeout(myFun(i), timeout = 3.1),
           TimeoutException = function(ex) "TimedOut")
}
sapply(1:5, myWrapperFunction)
# [1] "1" "4" "9" "TimedOut" "TimedOut"
