I am parallelizing a task in R using a foreach loop, with reproducible results via the %dorng% operator. The code is complex, and it throws an error that I have not been able to identify, even though I have run the same code with a regular for loop.
My fundamental question is: how do I debug a function inside a foreach loop, assuming that I have reproducible results? Below is my current attempt.
The doRNG vignette says that a sequence of random seeds is generated and set at the beginning of each iteration using R's "L'Ecuyer-CMRG" random number generator. The sequence of random seeds can be fixed by calling set.seed before the foreach loop:
library(doRNG)
library(doParallel)
library(foreach)
registerDoParallel(2)
set.seed(1234)
f01 <- foreach(i = 1:100, .combine = 'c') %dorng% {
out <- 1 + i
}
r01 <- attr(f01, "rng")
set.seed(1234)
f02 <- foreach(i = 1:100, .combine = 'c') %dorng% {
out <- 1 + i
}
r02 <- attr(f02, "rng")
The objects r01 and r02 contain the sequence of seeds used in f01 and f02,
identical(f01, f02)
identical(r01, r02)
such that the results from the foreach loop and their seeds are identical as expected!
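For intuition about where per-iteration seeds can come from: L'Ecuyer-CMRG supports independent streams, and base R's parallel package provides nextRNGStream() to step from one stream state to the next. The sketch below builds three stream states by hand. It only illustrates the idea; it is not claimed to match doRNG's internals or to reproduce r01. The names kind_before, state and streams are made up for the example.

```r
library(parallel)

kind_before <- RNGkind()            # remember the current generator
RNGkind("L'Ecuyer-CMRG")
set.seed(1234)
state <- get(".Random.seed", envir = globalenv())

# Collect three independent stream states, one per hypothetical iteration
streams <- vector("list", 3)
for (k in seq_along(streams)) {
  streams[[k]] <- state
  state <- nextRNGStream(state)
}

do.call(RNGkind, as.list(kind_before))  # restore the original generator
str(streams)
```

Each element is an integer vector of length 7, the same shape as the entries of r01.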
Then, let's consider the case when the foreach loop will give me a random error:
set.seed(1234)
f03 <- foreach(i = 1:100, .combine = 'c') %dorng% {
u <- floor(runif(1, 1, 101))
if (i == as.integer(u)){
out <- "a" + "b"
} else {
out <- 1 + i
}
}
Error in { : task 67 failed - "non-numeric argument to binary operator"
The error occurs at iteration 67 and it is very easy to understand the error. Unfortunately, it is not the same in my case.
I would like to be able to use debug and walk through my function to understand the error. To the best of my knowledge, I cannot use debug inside a foreach loop in R.
Then I thought about capturing the error in a regular for loop, but my code runs very slowly that way and, apparently, the error does not show up with a small number of iterations. I need to understand the error using foreach.
Although I can't recover the sequence of seeds from f03, I know that they will be identical to r01 or r02. For iteration 67, I have
r01[[67]]
[1] 10407 1484283582 -741709185 513087691 132931819
[6] 1318506528 -1383054295
Therefore, I guess that fixing my seed at r01[[67]] should give me the same error:
i <- 67
set.seed(r01[[67]], kind = "L'Ecuyer-CMRG")
u <- floor(runif(1, 1, 101))
if (i == as.integer(u)){
out <- "a" + "b"
} else {
out <- 1 + i
}
u
[1] 74
which is not the case: u is 74, not 67, so the error is not reproduced.
In the doRNG vignette, page 6, they have an example of using a seed in a loop from a previous loop:
set.seed(1234)
ex01 <- foreach(i=1:5) %dorng% { runif(3) }
ex02 <- foreach(i=1:5, .options.RNG=attr(ex01, 'rng')[[2]]) %dorng% { runif(3) }
identical(ex02[1:4], ex01[2:5])
What am I missing?
I think the issue is that doRNG gives you the full random seed state (an integer vector), and you're passing that to set.seed, which expects a single integer. I expect set.seed only uses the first integer it is given to set the random seed state. Instead, you should assign the seed state to .Random.seed directly. First I get the seed states and verify I can reproduce the error:
library(doRNG)
library(doParallel)
library(foreach)
registerDoParallel(2)
kind <- RNGkind()
kind
[1] "Mersenne-Twister" "Inversion" "Rejection"
set.seed(1234, kind = kind[1]) #explicitly set RNG kind so I can reproduce more easily if I run the code multiple times
f01 <- foreach(i = 1:100, .combine = 'c') %dorng% {
out <- 1 + i
}
r01 <- attr(f01, "rng")
set.seed(1234, kind = kind[1])
f03 <- foreach(i = 1:100, .combine = 'c') %dorng% {
u <- floor(runif(1, 1, 101))
if (i == as.integer(u)){
out <- "a" + "b"
} else {
out <- 1 + i
}
}
Error in { : task 39 failed - "non-numeric argument to binary operator"
I got a different iteration number for the failure. Not sure why but I still get the error as you did. Then I see if we can replicate by setting .Random.seed directly:
i <- 39
RNGkind("L'Ecuyer-CMRG")
.Random.seed <- r01[[39]]
u <- floor(runif(1, 1, 101))
if (i == as.integer(u)){
out <- "a" + "b"
} else {
out <- 1 + i
}
Error in "a" + "b" : non-numeric argument to binary operator
u
[1] 39
Looks like we can! Full code below:
library(doRNG)
library(doParallel)
library(foreach)
registerDoParallel(2)
kind <- RNGkind()
kind
set.seed(1234, kind = kind[1]) #explicitly set RNG kind so I can reproduce more easily if I run the code multiple times
f01 <- foreach(i = 1:100, .combine = 'c') %dorng% {
out <- 1 + i
}
r01 <- attr(f01, "rng")
set.seed(1234, kind = kind[1])
f02 <- foreach(i = 1:100, .combine = 'c') %dorng% {
out <- 1 + i
}
r02 <- attr(f02, "rng")
identical(f01, f02)
identical(r01, r02)
set.seed(1234, kind = kind[1])
f03 <- foreach(i = 1:100, .combine = 'c') %dorng% {
u <- floor(runif(1, 1, 101))
if (i == as.integer(u)){
out <- "a" + "b"
} else {
out <- 1 + i
}
}
# f03 stops with an error, so there is no r03 to compare against r01
i <- 39
RNGkind("L'Ecuyer-CMRG")
.Random.seed <- r01[[39]]
u <- floor(runif(1, 1, 101))
if (i == as.integer(u)){
out <- "a" + "b"
} else {
out <- 1 + i
}
u
#restore RNG kind to original
RNGkind(kind[1])
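To step through the failing iteration with debug(), the same trick can be wrapped in a small helper. This is a minimal sketch using base R only; replay_task is a hypothetical name, and demo_state stands in for an entry such as r01[[39]]. Inside a function, a plain `.Random.seed <- ...` would only create a local variable, so the sketch uses assign() on the global environment.

```r
# Hypothetical helper: run task() in the main session under a saved seed state
replay_task <- function(seed_state, task) {
  old_kind <- RNGkind()
  had_seed <- exists(".Random.seed", envir = globalenv(), inherits = FALSE)
  old_seed <- if (had_seed) get(".Random.seed", envir = globalenv()) else NULL
  on.exit({
    # restore generator and seed even if task() throws an error
    do.call(RNGkind, as.list(old_kind))
    if (had_seed) assign(".Random.seed", old_seed, envir = globalenv())
  }, add = TRUE)
  RNGkind("L'Ecuyer-CMRG")
  assign(".Random.seed", seed_state, envir = globalenv())
  task()
}

# Build a stand-in L'Ecuyer-CMRG state (in practice, use r01[[39]])
kind_before <- RNGkind()
RNGkind("L'Ecuyer-CMRG")
set.seed(42)
demo_state <- get(".Random.seed", envir = globalenv())
do.call(RNGkind, as.list(kind_before))

u1 <- replay_task(demo_state, function() runif(1))
u2 <- replay_task(demo_state, function() runif(1))
identical(u1, u2)  # same state, same draw
```

With the objects from above, the call would look something like replay_task(r01[[39]], function() my_function(39)) after debug(my_function), where my_function is a placeholder for the real loop body.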
I'm new to parallelizing for loops with foreach and struggle to understand how it works. As an exercise, I created a simple list (input2) from a data frame (input). I try to calculate b by looping over h and j.
library(doParallel)
library(foreach)
library(dplyr)
input <- data.frame(matrix(rnorm(200*200, 0, .5), ncol=200))
input[input <=0] =0
input['X201'] <- seq(from = 0, to = 20, length.out = 10)
input <- input %>% select(c(X201, 1:200))
input2 <- split(input, f= input$X201)
a = 0
b= 0
cl <- parallel::makeCluster(20)
doParallel::registerDoParallel(cl)
tm1 <- system.time(
y <-
foreach (h = length(input2),.combine = 'cbind') %:%
foreach (j = nrow(input2[[h]]),.combine = 'c',packages = 'foreach') %dopar%{
a = input2[[h]][j,3]
b = b + a
}
)
parallel::stopCluster(cl)
registerDoSEQ()
print("Cluster stopped.")
y is about 0.55 (the exact value depends on the random numbers generated), which is the value of input2[[10]][20,3], not the cumulative value I wanted. I checked the manual of the foreach package but am still not sure I fully understand how foreach works.
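foreach returns results rather than letting variables outside the loop be modified, so don't expect a and b to be updated. Also note that h = length(input2) and j = nrow(input2[[h]]) each pass a single value, so only one iteration runs.
Try the following: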
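cl <- parallel::makeCluster(20)
doParallel::registerDoParallel(cl)
# Braces keep both statements in the timed expression; a second
# argument to system.time() would be taken as gcFirst instead.
tm2 <- system.time({
  # each task returns the sum of column 3 for one group
  results <- foreach(h = seq_along(input2), .combine = "c") %dopar% {
    sum(input2[[h]][, 3])
  }
  b <- sum(results)
})
parallel::stopCluster(cl)
registerDoSEQ()
b
tm2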
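If only the grand total is needed, the partial results can also be added up by foreach itself through .combine. A minimal sketch, assuming a toy list named groups in place of input2, and using %do% so it runs without a cluster (swap in %dopar% once a backend is registered):

```r
library(foreach)

# Hypothetical stand-in for the per-group data in input2
groups <- list(c(1, 2), c(3, 4), 5)

# Each iteration returns one partial sum; .combine = "+" adds them up
total <- foreach(g = groups, .combine = "+") %do% {
  sum(g)
}
total
```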
I need to stop the parallel loop on a condition, e.g., when res < 1. An MWE is given below:
library(foreach)
library(doParallel)
I <- 1000
L <- 1000
res <- Inf
cores <- detectCores()
cluster <- makeCluster(cores)
registerDoParallel(cluster)
out <- foreach(l = 1:I,.packages = "cec2013") %dopar% {
for(i in 1:I){
res <- 100/i
}
out <- res
out
}
out
My attempt at a solution is
out <- foreach(l = 1:I,.packages = "cec2013")%:%when(res < 1) %dopar% {
for(i in 1:I){
res <- 100/i
}
out <- res
out
}
out
But out is returned as an empty list.
The reason this happens is because when(res < 1) looks in the calling environment, not the environment inside the loop. So you defined res <- Inf which is never less than 1.
Also, your foreach loop always returns 0.1: the inner loop always runs to i = I = 1000, so res ends up as 100/1000 = 0.1. It's unclear what you're trying to do, or when you expect the loop to stop.
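For comparison, here is a small sketch of when() with a condition that depends on the loop variable, which it can see. The expression 100 / l is just an illustrative stand-in for whatever res should be. Note that this filters iterations; it does not stop the loop early.

```r
library(foreach)

# when() is evaluated per iteration with l available,
# so only iterations with 100 / l < 1 are run and kept
kept <- foreach(l = 1:200, .combine = c) %:% when(100 / l < 1) %do% {
  100 / l
}
length(kept)
range(kept)
```

Replacing %do% with %dopar% should behave the same way once a parallel backend is registered.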
I have set up a Metropolis-Hastings algorithm, and now I am trying to run the algorithm using parallel computing. I have set up a single-chain function
library(parallel)
library(foreach)
library(mvtnorm)
library(doParallel)
n<-100
mX <- 1:n
vY <- rnorm(n)
chains <- 4
iter <- n
p <- 2
#Loglikelihood
post <- function(y, theta) dmvnorm(t(y), rep(0,length(y)), theta[1]*exp(- abs(matrix(rep(mX,n),n) - matrix(rep(mX,each=n),n))/theta[2]),log=TRUE)
geninits <- function() list(theta = runif(p, 0, 1))
dist <- 0.01
jump <- function(x, dist) exp(log(x) + rmvnorm(1,rep(0,p),diag(rep(dist,p))))
MCsingle <- function(){ # This is part of a larger function, so no input are needed
inits <- geninits()
theta.post <- matrix(NA,nrow=p,ncol=iter)
for (i in 1:p) theta.post[i,1] <- inits$theta[i]
for (t in 2:iter){
theta_star <- c(jump(theta.post[, t-1],dist))
pstar <- post(vY, theta = theta_star) # post is the loglikelihood using dmvnorm.
pprev <- post(vY, theta = theta.post[,t-1])
r <- min(exp(pstar - pprev) , 1)
accept <- rbinom(1, 1, prob = r)
if (accept == 1){
theta.post[, t] <- theta_star
} else {
theta.post[, t] <- theta.post[, t-1]
}
}
return(theta.post)
}
which returns a p x iter matrix, with p parameters and iter iterations.
cl<-makeCluster(4)
registerDoParallel(cl)
posterior <- foreach(c = 1:chains) %dopar% {
MCsingle() }
UPDATE: When I tried to simplify the problem the code suddenly seemed to work. Even though I purposely tried to make errors, the code ran perfectly and the results were as wanted. So for others with similar problems unfortunately I cannot give an answer.
A follow-up question:
My initial purpose was to build an entire function, such that
MCmulti <- function(mX,vY,iter,chains){
posterior <- foreach(c = 1:chains) %dopar% {
MCsingle() }
return(posterior)
}
but the foreach loop does not seem to find all the required functions, e.g.:
Error in FUN() : task 1 failed - "could not find function "geninits""
Can anybody answer how to implement custom functions inside a foreach loop? Am I to input it as MCmulti <- function(FUN,...) FUN() and call MCmulti(MCsingle,...) ?
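One thing that may help here is foreach's .export argument, which names objects to send to the workers (and .packages for packages such as mvtnorm). Below is a minimal sketch with toy stand-ins: helper plays the role of geninits, single the role of MCsingle, and multi the role of MCmulti. All three names are hypothetical, and the sketch assumes a doParallel cluster with 2 workers.

```r
library(foreach)
library(doParallel)

helper <- function() runif(1)        # stand-in for geninits()
single <- function() helper() + 1    # stand-in for MCsingle()

multi <- function(chains) {
  # Name every global function the workers will need, directly or indirectly
  foreach(k = 1:chains, .export = c("single", "helper")) %dopar% {
    single()
  }
}

cl <- makeCluster(2)
registerDoParallel(cl)
out <- multi(2)
stopCluster(cl)
registerDoSEQ()
str(out)
```

For the real code, the export list would presumably include geninits, jump, post and any data they use (mX, vY, n, p, iter, dist), plus .packages = "mvtnorm".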
I'm trying to use doMC with foreach and %dopar%. Here is the function:
doTheMath_MC <- function(st, end, nd) {
print(getDoParWorkers())
if (st > end) stop("end must be larger than st")
# Helper function from stackoverflow.com/a/23158178/633251
tr <- function(x, prec = 0) trunc(x * 10^prec) / 10^prec
# Function to use with foreach
fef <- function(i, j, num, trpi) {
if (num[j] >= num[i]) return(NULL)
val <- num[i]/num[j]
if (!tr(val, nd) == trpi) return(NULL)
return(c(i, j, tr(val, nd)))
}
# Here we go...
nd <- nd - 1
trpi <- tr(pi, nd)
num <- st:end
ni <- length(num)
ans <- foreach(i = 1:ni, .combine = rbind) %:%
foreach(j = 1:ni, .combine = rbind) %dopar% {
fef(i, j, num, trpi)
}
cat("Done computing", paste("EST", st, end, nd+1, sep = "_"), "\n")
if (is.null(ans)) return(NULL)
ans <- as.matrix(na.omit(ans)) # probably not needed in MC version
return(ans) # c("num", "den", "est", "eff")
}
I've previously set up the cores and another function calls the function above (this info posted below, I don't think it is the problem). getDoParWorkers() reports that 7 cores have been assigned as intended. The cat statement verifies that the 2 'loops' are working correctly as far as output goes. However, only 1 core is used. Anyone see why? Mac OSX 10.10.2 and R 3.2 (2015-03-15 r67992). Finally, using doParallel to control everything gives the same result.
The steps which set up everything:
mn <- 1
mx <- 10000
jmp <- 1000
mc <- TRUE
if (mc) {
require("doMC")
registerDoMC(7)
}
st <- seq(mn -1, mx - jmp, jmp) + 1
end <- seq(mn - 1 + jmp, mx, jmp)
nd <- rep(1:15, each = mx/jmp) # watch the recycling
df <- data.frame(st = st, end = end, nd = nd)
for (i in 1:nrow(df)) {
findEsts(df$st[i], df$end[i], df$nd[i], MC = mc)
}
Sorry to answer my own question! I changed the dopar handling so that only the outer loop is made parallel:
ans <- foreach(i = 1:ni, .combine = rbind) %dopar%
for (j in 1:ni) {
fef(i, j, num, trpi)
}
Also, I was simply not asking for enough iterations. For testing, I had been using mx = 10000 and jmp = 1000 (see original question). Apparently these were not large enough to trigger parallel processing; increasing each 10x was necessary to get it going. Thanks to the commenters!
NOTE: While the code above activates the parallel processing, it does not return the answer correctly. That will be the subject of another question.
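A possible reason for the wrong result: a for loop in R always evaluates to NULL, so each task in the version above returns NULL, whatever fef() computes. One way around this is to collect the inner results with lapply and bind them before returning. A minimal sketch, where pair_value is a hypothetical stand-in for fef(), and %do% is used so it runs without a backend:

```r
library(foreach)

# Hypothetical stand-in for fef(): returns a row or NULL
pair_value <- function(i, j) if (j < i) c(i, j) else NULL

ni <- 3
ans <- foreach(i = 1:ni, .combine = rbind) %do% {
  # lapply returns its results, unlike for; NULL entries drop out in rbind
  rows <- lapply(1:ni, function(j) pair_value(i, j))
  do.call(rbind, rows)
}
dim(ans)
```

With doMC registered, %do% can be replaced by %dopar% to parallelize the outer loop only.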
%dopar% forks the main R process into several independent sub-processes. Is there a way to make these sub-processes communicate with the main R process, so that data can be 'recovered' ?
require(foreach)
require(doMC)
registerDoMC()
options(cores = 2 )
a <- c(0,0)
foreach(i = 1:2 ) %do% {
a[i] <- i
}
print(a) # returns 1 2
a <- c(0,0)
foreach(i = 1:2 ) %dopar% {
a[i] <- i
}
print(a) # returns 0 0
Thanks!
You should read the foreach documentation:
The foreach and %do%/%dopar% operators provide a looping construct
that can be viewed as a hybrid of the standard for loop and lapply
function. It looks similar to the for loop, and it evaluates an
expression, rather than a function (as in lapply), but its purpose is
to return a value (a list, by default), rather than to cause
side-effects.
Try this:
a <- foreach(i = 1:2 ) %dopar% {
i
}
print(unlist(a))
If you want your result to be a dataframe, you could do:
library(data.table)
result <- foreach(i = 1:2) %dopar% {
i
}
result.df <- rbindlist(Map(as.data.frame, result))
Thanks to Karl, I now understand the purpose of '.combine'
a <- foreach(i = 1:2 , .combine=c) %dopar% {
return(i)
}
print(a) # returns 1 2