I have a strange case where using foreach with a serial and a parallel backend gives different results the first time I call my function, but on later calls the two results match. I used doRNG to make the results reproducible for a given seed.
Below is a sample function that illustrates the scenario:
func <- function(ncores = NULL, seed = 1234){
if (!is.null(ncores)){ # this block registers for parallel backend
cl <- makeCluster(ncores)
registerDoParallel(cl)
registerDoRNG(seed, once = TRUE)
on.exit(stopCluster(cl))
} else { # this block registers for serial computation
registerDoSEQ()
registerDoRNG(seed, once = TRUE)
}
w = foreach(i = 1:10, .combine = 'c') %dorng% {
mean(sample(1:100, 50, replace = TRUE))
}
attr(w, "rng") <- NULL
return(w)
}
# first time running below 2 lines
# case 1 : serial
w1 <- func(ncores = NULL)
# Case 2 : parallel
w2 <- func(ncores= 5)
identical(w1, w2)
# second time running below 2 lines
# case 1: serial
w3 <- func(ncores = NULL)
# case 2: parallel
w4 <- func(ncores= 5)
identical(w1, w2)
# [1] FALSE
identical(w3, w4)
# [1] TRUE
Am I missing something when registering the sequential backend?
The solution is to use the following expression:
w = foreach(i = 1:10, .combine = 'c', .options.RNG=seed) %dorng% {
mean(sample(1:100, 50, replace = TRUE))}
You can find an explanation in the doRNG vignette.
So your function looks like this:
func <- function(ncores = NULL, seed = 1234){
if (!is.null(ncores)){ # this block registers for parallel backend
cl <- makeCluster(ncores)
registerDoParallel(cl)
on.exit(stopCluster(cl))
} else { # this block registers for serial computation
registerDoSEQ()
}
w = foreach(i = 1:10, .combine = 'c', .options.RNG=seed) %dorng% {
mean(sample(1:100, 50, replace = TRUE))
}
attr(w, "rng") <- NULL
return(w)
}
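A quick way to check this without starting a cluster is to run the same loop twice on the sequential backend. This is only a sketch: it assumes foreach and doRNG are installed, and draw_means is a hypothetical helper name, not part of either package.

```r
library(foreach)
library(doRNG)

registerDoSEQ()  # sequential backend, so no cluster is needed for this check

# draw_means is a hypothetical helper: the seed is passed per loop
# through .options.RNG instead of through registerDoRNG().
draw_means <- function(seed) {
  res <- foreach(i = 1:5, .combine = "c", .options.RNG = seed) %dorng% {
    mean(sample(1:100, 50, replace = TRUE))
  }
  attr(res, "rng") <- NULL  # drop the stored seed sequence before comparing
  res
}

run_a <- draw_means(1234)
run_b <- draw_means(1234)
identical(run_a, run_b)
```

Because the seed travels with the foreach call itself, the result should not depend on what was registered or run earlier in the session.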
I am parallelizing a task in R using a foreach loop, with reproducible results thanks to the %dorng% operator. The code is complex, and it raises an error that I have not been able to identify, even though I have run the same code with a regular for loop.
My fundamental question is: how do I debug a function inside a foreach loop, assuming that I have reproducible results? Below is my current attempt.
In the vignette of the doRNG package, it says that a sequence of random seeds will be generated and set at the beginning of each iteration using the R number generator "L’Ecuyer-CMRG". The sequence of random seeds can be defined using set.seed before the foreach loop:
library(doRNG)
library(doParallel)
library(foreach)
registerDoParallel(2)
set.seed(1234)
f01 <- foreach(i = 1:100, .combine = 'c') %dorng% {
out <- 1 + i
}
r01 <- attr(f01, "rng")
set.seed(1234)
f02 <- foreach(i = 1:100, .combine = 'c') %dorng% {
out <- 1 + i
}
r02 <- attr(f02, "rng")
The objects r01 and r02 contain the sequence of seeds used in f01 and f02,
identical(f01, f02)
identical(r01, r02)
such that the results from the foreach loop and their seeds are identical as expected!
Then, let's consider the case when the foreach loop will give me a random error:
set.seed(1234)
f03 <- foreach(i = 1:100, .combine = 'c') %dorng% {
u <- floor(runif(1, 1, 101))
if (i == as.integer(u)){
out <- "a" + "b"
} else {
out <- 1 + i
}
}
Error in { : task 67 failed - "non-numeric argument to binary operator"
The error occurs at iteration 67 and it is very easy to understand the error. Unfortunately, it is not the same in my case.
I would like to be able to use debug and walk through my function to understand the error. To the best of my knowledge, I cannot use debug inside a foreach loop in R.
Then, I thought about capturing the error in a regular for loop, but running my code is very slow and, apparently, I cannot observe the error with a low number of iterations. I need to understand the error using foreach.
Although I can't recover the sequence of seeds from f03, I know that they will be identical to r01 or r02. For iteration 67, I have
r01[[67]]
[1] 10407 1484283582 -741709185 513087691 132931819
[6] 1318506528 -1383054295
Therefore, I guess that fixing my seed at r01[[67]] should give me the same error:
i <- 67
set.seed(r01[[67]], kind = "L'Ecuyer-CMRG")
u <- floor(runif(1, 1, 101))
if (i == as.integer(u)){
out <- "a" + "b"
} else {
out <- 1 + i
}
u
[1] 74
which is not true.
In the doRNG vignette, page 6, they have an example of using a seed in a loop from a previous loop:
set.seed(1234)
ex01 <- foreach(i=1:5) %dorng% { runif(3) }
ex02 <- foreach(i=1:5, .options.RNG=attr(ex01, 'rng')[[2]]) %dorng% { runif(3) }
identical(ex02[1:4], ex01[2:5])
What am I missing?
I think the issue comes from the fact that doRNG gives you the random seed state and you're using that as the input of set.seed, which requires just an integer. I expect set.seed is only taking the first integer provided to the function to set the random seed state. Instead, what you should do is set the seed state in R. First I get the seed states and verify I can reproduce the error:
library(doRNG)
library(doParallel)
library(foreach)
registerDoParallel(2)
kind <- RNGkind()
kind
[1] "Mersenne-Twister" "Inversion" "Rejection"
set.seed(1234, kind = kind[1]) #explicitly set RNG kind so I can reproduce more easily if I run the code multiple times
f01 <- foreach(i = 1:100, .combine = 'c') %dorng% {
out <- 1 + i
}
r01 <- attr(f01, "rng")
set.seed(1234, kind = kind[1])
f03 <- foreach(i = 1:100, .combine = 'c') %dorng% {
u <- floor(runif(1, 1, 101))
if (i == as.integer(u)){
out <- "a" + "b"
} else {
out <- 1 + i
}
}
Error in { : task 39 failed - "non-numeric argument to binary operator"
I got a different iteration number for the failure. Not sure why but I still get the error as you did. Then I see if we can replicate by setting .Random.seed directly:
i <- 39
RNGkind("L'Ecuyer-CMRG")
.Random.seed <- r01[[39]]
u <- floor(runif(1, 1, 101))
if (i == as.integer(u)){
out <- "a" + "b"
} else {
out <- 1 + i
}
Error in "a" + "b" : non-numeric argument to binary operator
u
[1] 39
Looks like we can! Full code below:
library(doRNG)
library(doParallel)
library(foreach)
registerDoParallel(2)
kind <- RNGkind()
kind
set.seed(1234, kind = kind[1]) #explicitly set RNG kind so I can reproduce more easily if I run the code multiple times
f01 <- foreach(i = 1:100, .combine = 'c') %dorng% {
out <- 1 + i
}
r01 <- attr(f01, "rng")
set.seed(1234, kind = kind[1])
f02 <- foreach(i = 1:100, .combine = 'c') %dorng% {
out <- 1 + i
}
r02 <- attr(f02, "rng")
identical(f01, f02)
identical(r01, r02)
set.seed(1234, kind = kind[1])
f03 <- foreach(i = 1:100, .combine = 'c') %dorng% {
u <- floor(runif(1, 1, 101))
if (i == as.integer(u)){
out <- "a" + "b"
} else {
out <- 1 + i
}
}
# f03 stops with an error, so there is no r03 to extract; its seeds match r01 because the same seed was set
i <- 39
RNGkind("L'Ecuyer-CMRG")
.Random.seed <- r01[[39]]
u <- floor(runif(1, 1, 101))
if (i == as.integer(u)){
out <- "a" + "b"
} else {
out <- 1 + i
}
u
#restore RNG kind to original
RNGkind(kind[1])
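The difference between seeding with a single integer and restoring a full RNG state can be checked in base R alone, without any foreach backend. The sketch below is an illustration under that assumption, not the doRNG internals. It uses assign() on the global environment because a plain `.Random.seed <- x` inside a function would only create a local variable.

```r
# Base R only; the variable names are made up for this illustration.
old_kind <- RNGkind()                # remember the current generator settings

RNGkind("L'Ecuyer-CMRG")
set.seed(1234)
state <- .Random.seed                # full generator state, an integer vector
first <- runif(3)

# Restoring the full state replays exactly the same draws.
assign(".Random.seed", state, envir = globalenv())
replay <- runif(3)

# set.seed() takes a single integer and builds a new state from it,
# so it cannot be used to restore a saved state vector.
set.seed(1234)
reseeded_state <- .Random.seed

identical(first, replay)
length(state)

# Put the generator settings back as they were.
do.call(RNGkind, as.list(old_kind))
```

Each element of attr(f01, "rng") is a state vector of this kind, which is why it has to be written to .Random.seed rather than passed to set.seed().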
I'm new to parallelizing for loops with foreach and struggle to understand how it works. As an exercise, I created a simple list (input2) based on a data frame (input). I am trying to calculate b by looping over h and j.
library(doParallel)
library(foreach)
library(dplyr)
input <- data.frame(matrix(rnorm(200*200, 0, .5), ncol=200))
input[input <=0] =0
input['X201'] <- seq(from = 0, to = 20, length.out = 10)
input <- input %>% select(c(X201, 1:200))
input2 <- split(input, f= input$X201)
a = 0
b= 0
cl <- parallel::makeCluster(20)
doParallel::registerDoParallel(cl)
tm1 <- system.time(
y <-
foreach (h = length(input2),.combine = 'cbind') %:%
foreach (j = nrow(input2[[h]]),.combine = 'c',packages = 'foreach') %dopar%{
a = input2[[h]][j,3]
b = b + a
}
)
parallel::stopCluster(cl)
registerDoSEQ()
print("Cluster stopped.")
y is about 0.55 (the exact value depends on the random numbers generated), which is the value of input2[[10]][20,3], not the cumulative value I wanted. I checked the manual of the foreach package but I am still not sure I fully understand how foreach works.
A foreach loop run with %dopar% returns results rather than letting the loop body change variables outside the loop, so don't expect a and b to be updated.
Try the following
cl <- parallel::makeCluster(20)
doParallel::registerDoParallel(cl)
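tm2 <- system.time({ # braces group both statements into the single expression being timed
results <- foreach(h = seq_along(input2), .combine = "c") %dopar% {
sum(input2[[h]][, 3]) # each task returns the sum of column 3 for one group
}
b <- sum(results) # combine the per-group sums in the main session
})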
parallel::stopCluster(cl)
registerDoSEQ()
b
tm2
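The same idea in a smaller form: each iteration returns a value and the combining happens outside the loop body. This sketch uses %do% so that it runs without a cluster; the list `values` is made-up data for illustration only.

```r
library(foreach)

values <- list(c(1, 2), c(3, 4), c(5, 6))   # hypothetical data

# Each iteration returns one partial sum; nothing outside the loop is modified.
partial <- foreach(v = values, .combine = "c") %do% sum(v)
total <- sum(partial)

# Alternatively, let foreach add the partial results itself.
total_combined <- foreach(v = values, .combine = "+") %do% sum(v)

total
total_combined
```

Swapping %do% for %dopar% after registering a backend should give the same totals, since the loop body does not rely on any shared state.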
I wonder whether I can use parallel computing with JAGS in the way I want.
Here is my R script.
library(foreach)
list.data2 <- foreach(i=1:n.rep) %do% {
foreach(j=1:2) %do% {list( cap = cap_data[[i]][[j]],
loc = loc_data[[i]][[j]],
eff = eff_data[[i]][[j]],
trap.numb = trap.numb2,
av = av,
forest = env$forest,
crop = env$crop,
bamboo = env$bamboo,
grass = env$grass,
abandoned = env$abandoned,
city = env$city,
rate = env$for_cr_rate,
m.numb = m.numb,
ones = matrix( 1, m.numb, 5 )
) #,bound_mat=bound_mat,bound_numb=bound_numb
}
}
inits2 <- foreach(j=1:2) %do% {list( n=n.inits2[[j]],
b0=0.5, b1=0.1, b2=0.1, b3=0.1, b4=0.1, b5=0.1, b6=0.1,
a0=5, a1=0.5, a2=0.5, a3=0.5, a4=0.5, a5=0.5, a6=0.5,
sd=1,
err=rep(0,m.numb),
r_capt=0.10
)
}
para2 <- c("a0","a1","a2","a3","a4", "a5","a6",
"b0","b1","b2","b3","b4", "b5","b6", "n28", "n29", "r_capt")
library(R2jags)
start.time <- Sys.time()
install.packages("doParallel")
library(doParallel)
registerDoParallel(cores=6)
x_real2 <- foreach( i = 1:2,
.packages = "R2jags"
) %dopar% {jags( "realdata_5years.txt",
data = list.data2[[i]][[?]],
inits = inits2[[i]],
para = para2,
n.chain = 3,
n.iter = n.1000000,
n.burnin = 400000,
n.thin = 200
)
}
sum_real2 <- foreach(i = 1:2) %do% {x_real2[[i]]$BUGSoutput$summary}
---------------------------------------------------------------------
So, I have two data sets, each with 30 (== n.rep) replicates.
That gives 60 data lists in total.
I would like to use six cores: 3 MCMC chains for each of the 2 data sets.
Moreover, I need to repeat this calculation 30 (== n.rep) times.
However, I do not know how to write this. The problem is in the last 4 lines of the script.
Should I use %dopar% twice?
or
Should I use jags.parallel in addition to foreach?
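One way to think about the indexing, independent of how the chains themselves are parallelized, is to flatten the replicate and data set indices into a single task table and loop over its rows. The sketch below is base R only; fit_one is a hypothetical stand-in for the jags() call and n_rep mirrors n.rep from the script above.

```r
n_rep <- 30

# One row per (replicate, data set) combination: 30 x 2 = 60 tasks.
tasks <- expand.grid(j = 1:2, i = seq_len(n_rep))

# fit_one is a hypothetical placeholder; a real version would call jags()
# with data = list.data2[[i]][[j]] and inits = inits2[[j]].
fit_one <- function(i, j) sprintf("rep %d, data set %d", i, j)

fits <- lapply(seq_len(nrow(tasks)), function(t) {
  fit_one(tasks$i[t], tasks$j[t])
})

nrow(tasks)
fits[[1]]
```

The lapply() over rows could then be replaced by a single foreach loop with %dopar%, so that only one level of parallelism has to be managed.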
A part of the code is
sse <-c()
k <- c()
for (i in seq(3, 15, 1)) {
y_pred <-knn(train = newdata.training, test = newdata.test,
cl = newdata.trainLabels, k=i)
pred_y <- as.numeric(levels(y_pred)[y_pred])
sse[i] <- sum((newdata.trainLabels-pred_y)^2)
k[i] <- i
}
pred_y is a column for each i. I want to create a data frame with all the 13 columns. Can it be done by using a for loop? Or else how can this be accomplished? I need suggestions.
You can use foreach which has the added advantage that it can be run in parallel if you have multiple cores in your CPU. Here is the non-parallel code:
library("iterators")
library("foreach")
library("FNN")
data(iris3)
newdata.training <- rbind(iris3[1:25,,1], iris3[1:25,,2], iris3[1:25,,3])
newdata.test <- rbind(iris3[26:50,,1], iris3[26:50,,2], iris3[26:50,,3])
newdata.trainlabels <- factor(c(rep(1,25), rep(2,25), rep(3,25)))
k.values = seq(3, 15, 1)
start = 2 # to index sse array using k.values
sse = numeric(length = length(k.values))
results = foreach(i = iter(k.values),.combine = cbind) %do%
{
y_pred <-knn(train = newdata.training, test = newdata.test,
cl = newdata.trainlabels, k=i, prob = TRUE)
pred_y <- as.numeric(levels(y_pred)[y_pred])
sse[i - start] <- sum((as.numeric(newdata.trainlabels)-pred_y)^2)
pred_y
}
results1 = data.frame(results)
colnames(results1) = k.values
Here is the parallel version:
# Parallel version
library("iterators")
library("foreach")
library("parallel")
library("doParallel")
library("FNN")
data(iris3)
newdata.training <- rbind(iris3[1:25,,1], iris3[1:25,,2], iris3[1:25,,3])
newdata.test <- rbind(iris3[26:50,,1], iris3[26:50,,2], iris3[26:50,,3])
newdata.trainlabels <- factor(c(rep(1,25), rep(2,25), rep(3,25)))
num.cores = detectCores()
clusters <- makeCluster(num.cores)
registerDoParallel(clusters)
k.values = seq(3, 15, 1)
start = 2 # to index sse array using k.values
sse = numeric(length = length(k.values))
results = foreach(i = iter(k.values),.combine = cbind, .packages=c("FNN")) %dopar%
{
y_pred <-knn(train = newdata.training, test = newdata.test,
cl = newdata.trainlabels, k=i, prob = TRUE)
pred_y <- as.numeric(levels(y_pred)[y_pred])
sse[i - start] <- sum((as.numeric(newdata.trainlabels)-pred_y)^2)
pred_y
}
results1 = data.frame(results)
colnames(results1) = k.values
stopCluster(clusters)
There are only a few differences between the non-parallel code and the parallel code. First, there are additional libraries to load. Second, you need to create and register the cluster that will do the parallel computation (and stop the cluster when you are done). Third, foreach uses the %dopar% infix operator instead of %do%. Fourth, the foreach call needs the .packages parameter so that the FNN package is loaded on each worker.
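One caveat with the parallel version: an assignment such as sse[i - start] <- ... runs on the workers, so the sse vector in the main session is not updated. A possible workaround is to return everything you need from each iteration and unpack it afterwards. The sketch below uses %do% and a made-up stand-in for the knn predictions so that it runs without a cluster or the FNN package.

```r
library(foreach)

k_values <- 3:5
truth <- rep(4, 4)                      # hypothetical true labels

out <- foreach(k = k_values) %do% {
  pred <- rep(k, 4)                     # hypothetical stand-in for knn() output
  list(pred = pred, sse = sum((truth - pred)^2))
}

# Unpack the per-iteration lists in the main session.
preds <- sapply(out, `[[`, "pred")      # one column of predictions per k
sse <- sapply(out, `[[`, "sse")         # one error value per k
colnames(preds) <- k_values

dim(preds)
sse
```

The same pattern works unchanged with %dopar%, because nothing is written to variables outside the loop body.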
I'm trying to use doMC with foreach and %dopar%. Here is the function:
doTheMath_MC <- function(st, end, nd) {
print(getDoParWorkers())
if (st > end) stop("end must be larger than st")
# Helper function from stackoverflow.com/a/23158178/633251
tr <- function(x, prec = 0) trunc(x * 10^prec) / 10^prec
# Function to use with foreach
fef <- function(i, j, num, trpi) {
if (num[j] >= num[i]) return(NULL)
val <- num[i]/num[j]
if (!tr(val, nd) == trpi) return(NULL)
return(c(i, j, tr(val, nd)))
}
# Here we go...
nd <- nd - 1
trpi <- tr(pi, nd)
num <- st:end
ni <- length(num)
ans <- foreach(i = 1:ni, .combine = rbind) %:%
foreach(j = 1:ni, .combine = rbind) %dopar% {
fef(i, j, num, trpi)
}
cat("Done computing", paste("EST", st, end, nd+1, sep = "_"), "\n")
if (is.null(ans)) return(NULL)
ans <- as.matrix(na.omit(ans)) # probably not needed in MC version
return(ans) # c("num", "den", "est", "eff")
}
I've previously set up the cores and another function calls the function above (this info posted below, I don't think it is the problem). getDoParWorkers() reports that 7 cores have been assigned as intended. The cat statement verifies that the 2 'loops' are working correctly as far as output goes. However, only 1 core is used. Anyone see why? Mac OSX 10.10.2 and R 3.2 (2015-03-15 r67992). Finally, using doParallel to control everything gives the same result.
The steps which set up everything:
mn <- 1
mx <- 10000
jmp <- 1000
mc <- TRUE
if (mc) {
require("doMC")
registerDoMC(7)
}
st <- seq(mn -1, mx - jmp, jmp) + 1
end <- seq(mn - 1 + jmp, mx, jmp)
nd <- rep(1:15, each = mx/jmp) # watch the recycling
df <- data.frame(st = st, end = end, nd = nd)
for (i in 1:nrow(df)) {
findEsts(df$st[i], df$end[i], df$nd[i], MC = mc)
}
Sorry to answer my own question! I changed the dopar handling so that only the outer loop is made parallel:
ans <- foreach(i = 1:ni, .combine = rbind) %dopar%
for (j in 1:ni) {
fef(i, j, num, trpi)
}
And, I was simply not asking for enough iterations. For testing, I had been using mx = 10000 and jmp = 1000 (see original question). These were not large enough to trigger parallel processing apparently. Increasing each 10x was necessary to get parallel processing going. Thanks to the commenters!
NOTE: While the code above activates the parallel processing, it does not return the answer correctly. That will be the subject of another question.
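A likely reason for the incorrect answer is that a for loop evaluates to NULL, so each outer iteration has nothing to hand back to .combine. The sketch below contrasts that with collecting the inner results explicitly. It uses %do% so that it runs without a backend, and pair() is a hypothetical stand-in for fef().

```r
library(foreach)

# pair() is a hypothetical stand-in for fef(): a row or NULL.
pair <- function(i, j) if (j < i) c(i, j) else NULL

# The value of a for loop is NULL, so every outer task returns NULL.
with_for <- foreach(i = 1:3, .combine = rbind) %do% {
  for (j in 1:3) pair(i, j)
}

# Collect the inner results and bind them before returning.
with_lapply <- foreach(i = 1:3, .combine = rbind) %do% {
  do.call(rbind, lapply(1:3, function(j) pair(i, j)))
}

is.null(with_for)
dim(with_lapply)
```

Here the inner loop stays sequential inside each task, which keeps the outer-only parallelism described above while still returning the rows.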