I've been trying to get a parallelized foreach loop running in R. It works fine for approximately ten iterations but then crashes with the error:
Error in { : task 7 failed - "missing value where TRUE/FALSE needed"
Calls: %dopar% -> <Anonymous>
Execution halted
I append the results of each loop to a file, which does show the output to be as expected. My script is as follows, using the combn_sub function from this post:
library(data.table)
LBRA <- fread(
input = "LBRA.012",
data.table = FALSE)
str_bra <- nrow(LBRA)
br1sums <- colSums(LBRA)
b1non <- which(br1sums == 0)
LBRA_trim <- LBRA[,-b1non]
library(foreach)
library(doMC)
registerDoMC(28)
foreach(X = seq(2, (nrow(LBRA)-1))) %dopar% {
com <- combn_sub(
x = nrow(LBRA),
m = X,
nset = 1000)
out_in <- matrix(
ncol = 2,
nrow = 1)
colnames(out_in) <- c("SNPs", "k")
for (A in seq(1, ncol(com))){
rowselect <- com[, A]
sub <- LBRA_trim[rowselect, ]
subsum <- colSums(sub)
length <- length(which(subsum != 0)) - 1
out_in <- rbind(out_in, c(length, X))
}
write.table(
file = "plateau.csv",
sep = "\t",
x = out_in,
append = TRUE)
}
I had a similar problem with my foreach call...
tmpcol <- foreach(j = idxs:idxe, .combine=cbind) %dopar% { imp(j) }
Error in { : task 18 failed - "missing value where TRUE/FALSE needed"
Changing the .errorhandling parameter only ignores the error:
tmpcol <- foreach(j = idxs:idxe, .combine=cbind, .errorhandling="pass") %dopar% { imp(j) }
Warning message:
In fun(accum, result.18) :
number of rows of result is not a multiple of vector length (arg 2)
I suggest running the body of your foreach call directly for X = 7. The problem in my case was that my function, imp(j), was throwing an error (for j = 18, it was choking on an NA in a calculation), which resulted in the vague output from foreach.
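For example, here is a sketch of that debugging step using the loop body from the question (all objects assumed from above):
X <- 7
com <- combn_sub(x = nrow(LBRA), m = X, nset = 1000)
out_in <- matrix(ncol = 2, nrow = 1)
colnames(out_in) <- c("SNPs", "k")
for (A in seq(1, ncol(com))) {
  # identical to the loop body, but run sequentially so the real error
  # message and traceback surface instead of foreach's task summary
  rowselect <- com[, A]
  sub <- LBRA_trim[rowselect, ]
  subsum <- colSums(sub)
  out_in <- rbind(out_in, c(length(which(subsum != 0)) - 1, X))
}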
As @Roland points out, it's a very bad idea to write to a file within a foreach loop. Even writing in append mode, the individual cores will attempt to write to the file simultaneously and may clobber each other's output. Instead, capture the results of the foreach statement using the .combine="rbind" option and then write to file after the loop:
library(doParallel);
cluster <- makeCluster(28, outfile="MulticoreLogging.txt");
registerDoParallel(cluster);
foreach_outcome_table <- foreach(X = seq(2, (nrow(LBRA)-1)), .combine="rbind") %dopar% {
cat(paste(Sys.info()[['nodename']], Sys.getpid(), sep='-'), "now performing loop", X, "\n");
com <- combn_sub(x = nrow(LBRA), m = X, nset = 1000);
out_in <- matrix(ncol = 2, nrow = 0);  # start empty so the NA seed row doesn't end up in the results
colnames(out_in) <- c("SNPs", "k");
for (A in seq(1, ncol(com))){
rowselect <- com[, A];
sub <- LBRA_trim[rowselect, ];
subsum <- colSums(sub);
length <- length(which(subsum != 0)) - 1;
out_in <- rbind(out_in, c(length, X));
}
out_in;
}
write.table(file = "plateau.csv", sep = "\t", x = foreach_outcome_table, append = TRUE);
Further, you could replace the inner for loop with a nested foreach loop, which would probably be more efficient.
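For instance, an untested sketch using foreach's %:% nesting operator and a column iterator (combn_sub, LBRA and LBRA_trim are the objects from the question):
library(foreach)
library(iterators)
foreach_outcome_table <-
  foreach(X = seq(2, nrow(LBRA) - 1), .combine = "rbind") %:%
  foreach(colsel = iter(combn_sub(x = nrow(LBRA), m = X, nset = 1000), by = "column"),
          .combine = "rbind") %dopar% {
    # each task handles one column of the combination matrix
    rowselect <- as.vector(colsel)
    subsum <- colSums(LBRA_trim[rowselect, ])
    c(SNPs = length(which(subsum != 0)) - 1, k = X)
  }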
There could be many reasons for the error, "missing value where TRUE/FALSE needed".
What helped for me was to remove the %dopar% and run the same code on a single item. This revealed clearer error messages, which I think get lost when running in parallel. My error had nothing to do with %dopar% itself.
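A minimal sketch of that approach, using the loop from the first question above:
library(foreach)
# run one suspect iteration sequentially with %do%; errors now surface with
# a normal traceback instead of foreach's generic "task failed" message
foreach(X = 7) %do% {
  com <- combn_sub(x = nrow(LBRA), m = X, nset = 1000)
  # ... rest of the loop body ...
}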
I have 1000 JSON files that I would like to read in parallel. I have 4 CPU cores.
I have a character vector with the names of all the files, as follows:
cik_files <- list.files("./data/", pattern = ".json")
Using this vector I load each file, extract the data, and add it to the following list:
data <- list()
Below is the code for extracting the data:
library(jsonlite)
for(i in 1:1000){
data1 <- fromJSON(paste0("./data/", cik_files[i]), flatten = TRUE)
if(("NetIncomeLoss" %in% names(data1$facts$`us-gaap`))){
data1 <- data1$facts$`us-gaap`$NetIncomeLoss$units$USD
data1 <- data1[grep("CY20[0-9]{2}$", data1$frame), c(3, 9)]
try({if(nrow(data1) > 0){
data1$cik <- strtrim(cik_files[i], 13)
data[[length(data) + 1]] <- data1
}}, silent = TRUE)
}
}
This, however, takes quite a lot of time. So I was wondering how I can run the code within the for loop, but in parallel.
Thanks in advance.
Here is an attempt to solve the problem in the question. Untested, since there is no data.
Step 1
First of all, rewrite the loop in the question as a function.
f <- function(i, path = "./data", cik_files){
filename <- file.path(path, cik_files[i])
data1 <- fromJSON(filename, flatten = TRUE)
if(("NetIncomeLoss" %in% names(data1$facts$`us-gaap`))){
data1 <- data1$facts$`us-gaap`$NetIncomeLoss$units$USD
found <- grep("CY20[0-9]{2}$", data1$frame)
if(length(found) > 0){
tryCatch({
out <- data1[found, c(3, 9)]
out$cik <- strtrim(cik_files[i], 13)
out
},
error = function(e) e,
warning = function(w) w)
} else NULL
} else NULL
}
Step 2
Now load the package parallel and run one of the following, depending on OS.
library(parallel)
# Not on Windows
library(jsonlite)
json_list <- mclapply(seq_along(cik_files), f, cik_files = cik_files, mc.cores = detectCores() - 1L)
# Windows
ncores <- detectCores()
cl <- makeCluster(ncores - 1L)
clusterExport(cl, "cik_files")
clusterEvalQ(cl, library(jsonlite))
json_list <- parLapply(cl, seq_along(cik_files), f, cik_files = cik_files)
stopCluster(cl)
Step 3
Extract the data from the returned list json_list.
err <- sapply(json_list, inherits, "error")
warn <- sapply(json_list, inherits, "warning")
ok <- !(err | warn)
json_list[ok] # correctly read in
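If, like the original loop, you want one combined table rather than a list, the clean elements can then be stacked; a sketch, assuming every clean element is a data frame as returned by f:
data <- do.call(rbind, json_list[ok])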
I'm trying to create a data frame within the if statement, but when I use this data frame in the else scope I get the following error:
Error: $ operator is invalid for atomic vectors
Here is part of my code:
for(i in 1:numOfTrays){
if (i == 1){
parameters <- c(Qin=Qin,A=A)
state <- c(h=h0)
time <- seq(0,200,by=1)
out <- ode(y= state, func = FluidH, parms = parameters, times = time)
Qout <- cbind(out[,1],VFRoutput((out[,2])))
colnames(Qout)<-c("time","Qout")
Qin <- as.data.frame(Qout)
#write.csv(Qin,"Qin")
}
else{
for (j in 1:length(Qin$Qout)){
h <- h0 + ((Qin$Qout[j]-VFRoutput(h0))/A)*Qin$time[j]
I tried to make Qin global with the <<- operator but when I tried to print Qin$Qout from the if statement I got the same error.
Any solutions?
While your example is not reproducible, I would venture a guess that Qout exists before the if statement as a matrix of some sort (perhaps with no column names). Here's a reproducible example:
Qout <- matrix(NA, nrow = 3, ncol = 3)
chk <- TRUE
if (chk) {
out <- as.data.frame(Qout)
} else {
Qout$col1
}
With chk <- FALSE, the else branch hits the matrix and fails:
Qout <- matrix(NA, nrow = 3, ncol = 3)
chk <- FALSE
if (chk) {
out <- as.data.frame(Qout)
} else {
Qout$col1
}
Error in Qout$col1 : $ operator is invalid for atomic vectors
So you need to make sure that in the else branch you're dealing with an object that supports $ subsetting, like a data.frame.
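A minimal sketch of the fix, converting before subsetting so $ is valid in both branches:
Qout <- as.data.frame(matrix(NA, nrow = 3, ncol = 3))
Qout$V1  # works now; V1 is the default column name as.data.frame assigns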
I am writing a function to do backward stepwise feature selection. I use foreach for the inner loop, assuming that running it in parallel could save me some time. However, the process viewer shows that it only uses multiple threads in the first iteration. If I use %do% instead of %dopar%, the code works fine on one thread.
thedata # my data
fit.ols # model on thedata
library(doParallel)
cl <- makeCluster(2)
registerDoParallel(cl)
for (i in 1:(nv - max(0,nk - 1))) {
fit2.ols <- fit.ols
pname <- names(fit.ols$Design$unit)
pname2 <- fit.ols$Design$name
pterm <- attr(fit.ols$terms, "term.labels")
drop1Model <- foreach (j = pname,
.packages = c("rms", "stats"),
.export = c("thedata", "pname2", "pterm")) %dopar% {
drop.var <- j
remove.index <-
which(unlist(lapply(strsplit(pname2," \\* "),
function(x)
any(!is.na(match(j,x))))))
remove.term <- pterm[remove.index]
model <- update(fit.ols,
as.formula(paste(
".~ . - ", paste(remove.term, collapse = "-"), sep = ""
)))
drop1Model <- list(drop.var = drop.var,
remove.index = remove.index,
remove.term = remove.term,
model = model)
}
# browser()
rsq <-
sapply(drop1Model,function(x)
x$model$stats["R2"])
max.rsq <- max(rsq)
drop.index <- which(rsq == max.rsq)[1]
fit.ols <- drop1Model[[drop.index]]$model
dropTerms[[i]] <- drop1Model[[drop.index]]$remove.term
res[i + 1, 1] <- drop1Model[[drop.index]]$drop.var
res[i + 1, 2:3] <- c(i, fit.ols$stats[["R2"]])
if (fit.ols$stats[["R2"]] <= cutoff) {
minimodel <- fit2.ols
}
}
Note: I'm using R 3.2.5 with the latest rms, foreach and doParallel packages on Ubuntu 14.04.
When using the doParallel library, I encountered this weird error; the system throws:
Error in { : task 1 failed - "could not find function "%dopar%""
To be specific, this is what I did:
library(doParallel)
cl <- makeCluster(4)
registerDoParallel(cl)
# Read the data
coin95 <-read.csv('~/Documents/coin95.csv')
coin95 <- coin95[,!(names(coin95) %in% c("X"))]
coin95[c("Person")] <- sapply(coin95[c("Person")],as.character)
# create the name list
coin95_name <- as.character(coin95$Person)
coin95_name <- unique(coin95_name)
n <- as.numeric(length(coin95_name))
# the average counting process
ntw <- function(now){
foreach (Ii = coin95_name,.combine = "+",.export = c("coin95","n")) %dopar% {
time <-subset(coin95, subset = coin95$Person == Ii)$duration
stepfun(time,seq(0,length(time)))(now)/n
}
}
# the average cumulative hazard
lambda <- function(now,params){
b <- params[1]
sigma <- params[2]
mu <- params[3]
xi <- params[4]
beta1 <- params[5]
beta2 <- params[6]
k <- function(spread){
L0 <- (1+(spread -mu)*xi/sigma)^(-1/xi)
return(L0)
}
foreach(Ii = coin95_name,.combine = "+",.export = c("coin95","n")) %dopar% {
time <- subset(coin95, subset = coin95$Person == Ii)$duration
noncov <- subset(coin95, subset = coin95$Person == Ii)$noncovered
reim <- subset(coin95, subset = coin95$Person == Ii)$reimbursement
(b*now+sum( exp(-k(now-time[(time < now)])+beta1*noncov[(time < now)]+beta2*reim[(time <now)]) ))/n
}
}
So far, everything is GOOD, I have created two functions ntw and lambda using the foreach. They worked perfectly.
Then I create the third function also using the foreach:
# the distance
Time <- coin95$duration
Time <- sort(as.double(Time))
jl <- function(params){
res<-foreach(Ii = Time,.combine = "rbind",.export = c("ntw","lambda")) %dopar% {
(ntw(Ii)-ntw(Ii-1e-7)) * (ntw(Ii)- lambda(Ii,params))^2
}
return(sqrt(sum(res)))
}
guess<-c(0.0,1.3333,0.0,0.1,-1.2,3e-3)
Type jl(guess):
> jl(guess)
Error in { : task 1 failed - "could not find function "%dopar%""
Any idea what's going wrong?
A quick fix for the problem with foreach %dopar% is to reinstall these packages:
install.packages("doSNOW")
install.packages("doParallel")
install.packages("doMPI")
The above packages are responsible for parallelism in R. A bug that existed in old versions of these packages has since been fixed. I should mention that this will most likely help even if you are not using all of these packages in your code.
I have tried running the following code on a Unix machine with 20 CPUs, using the R foreach, parallel, doParallel, and party packages (my objective is to have the party varimp function working on several CPUs in parallel):
parallel_compute_varimp <- function (object, mincriterion = 0, conditional = FALSE, threshold = 0.2,
nperm = 1, OOB = TRUE, pre1.0_0 = conditional)
{
response <- object@responses
input <- object@data@get("input")
xnames <- colnames(input)
inp <- initVariableFrame(input, trafo = NULL)
y <- object@responses@variables[[1]]
error <- function(x, oob) mean((levels(y)[sapply(x, which.max)] != y)[oob])
w <- object@initweights
perror <- matrix(0, nrow = nperm * length(object@ensemble), ncol = length(xnames))
colnames(perror) <- xnames
data = foreach(b = 1:length(object@ensemble), .packages = c("party","stats"), .combine = rbind) %dopar%
{
try({
tree <- object@ensemble[[b]]
oob <- object@weights[[b]] == 0
p <- .Call("R_predict", tree, inp, mincriterion, -1L, PACKAGE = "party")
eoob <- error(p, oob)
for (j in unique(varIDs(tree))) {
for (per in 1:nperm) {
if (conditional || pre1.0_0) {
tmp <- inp
ccl <- create_cond_list(conditional, threshold, xnames[j], input)
if (is.null(ccl)) {
perm <- sample(which(oob))
}
else {
perm <- conditional_perm(ccl, xnames, input, tree, oob)
}
tmp@variables[[j]][which(oob)] <- tmp@variables[[j]][perm]
p <- .Call("R_predict", tree, tmp, mincriterion, -1L, PACKAGE = "party")
}
else {
p <- .Call("R_predict", tree, inp, mincriterion, as.integer(j), PACKAGE = "party")
}
perror[b, j] <- (error(p, oob) - eoob)
}
}
########
# return data to the %dopar% loop data variable
perror[b, ]
########
}) # END OF TRY
} # END OF LOOP WITH PARALLEL COMPUTING
perror = data
perror <- as.data.frame(perror)
return(MeanDecreaseAccuracy = colMeans(perror))
}
environment(parallel_compute_varimp) <- asNamespace('party')
library(doParallel)
cl <- makeCluster(detectCores())
registerDoParallel(cl, cores = detectCores())
<...>
system.time(data.cforest.varimp <- parallel_compute_varimp(data.cforest, conditional = TRUE))
but I am getting an error:
> system.time(data.cforest.varimp <- parallel_compute_varimp(data.cforest, conditional = TRUE))
Error in unserialize(socklist[[n]]) : error reading from connection
Timing stopped at: 58.302 13.197 709.307
The code was working with a smaller dataset on 4 CPUs.
I am running out of ideas. Can someone suggest a way to reach my objective of running the party package's varimp function on parallel CPUs?
The error:
Error in unserialize(socklist[[n]]) : error reading from connection
means that the master process got an error when calling unserialize to read from the socket connection to one of the workers. That probably means that the corresponding worker died, thus dropping its end of the socket connection. Unfortunately, it may have died for any number of reasons, many of which are very system specific.
You can usually figure out why the worker died by using the makeCluster "outfile" option so that the error message generated by the worker isn't thrown away. I usually recommend using outfile="" as described in this answer. Note that the "outfile" option works the same in both the snow and parallel packages.
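For example, a sketch with a doParallel socket cluster (the worker count here is arbitrary):
library(doParallel)
# outfile = "" sends each worker's output and error messages to the
# master's console instead of discarding them
cl <- makeCluster(4, outfile = "")
registerDoParallel(cl)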
You could also verify that your foreach loop works correctly when executed sequentially by registering the sequential backend:
registerDoSEQ()
If you're lucky, the foreach loop will fail when executed sequentially, since it's usually easier to figure out what is going wrong.