I'm trying to run a foreach loop as follows:
foreach(i = 1:n, .combine = c,
        .packages = c("parallel", "doParallel", "pracma", "oce", "ineq", "gsw",
                      "seewave", "soundecology", "data.table", "openxlsx",
                      "tuneR", "vegan")) %dopar%
  res[i,] <- indices(files[i])
The custom function indices() uses readWave() from the tuneR package to read the wave files in a folder, and the loop iterates through them. Each time I run this, I get the following error:
Error in readWave(x) : Object 'i' not found
The problem does not occur in a regular for loop. I've googled this, but nobody seems to have run into it. Can anyone please help?
Thanks @Roland for pointing me in the right direction. Yes, I was trying to use foreach in a conceptually wrong way, treating it exactly as if it were a for loop. I was able to get it to work by changing it to:
palpha <- foreach(i = 1:n, .combine = "rbind", .packages = p) %dopar% indices(files[i])
I was later able to write the result obtained from foreach to my res data frame like so:
res <- as.data.frame(palpha)
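For completeness, a minimal end-to-end sketch of the working pattern (the folder path and the package vector p are placeholders here; indices() is my own function as described above):
library(doParallel)
p <- c("tuneR", "seewave", "soundecology")        # whatever indices() actually needs
files <- list.files("wav_folder", pattern = "\\.wav$", full.names = TRUE)
n <- length(files)
cl <- parallel::makeCluster(4)
registerDoParallel(cl)
# each iteration returns one row of indices; foreach rbinds them,
# instead of assigning into res[i, ] the way a for loop would
palpha <- foreach(i = 1:n, .combine = "rbind", .packages = p) %dopar% indices(files[i])
parallel::stopCluster(cl)
res <- as.data.frame(palpha)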
Apologies in advance, I will use pseudo-code. I tried to create a reprex but it didn't show the problem.
I am using the function parallel::parSapply to run a model and it's throwing an "object not found" error, even though the object is defined within the function. Below I present the structure of my "main function" and how I'm running it in parallel.
# Initializing data
zp2 = data.frame()        # contains unique identifier "ID" and data inputs
list.of.packages = c()    # contains all required libraries
ode_fun = function()      # function of ordinary differential equations
times = seq(0, 348)       # times to run the ODE function
# Main function
mfun = function(k){
  dd = zp2[k,]$ID
  # Script.R takes the inputs from 'zp2' to quantify 'inits' and 'new.inputs';
  # it is saved in the same working directory
  source("Script.R", local = TRUE)
  out = euler(y = inits, times, func = ode_fun, params = new.inputs)
  return(out)
}
# Parallel
cores <- parallel::detectCores()
cl <- parallel::makeCluster(cores - 1, type = "PSOCK")
parallel::clusterExport(cl = cl, ls(globalenv()))
parallel::clusterEvalQ(cl, sapply(c(list.of.packages), require, character.only = TRUE))
res.input = NULL
tictoc::tic()
res.input <- parallel::parSapply(1:350, FUN = function(x) mfun(x),
                                 cl = cl)
tictoc::toc()
parallel::stopCluster(cl)
This creates the error:
Error in checkForRemoteErrors(val) :
3 nodes produced errors; first error: object 'inits' not found
After debugging, I realized that source() was in fact creating and "bringing back" inits and new.inputs: I ran the function up to that point and made it return both objects. But for some reason ode_fun was not recognizing those objects, so it throws an error.
As a workaround, I split the function in two: 1. collect all the inputs, 2. run the ODE. But this is not very efficient and potentially confusing. Do you know what the problem could be and whether there's a solution for it?
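For reference, the split workaround looks roughly like this (still pseudo-code with the same placeholder names; get_inputs() and run_ode() are hypothetical helpers I use only to illustrate the two steps):
# Step 1: collect all the inputs up front (this part works)
get_inputs = function(k){
  dd = zp2[k,]$ID
  source("Script.R", local = TRUE)   # defines 'inits' and 'new.inputs'
  list(inits = inits, new.inputs = new.inputs)
}
all.inputs <- lapply(1:350, get_inputs)
# Step 2: run only the ODE in parallel, passing the pre-built inputs explicitly
run_ode = function(inp){
  euler(y = inp$inits, times, func = ode_fun, params = inp$new.inputs)
}
res.input <- parallel::parSapply(cl = cl, all.inputs, FUN = run_ode)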
My goal is to do some operation on a data frame, as follows:
exp_info <- data.frame(location.Id = 1:1e7,
                       x = rnorm(10))
For each location, I want to square the x variable and write an individual csv file. My actual computation is lengthier and involves other steps, so this is a simplified example. This is how I am parallelising my task:
library(doParallel)
myClusters <- parallel::makeCluster(6)
doParallel::registerDoParallel(myClusters)
foreach(i = 1:nrow(exp_info),
        .packages = c("dplyr", "data.table"),
        .errorhandling = 'remove',
        .verbose = TRUE) %dopar%
{
  rowRef <- exp_info[i, ]
  rowRef <- rowRef %>% dplyr::mutate(x.sq = x^2)
  fwrite(rowRef, paste0(i, '_iteration.csv'))
}
When I look at my working directory, all the individual csv files (1e7 of them) have been written out, which suggests the above code is successful. However, my foreach loop does not end even after all the files are written, and I have to kill the job, which also does not generate any error. Does anyone have any idea why this could possibly happen?
I'm experiencing something similar. I don't know the answer, but will add this: the same code and operation works on one computer, but fails to exit the foreach loop on another computer. Hope this provides some direction.
I have a large number of netcdf files, each of which is a 300*300 = 90000 grid.
In a loop, I open each file, reshape its 90000 grid values into a single column, then open the next file and append it as a new column, and so on. This builds a data frame in which each column represents one netcdf file, with 90000 rows.
The code is as follows.
files = list.files("C:/cygwin64/home/Suchi",
                   pattern = "3B-HHR.MS.MRG.3IMERG.2001", full.names = TRUE)
# Loop over files
for(i in 1:files) {
  nc = ncdf4::nc_open(files[i])
  lw = ncvar_get(nc, "pcp")
  lw <- as.data.frame(lw)
  lw <- as.data.frame(t(lw))
  lw <- unlist(lw)
  lw <- data.frame(lw)
  # Add the values from each file to a single data.frame
  df <- cbind(df, data.frame(lw))
  ncdf4::nc_close(nc)
}
The above code works fine; it just takes too much time. Please help me do the same using the foreach command with parallel processing.
When using foreach parallel processing, I get the following error:
Error unlist(ncdf4::nc_open(files[i])) :
task 1 failed - "missing value where TRUE/FALSE needed"
I don't see your foreach loop, so I have made one for you. The error you are receiving may be due to the fact that your loop is written as:
for(i in 1:files)
which is wrong, since files is a character vector, not a number. It should instead be:
for(i in 1:length(files))
Here is the foreach loop that I created for your script. Let me know if this works:
library(parallel)
library(doParallel)
library(foreach)
files = list.files("C:/cygwin64/home/Suchi",
                   pattern = "3B-HHR.MS.MRG.3IMERG.2001", full.names = TRUE)
# Loop over files in parallel: each iteration returns one column,
# and .combine = cbind assembles them into df on the master process
cl = makeCluster(10)
registerDoParallel(cl)
df <- foreach(i = 1:length(files), .combine = cbind) %dopar% {
  library(ncdf4)
  nc = ncdf4::nc_open(files[i])
  lw = ncvar_get(nc, "pcp")
  lw <- as.data.frame(lw)
  lw <- as.data.frame(t(lw))
  lw <- unlist(lw)
  lw <- data.frame(lw)
  ncdf4::nc_close(nc)
  lw   # the value of the last expression is this iteration's result
}
stopCluster(cl)
I wrote a loop:
for(a in 1:100){
  # ... code that produces test1 and test2 ...
  list <- list("test1" = test1, "test2" = test2)
  save(list, file = paste(paste("test", a, sep = "_"), ".RData", sep = ""))
}
The iterative naming of the saved file works well, but I have not figured out a way to do the same for the list. The problem is that if I load more than one of the files into R, the objects are all called list, and thus they overwrite each other.
I have tried mv(from = "list", to = paste("test", a, sep = "_")) but it does not work.
Can anybody help me with this?
Indeed this is a tricky point. Since save(eval(parse(text = paste0("list", a))), file = paste("test", a, ".RData", sep = "")) is not working for some reason, your best bet IMO would be to save one file only (which might be more convenient anyway) and access the objects by name in the list of lists:
test1 <- 1
test2 <- 2
mylist <- list()
for(a in 1:100){
  # assign(paste0("list", a), list("test1" = test1, "test2" = test2), environment())
  mylist[[a]] <- list("test1" = test1, "test2" = test2)
}
save(mylist, file = "mylist.RData")
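Then, in a later session, a minimal usage sketch: load the single file and pull out whatever you need by position and name.
load("mylist.RData")              # restores 'mylist' into the workspace
mylist[[7]]$test1                 # "test1" from the 7th iteration
sapply(mylist, `[[`, "test2")     # all "test2" values at once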
I deal with some shapefiles and rasters.
When I execute my script line by line (or part by part), everything runs as expected. However, if I execute it as a whole (either by sourcing it or with Ctrl+A and then Ctrl+Enter), it throws an error in the following section:
# ... some code
list = list()
list = foreach(i = seq(from = 9, to = 80, by = 5)) %dopar% {
  df[which(df@data$column.name > i), ]
}
# ... some code
Error message: Error in { : task 2 failed - "Object of type 'S4' is not subsettable"
Here df is a SpatialPolygonsDataFrame (a fishnet). The code subsets my SpPolDaFr so that I get 15 subsetted SpPolDaFr objects written to the list.
I was wondering whether foreach might be the reason. However, I have other foreach calls earlier in the script which run fine. I use the doParallel loop to speed things up, because my SpPolDaFr is 11 GB in size.
You usually get this kind of error when the workers haven't loaded the package that defines the class of one of the variables. If the class of "df" is "SpatialPolygonsDataFrame" which is defined by the "sp" package, then you should use the foreach .packages="sp" option so the workers will be able to properly operate on "df".
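For example, applied to the loop from the question, it would look something like this (a rough sketch; the cluster size is arbitrary):
library(doParallel)
cl <- makeCluster(4)
registerDoParallel(cl)
result <- foreach(i = seq(from = 9, to = 80, by = 5), .packages = "sp") %dopar% {
  df[which(df@data$column.name > i), ]
}
stopCluster(cl)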
I'd like to suggest using a different cluster type when registering the parallel backend. Unlike the default type, FORK creates a copy of the current process, so there's no need to specify the packages; however, FORK is only available on Unix-like systems. The code should look like this:
cl <- makeCluster(N_CORES, type = "FORK")   # N_CORES: however many workers you want
registerDoParallel(cl)
list = foreach(i = seq(from = 9, to = 80, by = 5)) %dopar% {
  df[which(df@data$column.name > i), ]
}
stopCluster(cl)