Problems reading MATLAB .mat files with a foreach loop in R

I have over a thousand MATLAB files that I want to read into R. I use the R.matlab package to read them, and I would like to parallelize the operation.
However, once I call the loop (I am generating a single data set from all the .mat files) I get an error:
Error in { : task 1 failed - "could not find function "readMat""
(I translated the quoted part of the error, since my R is not in English.)
Without the foreach command everything works fine, but it just takes too long. Here is the code:
library(R.matlab)
library(plyr)
library(foreach)
library(doParallel)

a <- list.files()
data <- readMat(a[1])
for (j in 2:length(a)) {
  data1 <- readMat(a[j])
  if (!is.null(data1)) {
    data <- rbind.fill(data, data1)
  }
  print(j)
}
With the foreach command I get the above error. Here is the code:
library(R.matlab)
library(plyr)
library(foreach)
library(doParallel)

cl <- makeCluster(8)
registerDoParallel(cl)

a <- list.files()
data <- readMat(a[1])
foreach(j = 2:length(a)) %dopar% {
  data1 <- readMat(a[j])
  if (!is.null(data1)) {
    data <- rbind.fill(data, data1)
  }
  print(j)
}
Does it mean foreach and readMat should not be used together?

In case anyone is wondering: I had forgotten to load R.matlab on the cluster nodes. I just needed to add the .packages argument to the foreach call:
library(R.matlab)
library(plyr)
library(foreach)
library(doParallel)

cl <- makeCluster(8)
registerDoParallel(cl)

a <- list.files()
data <- readMat(a[1])
foreach(j = 2:length(a), .packages = c("plyr", "doParallel", "R.matlab")) %dopar% {
  data1 <- readMat(a[j])
  if (!is.null(data1)) {
    data <- rbind.fill(data, data1)
  }
}
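Note that assignments made inside %dopar% happen in the worker processes, so data built up this way is not available in the master session after the loop. Below is a minimal sketch of an alternative that lets foreach collect the per-file results itself via .combine, assuming the readMat() output can be row-bound with rbind.fill exactly as in the sequential loop above:

library(R.matlab)
library(plyr)
library(foreach)
library(doParallel)

cl <- makeCluster(8)
registerDoParallel(cl)

a <- list.files()
# Each iteration returns one readMat() result; .combine = rbind.fill
# stacks them into a single data set on the master process.
data <- foreach(j = seq_along(a), .combine = rbind.fill,
                .packages = c("plyr", "R.matlab")) %dopar% {
  readMat(a[j])
}

stopCluster(cl)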

Related

Parallel computations with foreach and %dopar% do not generate a file

I am trying to use the doParallel package with foreach and %dopar% for the first time, as I need to speed up my computation.
Although the code executes without raising any error, no file is stored in the output folder.
I have a list of file paths (list_files) and a function (my_function) that I previously validated using sapply. With sapply, the output is stored in the output location; with foreach and %dopar%, nothing appears there.
# Define my function and call it my_function
# (as_tibble(), %>% and filter() require dplyr)
library(dplyr)

my_function <- function(input_dir, output_dir) {
  tryCatch(
    expr = {
      file <- read.csv(input_dir, sep = "\t", col.names = c("column1", "column2"))
      file <- as_tibble(file)
      file_noNA <- file %>% filter(!is.na(column1))
      name <- substr(input_dir, nchar(input_dir) - 8, nchar(input_dir) - 4)
      save(file_noNA, file = paste0(output_dir, name, ".rds"))
    }
  )
}
library("parallel")
library("foreach")
library("doParallel")
# Set number of cores
n.cores <- 5
# Check doParallel package
doParallel::registerDoParallel(n.cores)
getDoParWorkers()
# Apply function with parallel computing
foreach(i = list_files) %dopar% function(x) {
  my_function(
    input_dir = x,
    output_dir = output_location)
}
This is what I have tried (without success):
- Assigning the result of the foreach call to a variable
- Using foreach(i = list_files, .combine = 'c') %dopar% function(x) {...}
- Using a single file instead of list_files
- Reducing the number of cores
Do I need to add an export statement, e.g. .export = ls(envir = globalenv()) or .export = ls()?
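For reference, a minimal sketch of how that call is usually written so that my_function is actually invoked on each path, rather than each iteration just returning an anonymous function. It assumes list_files and output_location exist as in the original session, and includes .packages = "dplyr" on the assumption that my_function's dplyr calls must be available on the workers:

library(foreach)
library(doParallel)

doParallel::registerDoParallel(5)

# The loop variable i is the current file path, so my_function can be
# called directly in the body; no wrapper function is needed.
foreach(i = list_files, .packages = "dplyr") %dopar% {
  my_function(input_dir = i, output_dir = output_location)
}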

How to initialize workers to use package functions in parallel

I am developing an R package and trying to use parallel processing in it for an embarrassingly parallel problem. I would like to write a loop or functional that uses the other functions from my package. I am working in Windows, and I have tried using parallel::parLapply and foreach::%dopar%, but cannot get the workers (cores) to access the functions in my package.
Here's an example of a simple package with two functions, where the second calls the first inside a parallel loop using %dopar%:
add10 <- function(x) x + 10

slowadd <- function(m) {
  cl <- parallel::makeCluster(parallel::detectCores() - 1)
  doParallel::registerDoParallel(cl)
  `%dopar%` <- foreach::`%dopar%` # so %dopar% doesn't need to be attached
  foreach::foreach(i = 1:m) %dopar% {
    Sys.sleep(1)
    add10(i)
  }
  parallel::stopCluster(cl)
}
When I load the package with devtools::load_all() and call the slowadd function, Error in { : task 1 failed - "could not find function "add10"" is returned.
I have also tried explicitly initializing the workers with my package:
add10 <- function(x) x + 10

slowadd <- function(m) {
  cl <- parallel::makeCluster(parallel::detectCores() - 1)
  doParallel::registerDoParallel(cl)
  `%dopar%` <- foreach::`%dopar%` # so %dopar% doesn't need to be attached
  foreach::foreach(i = 1:m, .packages = 'mypackage') %dopar% {
    Sys.sleep(1)
    add10(i)
  }
  parallel::stopCluster(cl)
}
but I get the error Error in e$fun(obj, substitute(ex), parent.frame(), e$data) : worker initialization failed: there is no package called 'mypackage'.
How can I get the workers to access the functions in my package? A solution using foreach would be great, but I'm completely open to solutions using parLapply or other functions/packages.
I was able to initialize the workers with my package's functions, thanks to people's helpful comments. By making sure that all of the package functions that were needed were exported in the NAMESPACE and installing my package with devtools::install(), foreach was able to find the package for initialization. The R script for the example would look like this:
#' @export
add10 <- function(x) x + 10

#' @export
slowadd <- function(m) {
  cl <- parallel::makeCluster(parallel::detectCores() - 1)
  doParallel::registerDoParallel(cl)
  `%dopar%` <- foreach::`%dopar%` # so %dopar% doesn't need to be attached
  out <- foreach::foreach(i = 1:m, .packages = 'mypackage') %dopar% {
    Sys.sleep(1)
    add10(i)
  }
  parallel::stopCluster(cl)
  return(out)
}
This is working, but it's not an ideal solution. First, it makes for a much slower workflow. I was using devtools::load_all() every time I made a change to the package and wanted to test it (before incorporating parallelism), but now I have to reinstall the package every time, which is slow when the package is large. Second, every function that is needed in the parallel loop needs to be exported so that foreach can find it. My actual use case has a lot of small utility functions which I would rather keep internal.
You can call devtools::load_all() inside the foreach loop, or load the functions you need with source():
out <- foreach::foreach(i = 1:m) %dopar% {
  Sys.sleep(1)
  source("R/some_functions.R")
  load("R/sysdata.rda")
  add10(i)
}
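For the load_all() variant mentioned above, a minimal sketch could look like this; it assumes each worker's working directory is the package root, so that "." points at the package source:

out <- foreach::foreach(i = 1:m) %dopar% {
  # Load the package source on the worker instead of installing it;
  # "." is assumed to resolve to the package root on every worker.
  devtools::load_all(".")
  Sys.sleep(1)
  add10(i)
}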

R foreach parallel and package variables

I have the following code, where I get an error like:
object 'a' could not be found
This file is inside a package.
It works when I use %do% instead of %dopar%.
a <- 2

fun1 <- function(x)
{
  y <- x*a
  return(y)
}

fun2 <- function(n)
{
  foreach(data = 1:n, .combine = rbind, .multicombine = TRUE, .export = c("a", "fun1")) %dopar%
  {
    load_all() # Works, get error if line is removed
    y <- fun1(data)
    return(y)
  }
}
In my main file, where I use devtools::load_all() to load the package and doParallel to run my foreach loop on multiple cores, I get the error when I execute fun2(5).
If I use a directly in the foreach body, everything works. But when I call a function that uses the a variable, I get the error.
My main issue is that I want to be able to call fun1 both from fun2 and stand-alone from the package.
The cluster is created as follows:
cl <- makeCluster(16)
registerDoParallel(cl)
clusterCall(cl, function(x) .libPaths(x), .libPaths())
# %dopar% code
stopCluster(cl)

Assign function output with assign

I am using
library(foreach)
library(doSNOW)
And I have a function mystoploss(data, n = 14).
I then call it like this (I want to loop over just n = 14 for now):
registerDoSNOW(makeCluster(4, type = "SOCK"))
foreach(i = 14) %dopar% {
  assign(paste("Performance", i, sep = ""), mystoploss(data = mydata, n = i))
}
I then try to find Performance14 from above, but it is not assigned.
Is there some way to assign so the output will be in Performance14?
And if I use
foreach(i = 14) %dopar% {
  assign(paste("Performance", i, sep = ""), mystoploss(data = mydata, n = i), envir = .GlobalEnv)
}
I get the error:
Error in e$fun(obj, substitute(ex), parent.frame(), e$data) :
worker initialization failed: Error in as.name
This is because the assign operations are happening in the worker processes. The values of the variables are being sent back (see your R session console) but not with the names you assigned; you need to capture these values and assign them to names again. See this related question.
The following alternative may help: assign the output of foreach to an intermediate variable, and then assign it to your desired variables in the current 'master process' environment.
PerformanceAll <- foreach(i = 1:14, .combine = "c") %dopar% { mystoploss(data = mydata, n = i) } # pick .combine appropriately
for (i in 1:14) { assign(paste("Performance", i, sep = ""), PerformanceAll[i]) }
Here is the full example I tried:
library(foreach)
library(doSNOW)

mystoploss <- function(data = 1, n = 1) {
  return(runif(data)) # some operation, returns a scalar
}

mydata <- 1
registerDoSNOW(makeCluster(4, type = "SOCK"))
PerformanceAll <- foreach(i = 1:14, .combine = "c") %dopar% { mystoploss(data = mydata, n = i) } # pick .combine appropriately
for (i in 1:14) { assign(paste("Performance", i, sep = ""), PerformanceAll[i]) }
Edit: If the output of mystoploss is a list, then make the following changes:
mystoploss <- function(data = 1, n = 1) { # example
  return(list(a = runif(data), b = 1)) # some operation, returns a list
}
PerformanceAll <- foreach(i = 1:14) %dopar% { mystoploss(data = mydata, n = i) } # remove .combine
for (i in 1:14) { assign(paste("Performance", i, sep = ""), PerformanceAll[[i]]) } # double brackets

R cannot find %do% function

I am trying to learn how to use parallel foreach loops in R.
I tried running the following code:
testParForEach <- function() {
  # testing a parallel foreach loop
  # to parallelize loop:
  library(foreach)
  library(doSNOW)
  cl <- makeCluster(2)
  registerDoSNOW(cl)
  resultdf <- foreach(i = 1:8, .combine = 'rbind') %dopar% {
    foreach(j = 1:2, .combine = 'c') %do% {
      l <- runif(1, i, 100)
      i + j + l
    }
  }
  # close cluster
  stopCluster(cl)
  return(resultdf)
}
(which I got from another post on Stackoverflow) but am getting the error:
Error in { : task 1 failed - "konnte Funktion "%do%" nicht finden"
which means "could not find function %do%".
Has anyone seen this error before?
That is because %do% is defined in the foreach package. The outer parallelization via foreach starts separate R processes (in your case 2), which independently execute the to-be-parallelized code, but the foreach package is not loaded in those processes. Loading it on the workers is done via the .packages parameter of the foreach function.
You can find an example here.
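A minimal sketch of that fix applied to the code above; the only change is the .packages argument, so that each worker loads foreach and can find %do%:

testParForEach <- function() {
  library(foreach)
  library(doSNOW)
  cl <- makeCluster(2)
  registerDoSNOW(cl)
  # .packages = "foreach" loads the foreach package on every worker,
  # so the inner %do% loop is available there.
  resultdf <- foreach(i = 1:8, .combine = 'rbind', .packages = "foreach") %dopar% {
    foreach(j = 1:2, .combine = 'c') %do% {
      l <- runif(1, i, 100)
      i + j + l
    }
  }
  stopCluster(cl)
  return(resultdf)
}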
