indirect indexing/subscripting inside %dopar% - r

I'm not understanding how to do indirect subscripting in %dopar% or in llply( .parallel = TRUE). My actual use-case is a list of formulas, then generating a list of glmer results in a first foreach %dopar%, then calling PBmodcomp on specific pairs of results in a separate foreach %dopar%. My toy example, using numeric indices rather than names of objects in the lists, works fine for %do% but not %dopar%, and fine for alply without .parallel = TRUE but not with .parallel = TRUE. [My real example with glmer and indexing lists by names rather than by integers works with %do% but not %dopar%.]
library(doParallel)
library(foreach)
library(plyr)
cl <- makePSOCKcluster(2) # tiny for toy example
registerDoParallel(cl)
mB <- c(1,2,1,3,4,10)
MO <- c("Full", "noYS", "noYZ", "noYSZS", "noS", "noZ",
"noY", "justS", "justZ", "noSZ", "noYSZ")
# Works
testouts <- foreach(i = 1:length(mB)) %do% {
# mB[i]
MO[mB[i]]
}
testouts
# all NA
testouts2 <- foreach(i = 1:length(mB)) %dopar% {
# mB[i]
MO[mB[i]]
}
testouts2
# Works
testouts3 <- alply(mB, 1, .fun = function(i) { MO[mB[i]]} )
testouts3
# fails "$ operator is invalid for atomic vectors"
testouts4 <- alply(mB, 1, .fun = function(i) { MO[mB[i]]},
.parallel = TRUE,
.paropts = list(.export=ls(.GlobalEnv)))
testouts4
stopCluster(cl)
I've tried various combinations of double brackets like MO[mB[[i]]], to no avail. mB[i] instead of MO[mB[i]] works in all 4 and returns a list of the numbers. I've tried .export(c("MO", "mB")) but just get the message that those objects are already exported.
I assume that there's something I misunderstand about evaluation of expressions like MO[mB[i]] in different environments, but there may be other things I misunderstand, too.
sessionInfo() R version 3.5.1 (2018-07-02) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 7 x64 (build
7601) Service Pack 1
Matrix products: default
locale: [1] LC_COLLATE=English_United States.1252 [2]
LC_CTYPE=English_United States.1252 [3] LC_MONETARY=English_United
States.1252 [4] LC_NUMERIC=C [5]
LC_TIME=English_United States.1252
attached base packages: [1] parallel stats graphics grDevices
utils datasets methods [8] base
other attached packages: [1] plyr_1.8.4 doParallel_1.0.13
iterators_1.0.9 foreach_1.5.0
loaded via a namespace (and not attached): [1] compiler_3.5.1
tools_3.5.1 listenv_0.7.0 Rcpp_0.12.17 [5]
codetools_0.2-15 digest_0.6.15 globals_0.12.1 future_1.8.1
[9] fortunes_1.5-5

The problem appears to be with version 1.5.0 of foreach on r-forge. Version 1.4.4 from CRAN works fine for both foreach %do par% and llply( .parallel = TRUE). For anyone finding this post when searching for %dopar% with lists, here's the code where mList is a named list of formulas, and tList is a named list of pairs of model names to be compared.
tList <- list(Z1 = c("Full", "noYZ"),
Z2 = c("noYS", "noYSZS"),
S1 = c("Full", "noYS"),
S2 = c("noYZ", "noYSZS"),
A1 = c("noYSZS", "noY"),
A2 = c("noSZ", "noYSZ")
)
cl <- makePSOCKcluster(params$nCores) # value from YAML params:
registerDoParallel(cl)
# first run the models
modouts <- foreach(imod = 1:length(mList),
.packages = "lme4") %dopar% {
glmer(as.formula(mList[[imod]]),
data = dsn,
family = poisson,
control = glmerControl(optimizer = "bobyqa",
optCtrl = list(maxfun = 100000),
check.conv.singular = "warning")
)
}
names(modouts) <- names(mList)
####
# now run the parametric bootstrap tests
nSim <- 500
testouts <- foreach(i = seq_along(tList),
.packages = "pbkrtest") %dopar% {
PBmodcomp(modouts[[tList[[i]][1]]],
modouts[[tList[[i]][2]]],
nsim = nSim)
}
names(testouts) <- names(tList)
stopCluster(Cl)

Related

Parallelization of Rcpp without inline/ creating a local package

I am creating a package that I hope to eventually put onto CRAN. I have coded much of the package in C++ with the help of Rcpp and now would like to enable parallelization of this C++ code. I am using the foreach package, however, I am open to switch to snow or a different library if this would work better.
I started by trying to parallelize a simple function:
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
// [[Rcpp::export]]
arma::vec rNorm_c(int length) {
return arma::vec(length, arma::fill::randn);
}
/*** R
n_workers <- parallel::detectCores(logical = F)
cl <- parallel::makeCluster(n_workers)
doParallel::registerDoParallel(cl)
n <- 10
library(foreach)
foreach(j = rep(n, n),
.noexport = c("rNorm_c"),
packages = "Rcpp") %dopar% {rNorm_c(j)}
*/
I added the .noexport because without, I get the error Error in { : task 1 failed - "NULL value passed as symbol address". This led me to this SO post which suggested doing this.
However, I now receive the error Error in { : task 1 failed - "could not find function "rNorm_c"", presumably because I have not followed the top answers instructions to load the function separately at each node. I am unsure of how to do this.
This SO post demonstrates how to do this by writing the C++ code inline, however, since the C++ code for my package is multiple functions, this is likely not the best solution. This SO post advises to create a local package for the workers to load and make calls to, however, since I am hoping to make this code available in a CRAN package, it does not seem as though a local package would be possible unless I wanted to attempt to publish two CRAN packages.
Any suggestions for how to approach this or references to resources for parallelization of Rcpp code would be appreciated.
EDIT:
I used the above function to create a package called rnormParallelization. In this package, I also included a couple of R functions, one of which made use of the snow package to parallelize a for loop using the rNorm_c function:
rNorm_samples_for <- function(num_samples, length){
sample_mat <- matrix(NA, length, num_samples)
for (j in 1:num_samples){
sample_mat[ , j] <- rNorm_c(length)
}
return(sample_mat)
}
rNorm_samples_snow1 <- function(num_samples, length){
clus <- snow::makeCluster(3)
snow::clusterExport(clus, "rNorm_c")
out <- snow::parSapply(clus, rep(length, num_samples), rNorm_c)
snow::stopCluster(clus)
return(out)
}
Both functions work as expected:
> rNorm_samples_for(2, 3)
[,1] [,2]
[1,] -0.82040308 -0.3284849
[2,] -0.05169948 1.7402912
[3,] 0.32073516 0.5439799
> rNorm_samples_snow1(2, 3)
[,1] [,2]
[1,] -0.07483493 1.3028315
[2,] 1.28361663 -0.4360829
[3,] 1.09040771 -0.6469646
However, the parallelized version works considerably slower:
> microbenchmark::microbenchmark(
+ rnormParallelization::rNorm_samples_for(1e3, 1e4),
+ rnormParallelization::rNorm_samples_snow1(1e3, 1e4)
+ )
Unit: milliseconds
expr min lq
rnormParallelization::rNorm_samples_for(1000, 10000) 217.0871 249.3977
rnormParallelization::rNorm_samples_snow1(1000, 10000) 1242.8315 1397.7643
mean median uq max neval
320.5456 285.9787 325.3447 802.7488 100
1527.0406 1482.5867 1563.0916 3411.5774 100
Here is my session info:
> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] rnormParallelization_1.0
loaded via a namespace (and not attached):
[1] microbenchmark_1.4-7 compiler_4.1.1 snow_0.4-4
[4] parallel_4.1.1 tools_4.1.1 Rcpp_1.0.7
GitHub repo with both of these scripts

Error in rep(" ", len) : invalid 'times' argument

library(OneR)
library(RWeka)
loan_train <- read.csv("loan_train.csv")
loan_test <- read.csv("loan_test.csv")
loan_train <- optbin(loan_train, method = "logreg", na.omit = TRUE)
loan_test <- optbin(loan_test, method = "logreg", na.omit = TRUE)
#Task 1
loan_1R <- OneR(bad_loans ~ ., data = loan_train)
loan_1R
loan_JRip <- JRip(bad_loans ~ ., data = loan_train)
loan_JRip
Need some help with my code. I am able to run everything but for some reason, every time I print loan_1R, it gives me an error. Tried using traceback() but have no idea what it means. My csv file can be in the link below.
https://drive.google.com/file/d/1139FUSXUc_fdzgtKAleo5bGAtjcVGoRC/view?usp=sharing
Error in rep(" ", len) : invalid 'times' argument
In addition: Warning message:
In max(nchar(names(model$rules))) :
no non-missing arguments to max; returning -Inf
> traceback()
3: cat("If ", model$feature, " = ", names(model$rules[iter]), rep(" ",
len), " then ", model$target, " = ", model$rules[[iter]],
"\n", sep = "")
2: print.OneR(x)
1: function (x, ...)
UseMethod("print")(x)
> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_Singapore.1252 LC_CTYPE=English_Singapore.1252
[3] LC_MONETARY=English_Singapore.1252 LC_NUMERIC=C
[5] LC_TIME=English_Singapore.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RWeka_0.4-37 OneR_2.2
loaded via a namespace (and not attached):
[1] compiler_3.4.1 tools_3.4.1 grid_3.4.1 rJava_0.9-9 RWekajars_3.9.2-1
After hours of testing i found out the problem but I have no idea why it is so. Think that it has something to do with the library(RWeka) package.... Placing library(RWeka) after the OneR code seemed to make it run. But this means i encounter the error only once i run the library(RWeka). Any workaround this?
library(OneR)
loan_train <- read.csv("loan_train.csv")
loan_test <- read.csv("loan_test.csv")
loan_train <- optbin(loan_train, method = "logreg", na.omit = TRUE)
loan_test <- optbin(loan_test, method = "logreg", na.omit = TRUE)
#Task 1
loan_1R <- OneR(bad_loans ~ ., data = loan_train)
loan_1R
library(RWeka)
loan_JRip <- JRip(bad_loans ~ ., data = loan_train)
loan_JRip

RWeka will not work with caret or possibly %dopar%

I am completing the exercises from Applied Predictive Modeling, the R textbook for the caret package, by the authors. I cannot get the train function to work with methods M5P or M5Rules.
The code will run fine manually:
data("permeability")
trainIndex <- createDataPartition(permeability[, 1], p = 0.75,
list = FALSE)
fingerNZV <- nearZeroVar(fingerprints, saveMetrics = TRUE)
trainY <- permeability[trainIndex, 1]
testY <- permeability[-trainIndex, 1]
trainX <- fingerprints[trainIndex, !fingerNZV$nzv]
testX <- fingerprints[-trainIndex, !fingerNZV$nzv]
indx <- createFolds(trainY, k = 10, returnTrain = TRUE)
ctrl <- trainControl('cv', index = indx)
m5Tuner <- t(as.matrix(expand.grid(
N = c(1, 0),
U = c(1, 0),
M = floor(seq(4, 15, length.out = 3))
)))
startTime <- Sys.time()
m5Tune <- foreach(tuner = m5Tuner) %do% {
m5ctrl <- Weka_control(M = tuner[3],
N = tuner[1] == 1,
U = tuner[2] == 1)
mods <- lapply(ctrl$index,function(fold) {
d <- cbind(data.frame(permeability = trainY[fold]),
trainX[fold, ])
mod <- M5P(permeability ~ ., d, control = m5ctrl)
rmse <- RMSE(predict(mod, as.data.frame(trainX[-fold, ])),
trainY[-fold])
list(model = mod, rmse = rmse)
})
mean_rmse <- mean(sapply(mods, '[[', 'rmse'))
list(models = mods, mean_rmse = mean_rmse)
}
endTime <- Sys.time()
endTime - startTime
# Time difference of 59.17742 secs
The same data and controls (swapping 'rules' for 'M' -why can't I specify M as a tuning parameter?) will not finish:
m5Tuner <- expand.grid(
pruned = c("Yes", "No"),
smoothed = c("Yes", "No"),
rules = c("Yes", "No")
)
m5Tune <- train(trainX, trainY,
method = 'M5',
trControl = ctrl,
tuneGrid = m5Tuner,
control = Weka_control(M = 10))
The example from the book will not finish, either:
library(caret)
data(solubility)
set.seed(100)
indx <- createFolds(solTrainY, returnTrain = TRUE)
ctrl <- trainControl(method = "cv", index = indx)
set.seed(100)
m5Tune <- train(x = solTrainXtrans, y = solTrainY,
method = "M5",
trControl = ctrl,
control = Weka_control(M = 10))
This may be a problem with the use of a parallel backend with RWeka, for me, at least. My example from above will not finish with %dopar%.
I have run sudo R CMD javareconf before each example and restarted Rstudio.
sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Arch Linux
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=en_US.UTF-8
[9] LC_ADDRESS=en_US.UTF-8 LC_TELEPHONE=en_US.UTF-8
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] APMBook_0.0.0.9000 RWeka_0.4-27
[3] caret_6.0-68 ggplot2_2.1.0
[5] lattice_0.20-30 AppliedPredictiveModeling_1.1-6
# dozens others loaded via namespace.
When using parallel processing with train and RWeka models, you should have gotten the error:
In train.default(trainX, trainY, method = "M5", trControl = ctrl, :
Models using Weka will not work with parallel processing with multicore/doMC
The java interface to Weka does not work with multiple workers.
It takes a while but the train call will complete if you don't have workers registered with foreach
Max

parallel data.table -- what's the correct syntax

Following up some data.table parallelism (1) (2) (3) I'm trying to figure it out. What's wrong with this syntax?
library(data.table)
set.seed(1234)
dt <- data.table(id= factor(sample(1L:10000L, size= 1e6, replace= TRUE)),
val= rnorm(n= 1e6), key="id")
foo <- function(l) sum(l)
dt2 <- dt[, foo(.SD), by= "id"]
library(parallel)
cl <- makeCluster(detectCores())
dt3 <- clusterApply(cl, x= parallel:::splitRows(dt, detectCores()),
fun=lapply, FUN= function(x,foo) {
x[, foo(data.table:::".SD"), by= "id"]
}, foo= foo)
stopCluster(cl)
# note that library(parallel) is annoying and you often have to do this type ("::", ":::") of exporting to the parallel package
Error in checkForRemoteErrors(val) :
4 nodes produced errors; first error: incorrect number of dimensions
cl <- makeCluster(detectCores())
dt3 <- clusterApply(cl, x= parallel:::splitRows(dt, detectCores()),
fun=lapply, FUN= function(x,foo) {
x <- data.table::data.table(x)
x[, foo(data.table:::".SD"), by= "id"]
}, foo= foo)
stopCluster(cl)
Error in checkForRemoteErrors(val) :
4 nodes produced errors; first error: object 'id' not found
I've played around with the syntax quite a bit. These two seem to be the closest I can get. And obviously something's still not right.
My real problem is similarly structured but has many more rows and I'm using a machine with 24 cores / 48 logical processors. So watching my computer use roughly 4% of it's computing power (by using only 1 core) is really annoying
You may want to evaluate Rserve solution for parallelism.
See below example build on Rserve using 2 R nodes locally in parallel. It can be distributed over remote instances also.
library(data.table)
set.seed(1234)
dt <- data.table(id= factor(sample(1L:10000L, size= 1e6, replace= TRUE)),
val= rnorm(n= 1e6), key="id")
foo <- function(l) sum(l)
library(big.data.table)
# start 2 R instances
library(Rserve)
port = 6311:6312
invisible(sapply(port, function(port) Rserve(debug = FALSE, port = port, args = c("--no-save"))))
# client side
rscl = rscl.connect(port = port, pkgs = "data.table") # connect and auto require packages
bdt = as.big.data.table(dt, rscl) # create big.data.table from local data.table and list of connections to R nodes
rscl.assign(rscl, "foo", foo) # assign `foo` function to nodes
bdt[, foo(.SD), by="id"][, foo(.SD), by="id"] # first query is run remotely, second locally
# id V1
# 1: 1 10.328998
# 2: 2 -8.448441
# 3: 3 21.475910
# 4: 4 -5.302411
# 5: 5 -11.929699
# ---
# 9996: 9996 -4.905192
# 9997: 9997 -4.293194
# 9998: 9998 -2.387100
# 9999: 9999 16.530731
#10000: 10000 -15.390543
# optionally with special care
# bdt[, foo(.SD), by= "id", outer.aggregate = TRUE]
session info:
R version 3.2.3 (2015-12-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.4 LTS
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 LC_PAPER=en_GB.UTF-8
[8] LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Rserve_1.8-5 big.data.table_0.3.3 data.table_1.9.7
loaded via a namespace (and not attached):
[1] RSclient_0.7-3 tools_3.2.3

Missing object error when using step() within a user-defined function

5 days and still no answer
As can be seen by Simon's comment, this is a reproducible and very strange issue. It seems that the issue only arises when a stepwise regression with very high predictive power is wrapped in a function.
I have been struggling with this for a while and any help would be much appreciated. I am trying to write a function that runs several stepwise regressions and outputs all of them to a list. However, R is having trouble reading the dataset that I specify in my function arguments. I found several similar errors on various boards (here, here, and here), however none of them seemed to ever get resolved. It all comes down to some weird issues with calling step() in a user-defined function. I am using the following script to test my code. Run the whole thing several times until an error arises (trust me, it will):
test.df <- data.frame(a = sample(0:1, 100, rep = T),
b = as.factor(sample(0:5, 100, rep = T)),
c = runif(100, 0, 100),
d = rnorm(100, 50, 50))
test.df$b[10:100] <- test.df$a[10:100] #making sure that at least one of the variables has some predictive power
stepModel <- function(modeling.formula, dataset, outfile = NULL) {
if (is.null(outfile) == FALSE){
sink(file = outfile,
append = TRUE, type = "output")
print("")
print("Models run at:")
print(Sys.time())
}
model.initial <- glm(modeling.formula,
family = binomial,
data = dataset)
model.stepwise1 <- step(model.initial, direction = "backward")
model.stepwise2 <- step(model.stepwise1, scope = ~.^2)
output <- list(modInitial = model.initial, modStep1 = model.stepwise1, modStep2 = model.stepwise2)
sink()
return(output)
}
blah <- stepModel(a~., dataset = test.df)
This returns the following error message (if the error does not show up right away, keep re-running the test.df script as well as the call for stepModel(), it will show up eventually):
Error in is.data.frame(data) : object 'dataset' not found
I have determined that everything runs fine up until model.stepwise2 starts to get built. Somehow, the temporary object 'dataset' works just fine for the first stepwise regression, but fails to be recognized by the second. I found this by commenting out part of the function as can be seen below. This code will run fine, proving that the object 'dataset' was originally being recognized:
stepModel1 <- function(modeling.formula, dataset, outfile = NULL) {
if (is.null(outfile) == FALSE){
sink(file = outfile,
append = TRUE, type = "output")
print("")
print("Models run at:")
print(Sys.time())
}
model.initial <- glm(modeling.formula,
family = binomial,
data = dataset)
model.stepwise1 <- step(model.initial, direction = "backward")
# model.stepwise2 <- step(model.stepwise1, scope = ~.^2)
# sink()
# output <- list(modInitial = model.initial, modStep1 = model.stepwise1, modStep2 = model.stepwise2)
return(model.stepwise1)
}
blah1 <- stepModel1(a~., dataset = test.df)
EDIT - before anyone asks, all the summary() functions were there because the full function (i edited it so that you could focus in on the error) has another piece that defines a file to which you can output stepwise trace. I just got rid of them
EDIT 2 - session info
sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] tcltk stats graphics grDevices utils datasets methods base
other attached packages:
[1] sqldf_0.4-6.4 RSQLite.extfuns_0.0.1 RSQLite_0.11.3 chron_2.3-43
[5] gsubfn_0.6-5 proto_0.3-10 DBI_0.2-6 ggplot2_0.9.3.1
[9] caret_5.15-61 reshape2_1.2.2 lattice_0.20-6 foreach_1.4.0
[13] cluster_1.14.2 plyr_1.8
loaded via a namespace (and not attached):
[1] codetools_0.2-8 colorspace_1.2-1 dichromat_2.0-0 digest_0.6.2 grid_2.15.1
[6] gtable_0.1.2 iterators_1.0.6 labeling_0.1 MASS_7.3-18 munsell_0.4
[11] RColorBrewer_1.0-5 scales_0.2.3 stringr_0.6.2 tools_2.15
EDIT 3 - this performs all the same operations as the function, just without using a function. This will run fine every time, even when the algorithm doesn't converge:
modeling.formula <- a~.
dataset <- test.df
outfile <- NULL
if (is.null(outfile) == FALSE){
sink(file = outfile,
append = TRUE, type = "output")
print("")
print("Models run at:")
print(Sys.time())
}
model.initial <- glm(modeling.formula,
family = binomial,
data = dataset)
model.stepwise1 <- step(model.initial, direction = "backward")
model.stepwise2 <- step(model.stepwise1, scope = ~.^2)
output <- list(modInitial = model.initial, modStep1 = model.stepwise1, modStep2 = model.stepwise2)
Using do.call to refer to the data set in the calling environment works for me. See https://stackoverflow.com/a/7668846/210673 for the original suggestion. Here's a version that works (with sink code removed).
stepModel2 <- function(modeling.formula, dataset) {
model.initial <- do.call("glm", list(modeling.formula,
family = "binomial",
data = as.name(dataset)))
model.stepwise1 <- step(model.initial, direction = "backward")
model.stepwise2 <- step(model.stepwise1, scope = ~.^2)
list(modInitial = model.initial, modStep1 = model.stepwise1, modStep2 = model.stepwise2)
}
blah <- stepModel2(a~., dataset = "test.df")
It fails for me consistently with set.seed(6) with the original code. The reason it fails is that the dataset variable is not present within the step function, and although it's not needed in making model.stepwise1, it is needed for model.stepwise2 when model.stepwise1 keeps a linear term. So that's the case when your version fails. Calling the dataset from the global environment as I do here fixes this issue.

Resources