Parallel execution of train in caret fails with function not found - r

yesterday I updated my R packages and since then parallel execution of the train function fails.
It seems like some functions that are called from within the workers are not available. These functions are at least flatTable and probFunction.
I experiencing this issues on my production machine, and was able to reproduce it on a clean Windows 7 x64 VM.
I added a minimal working example below. Dear users of stackoverflow: Any help is appreciated!
# R 3.0.2 x64, RStudio Version 0.98.490, Windows 7 x64
data(iris)
library(caret) # 6.0-21
library(doParallel) # 1.0.6
model <- "rf"
# Fail
?probFunction
?flatTable
fitControl <- trainControl(
method = "repeatedcv"
, number = 5 ## 5-fold CV
, repeats = 1 ## repeated one times
, verboseIter =TRUE
)
#### Sequential Version ####
# Runs
train(Species ~ ., data = iris, method = model, trControl = fitControl)
#### Parallelized version ####
# Fails with
# Error in e$fun(obj, substitute(ex), parent.frame(), e$data) :
# worker initialization failed: Error in eval(expr, envir, enclos): could not find function "flatTable"
cl <- makeCluster(3)
registerDoParallel(cl)
train(Species ~ ., data = iris, method = model, trControl = fitControl)
stopCluster(cl)
# Fails with
# Error in { : task 1 failed - "could not find function "probFunction""
fitControl <- trainControl(
method = "repeatedcv"
, number = 5 ## 5-fold CV
, repeats = 1 ## repeated one times
, verboseIter =TRUE
, classProbs = TRUE
)
cl <- makeCluster(3)
registerDoParallel(cl)
train(Species ~ ., data = iris, method = model, trControl = fitControl)
stopCluster(cl)
#### Again sequential version ####
# Fails with
# Error in summary.connection(connection) : invalid connection
train(Species ~ ., data = iris, method = model, trControl = fitControl)
R Session Info
R version 3.0.2 (2013-09-25)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252 LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C LC_TIME=German_Germany.1252
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] e1071_1.6-1 class_7.3-9 randomForest_4.6-7 doParallel_1.0.6 iterators_1.0.6
[6] foreach_1.4.1 caret_6.0-21 ggplot2_0.9.3.1 lattice_0.20-23
loaded via a namespace (and not attached):
[1] car_2.0-19 codetools_0.2-8 colorspace_1.2-4 compiler_3.0.2 dichromat_2.0-0
[6] digest_0.6.4 grid_3.0.2 gtable_0.1.2 labeling_0.2 MASS_7.3-29
[11] munsell_0.4.2 nnet_7.3-7 plyr_1.8 proto_0.3-10 RColorBrewer_1.0-5
[16] reshape2_1.2.2 scales_0.2.3 stringr_0.6.2 tools_3.0.2

The error that you're getting is caused by a bug in caret 6.0-21 when using doParallel, doSNOW, and doMPI. It's been fixed in version 6.0-22 in R-forge, but hasn't been released to CRAN yet. If you don't want to wait for the new version to be released, you can:
Downgrade to caret 5.x
Install caret 6.0-22 from R-forge
Install and use doSNOW 1.0.10 from R-forge rather than doParallel
The problem was caused by a change in CRAN policy that forbids the use of the ::: operator, even when referencing non-exported functions from within the same package.
Update
Caret 6.0-22 was released to CRAN on 2014-01-18. This should resolve the reported problem using caret with doSNOW and similar parallel backends.

The first error (could not find function ...) disappears with newer versions, as suggested by #Steve Weston, but the second error (Error in summary.connection(connection) : invalid connection) persists.
With caret version 6.0.84, I could fix it by adding allowParallel = F to the trainControl arguments for the last sequential run.
The last part of the code in the question changes to:
#### Again sequential version (new) ####
fitControl_new <- trainControl(
method = "repeatedcv"
, number = 5
, repeats = 1
, verboseIter =TRUE
, classProbs = TRUE
, allowParallel = F ## add this argument to overwrite the default TRUE
)
train(Species ~ ., data = iris, method = model, trControl = fitControl_new)

Related

indirect indexing/subscripting inside %dopar%

I'm not understanding how to do indirect subscripting in %dopar% or in llply( .parallel = TRUE). My actual use-case is a list of formulas, then generating a list of glmer results in a first foreach %dopar%, then calling PBmodcomp on specific pairs of results in a separate foreach %dopar%. My toy example, using numeric indices rather than names of objects in the lists, works fine for %do% but not %dopar%, and fine for alply without .parallel = TRUE but not with .parallel = TRUE. [My real example with glmer and indexing lists by names rather than by integers works with %do% but not %dopar%.]
library(doParallel)
library(foreach)
library(plyr)
cl <- makePSOCKcluster(2) # tiny for toy example
registerDoParallel(cl)
mB <- c(1,2,1,3,4,10)
MO <- c("Full", "noYS", "noYZ", "noYSZS", "noS", "noZ",
"noY", "justS", "justZ", "noSZ", "noYSZ")
# Works
testouts <- foreach(i = 1:length(mB)) %do% {
# mB[i]
MO[mB[i]]
}
testouts
# all NA
testouts2 <- foreach(i = 1:length(mB)) %dopar% {
# mB[i]
MO[mB[i]]
}
testouts2
# Works
testouts3 <- alply(mB, 1, .fun = function(i) { MO[mB[i]]} )
testouts3
# fails "$ operator is invalid for atomic vectors"
testouts4 <- alply(mB, 1, .fun = function(i) { MO[mB[i]]},
.parallel = TRUE,
.paropts = list(.export=ls(.GlobalEnv)))
testouts4
stopCluster(cl)
I've tried various combinations of double brackets like MO[mB[[i]]], to no avail. mB[i] instead of MO[mB[i]] works in all 4 and returns a list of the numbers. I've tried .export(c("MO", "mB")) but just get the message that those objects are already exported.
I assume that there's something I misunderstand about evaluation of expressions like MO[mB[i]] in different environments, but there may be other things I misunderstand, too.
sessionInfo() R version 3.5.1 (2018-07-02) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 7 x64 (build
7601) Service Pack 1
Matrix products: default
locale: [1] LC_COLLATE=English_United States.1252 [2]
LC_CTYPE=English_United States.1252 [3] LC_MONETARY=English_United
States.1252 [4] LC_NUMERIC=C [5]
LC_TIME=English_United States.1252
attached base packages: [1] parallel stats graphics grDevices
utils datasets methods [8] base
other attached packages: [1] plyr_1.8.4 doParallel_1.0.13
iterators_1.0.9 foreach_1.5.0
loaded via a namespace (and not attached): [1] compiler_3.5.1
tools_3.5.1 listenv_0.7.0 Rcpp_0.12.17 [5]
codetools_0.2-15 digest_0.6.15 globals_0.12.1 future_1.8.1
[9] fortunes_1.5-5
The problem appears to be with version 1.5.0 of foreach on r-forge. Version 1.4.4 from CRAN works fine for both foreach %do par% and llply( .parallel = TRUE). For anyone finding this post when searching for %dopar% with lists, here's the code where mList is a named list of formulas, and tList is a named list of pairs of model names to be compared.
tList <- list(Z1 = c("Full", "noYZ"),
Z2 = c("noYS", "noYSZS"),
S1 = c("Full", "noYS"),
S2 = c("noYZ", "noYSZS"),
A1 = c("noYSZS", "noY"),
A2 = c("noSZ", "noYSZ")
)
cl <- makePSOCKcluster(params$nCores) # value from YAML params:
registerDoParallel(cl)
# first run the models
modouts <- foreach(imod = 1:length(mList),
.packages = "lme4") %dopar% {
glmer(as.formula(mList[[imod]]),
data = dsn,
family = poisson,
control = glmerControl(optimizer = "bobyqa",
optCtrl = list(maxfun = 100000),
check.conv.singular = "warning")
)
}
names(modouts) <- names(mList)
####
# now run the parametric bootstrap tests
nSim <- 500
testouts <- foreach(i = seq_along(tList),
.packages = "pbkrtest") %dopar% {
PBmodcomp(modouts[[tList[[i]][1]]],
modouts[[tList[[i]][2]]],
nsim = nSim)
}
names(testouts) <- names(tList)
stopCluster(Cl)

R package mlr exhausts memory with multicore

I am trying to run a reproducible example with the mlr R package in parallel, for which I have found the solution of using parallelStartMulticore (link). The project runs with packrat as well.
The code runs properly on workstations and small servers, but running it in an HPC with the torque batch system runs into memory exhaustion. It seems that R threads are spawned ad infinitum, contrary to regular linux machines. I have tried to switch to parallelStartSocket, which works fine, but then I cannot reproduce the results with RNG seeds.
Here is a minimal example:
library(mlr)
library(parallelMap)
M <- data.frame(x = runif(1e2), y = as.factor(rnorm(1e2) > 0))
# Example with random forest
parallelStartMulticore(parallel::detectCores())
plyr::l_ply(
seq(100),
function(x) {
message("Iteration number: ", x)
set.seed(1, "L'Ecuyer")
tsk <- makeClassifTask(data = M, target = "y")
num_ps <- makeParamSet(
makeIntegerParam("ntree", lower = 10, upper = 50),
makeIntegerParam("nodesize", lower = 1, upper = 5)
)
ctrl <- makeTuneControlGrid(resolution = 2L, tune.threshold = TRUE)
# define learner
lrn <- makeLearner("classif.randomForest", predict.type = "prob")
rdesc <- makeResampleDesc("CV", iters = 2L, stratify = TRUE)
# Grid search in parallel
res <- tuneParams(
lrn, task = tsk, resampling = rdesc, par.set = num_ps,
measures = list(auc), control = ctrl)
# Fit optimal params
lrn.optim <- setHyperPars(lrn, par.vals = res$x)
m <- train(lrn.optim, tsk)
# Test set
pred_rf <- predict(m, newdata = M)
pred_rf
}
)
parallelStop()
The hardware of the HPC is an HP Apollo 6000 System ProLiant XL230a Gen9 Server blade 64-bit, with Intel Xeon E5-2683 processors. I ignore if the issue comes from the torque batch system, the hardware or any flaw in the above code. The sessionInfo() of the HPC:
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS/LAPACK: /cm/shared/apps/intel/parallel_studio_xe/2017/compilers_and_libraries_2017.0.098/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so
locale:
[1] C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] parallelMap_1.3 mlr_2.11 ParamHelpers_1.10 RLinuxModules_0.2
loaded via a namespace (and not attached):
[1] Rcpp_0.12.14 splines_3.4.0 munsell_0.4.3
[4] colorspace_1.3-2 lattice_0.20-35 rlang_0.1.1
[7] plyr_1.8.4 tools_3.4.0 parallel_3.4.0
[10] grid_3.4.0 packrat_0.4.8-1 checkmate_1.8.2
[13] data.table_1.10.4 gtable_0.2.0 randomForest_4.6-12
[16] survival_2.41-3 lazyeval_0.2.0 tibble_1.3.1
[19] Matrix_1.2-12 ggplot2_2.2.1 stringi_1.1.5
[22] compiler_3.4.0 BBmisc_1.11 scales_0.4.1
[25] backports_1.0.5
The "multicore" parallelMap backend uses parallel::mcmapply which should create a new fork()ed child process for every evaluation inside tuneParams and then quickly kill that process. Depending on what you use to count memory usage / active processes, it is possible that memory gets mis-reported and that child processes that are already dead (and were only alive for the fraction of a second) are shown, or that killing of finished processes for some reason does not happen.
Possible problems:
The batch system does not correctly track memory usage and counts the parent process's memory for every child separately. Does /usr/bin/free actually report that 30GB are gone while the script is running? As an easier test case, consider (running in an empty R session)
xxx <- 1:1e9
parallel::mclapply(1:4, function(x) {
Sys.sleep(60)
}, mc.cores = 4)
which should use about 4 GB of memory. If, during the 60 seconds in the child process, the reported memory usage is about 16 GB, it is this problem.
Memory reporting is accurate, but for some reason the memory space is changed a lot inside the child processes (triggering many COW writes), e.g. because of garbage collection. Does calling gc() before the tuneParams() call help?
Some setting on the machine prevents the "parallel" package from killing child processes. The following:
parallel::mclapply(1:4, function(x) {
xxx <<- 1:1e9 ; NULL
}, mc.cores = 4)
Sys.sleep(60)
should grab about 16 GB of memory, but release it right away. If the memory remains used during the Sys.sleep (and the remaining R session), it might be this problem.

Quantstrat WFA with intraday Data

I've been getting WFA to run on the full set of intraday GBPUSD 30min data, and have come across a couple of things that need addressing. The first is I believe the save function needs changing to remove the time from the string (as shown here as a pull request on the R-Finance/quantstrat repo on github). The walk.forward function throws this error:
Error in gzfile(file, "wb") : cannot open the connection
In addition: Warning message:
In gzfile(file, "wb") :
cannot open compressed file 'wfa.GBPUSD.2002-10-21 00:30:00.2002-10-23 23:30:00.RData', probable reason 'Invalid argument'
The second is a rare case scenario where its ends up calling runSum on a data set with less rows than the period you are testing (n). This is the traceback():
8: stop("Invalid 'n'")
7: runSum(x, n)
6: runMean(x, n)
5: (function (x, n = 10, ...)
{
ma <- runMean(x, n)
if (!is.null(dim(ma))) {
colnames(ma) <- "SMA"
}
return(ma)
})(x = Cl(mktdata)[, 1], n = 25)
4: do.call(indFun, .formals)
3: applyIndicators(strategy = strategy, mktdata = mktdata, parameters = parameters,
...)
2: applyStrategy(strategy, portfolios = portfolio.st, mktdata = symbol[testing.timespan]) at custom.walk.forward.R#122
1: walk.forward(strategy.st, paramset.label = "WFA", portfolio.st = portfolio.st,
account.st = account.st, period = "days", k.training = 3,
k.testing = 1, obj.func = my.obj.func, obj.args = list(x = quote(result$apply.paramset)),
audit.prefix = "wfa", anchored = FALSE, verbose = TRUE)
The extended GBPUSD data used in the creation of the Luxor Demo includes an erroneous date (2002/10/27) with only 1 observation which causes this problem. I can also foresee this being an issue when testing longer signal periods on instruments like Crude where they have only a few trading hours on Sunday evenings (UTC).
Given that I have purely been following the Luxor demo with the same (extended) intra-day data set, are these genuine issues or have they been caused by package updates etc?
What is the preferred way for these things to be reported to the authors of QS, and find out if/when fixes are likely to be made?
SessionInfo():
R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252 LC_MONETARY=English_Australia.1252 LC_NUMERIC=C LC_TIME=English_Australia.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] quantstrat_0.9.1739 foreach_1.4.3 blotter_0.9.1741 PerformanceAnalytics_1.4.4000 FinancialInstrument_1.2.0 quantmod_0.4-5 TTR_0.23-1
[8] xts_0.9.874 zoo_1.7-13
loaded via a namespace (and not attached):
[1] compiler_3.3.0 tools_3.3.0 codetools_0.2-14 grid_3.3.0 iterators_1.0.8 lattice_0.20-33
quantstrat is on github here:
https://github.com/braverock/quantstrat
Issues and patches should be reported via github issues.

Error when using rfe in caret with a PLS-DA model

I'm trying to use rfe function from the caret package in combination with PLS-DA model.
sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] splines grid parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] mclust_4.4 Kendall_2.2 doBy_4.5-13 survival_2.37-7 statmod_1.4.20
[6] preprocessCore_1.26.1 sva_3.10.0 mgcv_1.8-4 nlme_3.1-119 corpcor_1.6.7
[11] car_2.0-22 reshape2_1.4.1 gplots_2.16.0 DMwR_0.4.1 mi_0.09-19
[16] arm_1.7-07 lme4_1.1-7 Matrix_1.1-5 MASS_7.3-37 randomForest_4.6-10
[21] plyr_1.8.1 pls_2.4-3 caret_6.0-41 ggplot2_1.0.0 lattice_0.20-29
[26] pcaMethods_1.54.0 Rcpp_0.11.4 Biobase_2.24.0 BiocGenerics_0.10.0
loaded via a namespace (and not attached):
[1] abind_1.4-0 bitops_1.0-6 boot_1.3-14 BradleyTerry2_1.0-5 brglm_0.5-9 caTools_1.17.1
[7] class_7.3-11 coda_0.16-1 codetools_0.2-10 colorspace_1.2-4 compiler_3.1.1 digest_0.6.8
[13] e1071_1.6-4 foreach_1.4.2 foreign_0.8-62 gdata_2.13.3 gtable_0.1.2 gtools_3.4.1
[19] iterators_1.0.7 KernSmooth_2.23-13 minqa_1.2.4 munsell_0.4.2 nloptr_1.0.4 nnet_7.3-8
[25] proto_0.3-10 quantmod_0.4-3 R2WinBUGS_2.1-19 ROCR_1.0-5 rpart_4.1-8 scales_0.2.4
[31] stringr_0.6.2 tools_3.1.1 TTR_0.22-0 xts_0.9-7 zoo_1.7-11
To practice I ran the following example using the iris data.
data(iris)
subsets <- 2:4
ctrl <- rfeControl(functions = caretFuncs, method = 'cv', number = 5, verbose=TRUE)
trctrl <- trainControl(method='cv', number=5)
mod <- rfe(Species ~., data = iris, sizes = subsets, rfeControl = ctrl, trControl = trctrl, method = 'pls')
All works well.
mod
Recursive feature selection
Outer resampling method: Cross-Validated (5 fold)
Resampling performance over subset size:
Variables Accuracy Kappa AccuracySD KappaSD Selected
2 0.6533 0.48 0.02981 0.04472
3 0.8067 0.71 0.06412 0.09618 *
4 0.7867 0.68 0.07674 0.11511
The top 3 variables (out of 3):
Sepal.Width, Petal.Length, Sepal.Length
However, if I try to replicate this on data I have generated I get the following error. I can't work out why! If you have any ideas I'd be really interested in hearing them.
x <- as.data.frame(matrix(0,10,10))
for(i in 1:9) {x[,i] <- rnorm(10,0,1)}
x[,10] <- as.factor(rbinom(10, 1, 0.5))
subsets <- 2:9
ctrl <- rfeControl(functions = caretFuncs, method = 'cv', number = 5, verbose=TRUE)
trctrl <- trainControl(method='cv', number=5)
mod <- rfe(V10 ~., data = x, sizes = subsets, rfeControl = ctrl, trControl = trctrl, method = 'pls')
Error in { : task 1 failed - "undefined columns selected"
In addition: Warning messages:
1: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
2: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
3: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
4: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
5: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
I have worked out (after a lot of to-ing and fro-ing) that levels of the response factor variable have to be characters to combine PLS-DA with RFE in caret.
For example...
x <- data.frame(matrix(rnorm(1000),100,10))
y <- as.factor(c(rep('Positive',40), rep('Negative',60)))
data <- data.frame(x,y)
subsets <- 2:9
ctrl <- rfeControl(functions = caretFuncs, method = 'cv', number = 5, verbose=TRUE)
trctrl <- trainControl(method='cv', number=5)
mod <- rfe(y ~., data, sizes = subsets, rfeControl = ctrl, trControl = trctrl, method = 'pls')

Missing object error when using step() within a user-defined function

5 days and still no answer
As can be seen by Simon's comment, this is a reproducible and very strange issue. It seems that the issue only arises when a stepwise regression with very high predictive power is wrapped in a function.
I have been struggling with this for a while and any help would be much appreciated. I am trying to write a function that runs several stepwise regressions and outputs all of them to a list. However, R is having trouble reading the dataset that I specify in my function arguments. I found several similar errors on various boards (here, here, and here), however none of them seemed to ever get resolved. It all comes down to some weird issues with calling step() in a user-defined function. I am using the following script to test my code. Run the whole thing several times until an error arises (trust me, it will):
test.df <- data.frame(a = sample(0:1, 100, rep = T),
b = as.factor(sample(0:5, 100, rep = T)),
c = runif(100, 0, 100),
d = rnorm(100, 50, 50))
test.df$b[10:100] <- test.df$a[10:100] #making sure that at least one of the variables has some predictive power
stepModel <- function(modeling.formula, dataset, outfile = NULL) {
if (is.null(outfile) == FALSE){
sink(file = outfile,
append = TRUE, type = "output")
print("")
print("Models run at:")
print(Sys.time())
}
model.initial <- glm(modeling.formula,
family = binomial,
data = dataset)
model.stepwise1 <- step(model.initial, direction = "backward")
model.stepwise2 <- step(model.stepwise1, scope = ~.^2)
output <- list(modInitial = model.initial, modStep1 = model.stepwise1, modStep2 = model.stepwise2)
sink()
return(output)
}
blah <- stepModel(a~., dataset = test.df)
This returns the following error message (if the error does not show up right away, keep re-running the test.df script as well as the call for stepModel(), it will show up eventually):
Error in is.data.frame(data) : object 'dataset' not found
I have determined that everything runs fine up until model.stepwise2 starts to get built. Somehow, the temporary object 'dataset' works just fine for the first stepwise regression, but fails to be recognized by the second. I found this by commenting out part of the function as can be seen below. This code will run fine, proving that the object 'dataset' was originally being recognized:
stepModel1 <- function(modeling.formula, dataset, outfile = NULL) {
if (is.null(outfile) == FALSE){
sink(file = outfile,
append = TRUE, type = "output")
print("")
print("Models run at:")
print(Sys.time())
}
model.initial <- glm(modeling.formula,
family = binomial,
data = dataset)
model.stepwise1 <- step(model.initial, direction = "backward")
# model.stepwise2 <- step(model.stepwise1, scope = ~.^2)
# sink()
# output <- list(modInitial = model.initial, modStep1 = model.stepwise1, modStep2 = model.stepwise2)
return(model.stepwise1)
}
blah1 <- stepModel1(a~., dataset = test.df)
EDIT - before anyone asks, all the summary() functions were there because the full function (i edited it so that you could focus in on the error) has another piece that defines a file to which you can output stepwise trace. I just got rid of them
EDIT 2 - session info
sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] tcltk stats graphics grDevices utils datasets methods base
other attached packages:
[1] sqldf_0.4-6.4 RSQLite.extfuns_0.0.1 RSQLite_0.11.3 chron_2.3-43
[5] gsubfn_0.6-5 proto_0.3-10 DBI_0.2-6 ggplot2_0.9.3.1
[9] caret_5.15-61 reshape2_1.2.2 lattice_0.20-6 foreach_1.4.0
[13] cluster_1.14.2 plyr_1.8
loaded via a namespace (and not attached):
[1] codetools_0.2-8 colorspace_1.2-1 dichromat_2.0-0 digest_0.6.2 grid_2.15.1
[6] gtable_0.1.2 iterators_1.0.6 labeling_0.1 MASS_7.3-18 munsell_0.4
[11] RColorBrewer_1.0-5 scales_0.2.3 stringr_0.6.2 tools_2.15
EDIT 3 - this performs all the same operations as the function, just without using a function. This will run fine every time, even when the algorithm doesn't converge:
modeling.formula <- a~.
dataset <- test.df
outfile <- NULL
if (is.null(outfile) == FALSE){
sink(file = outfile,
append = TRUE, type = "output")
print("")
print("Models run at:")
print(Sys.time())
}
model.initial <- glm(modeling.formula,
family = binomial,
data = dataset)
model.stepwise1 <- step(model.initial, direction = "backward")
model.stepwise2 <- step(model.stepwise1, scope = ~.^2)
output <- list(modInitial = model.initial, modStep1 = model.stepwise1, modStep2 = model.stepwise2)
Using do.call to refer to the data set in the calling environment works for me. See https://stackoverflow.com/a/7668846/210673 for the original suggestion. Here's a version that works (with sink code removed).
stepModel2 <- function(modeling.formula, dataset) {
model.initial <- do.call("glm", list(modeling.formula,
family = "binomial",
data = as.name(dataset)))
model.stepwise1 <- step(model.initial, direction = "backward")
model.stepwise2 <- step(model.stepwise1, scope = ~.^2)
list(modInitial = model.initial, modStep1 = model.stepwise1, modStep2 = model.stepwise2)
}
blah <- stepModel2(a~., dataset = "test.df")
It fails for me consistently with set.seed(6) with the original code. The reason it fails is that the dataset variable is not present within the step function, and although it's not needed in making model.stepwise1, it is needed for model.stepwise2 when model.stepwise1 keeps a linear term. So that's the case when your version fails. Calling the dataset from the global environment as I do here fixes this issue.

Resources