Error occurring in caret when running on a cluster


I am running the train function in caret on a cluster via doRedis. For the most part it works, but every so often I get errors like these at the very end of the run:
error calling combine function:
<simpleError: obj$state$numResults <= obj$state$numValues is not TRUE>
and
Error in names(resamples) <- gsub("^\\.", "", names(resamples)) :
attempt to set an attribute on NULL
when I run traceback() I get:
5: nominalTrainWorkflow(dat = trainData, info = trainInfo, method = method,
       ppOpts = preProcess, ctrl = trControl, lev = classLevels,
       ...)
4: train.default(x, y, weights = w, ...)
3: train(x, y, weights = w, ...)
2: train.formula(couple ~ ., training.balanced, method = "nnet",
       preProcess = "range", tuneGrid = nnetGrid, MaxNWts = 2200)
1: caret::train(couple ~ ., training.balanced, method = "nnet",
       preProcess = "range", tuneGrid = nnetGrid, MaxNWts = 2200)
These errors are not easily reproducible (i.e. they happen sometimes, but not consistently) and only occur at the end of the run. The stdout on the cluster shows all tasks running and completed, so I am a bit flummoxed.
Has anyone encountered these errors? If so, do you understand the cause or, even better, know of a fix?

I imagine you've already solved this problem, but I ran into the same issue on my cluster consisting of Linux and Windows systems. I was running the server on Ubuntu 14.04 and had noticed warnings about 'transparent huge pages' being enabled in the Linux kernel when starting the server service. I ignored that message and began running training exercises where most of the machines were maxed out with workers. I received the same error at the end of the run:
error calling combine function:
<simpleError: obj$state$numResults <= obj$state$numValues is not TRUE>
After a lot of head scratching and useless tinkering, I decided to address that warning by following the instructions here: http://ubuntuforums.org/showthread.php?t=2255151
Essentially, I installed hugeadm using:
sudo apt-get install hugeadm
Then disabled the transparent pages using:
hugeadm --thp-never
Note that this change will be undone on restart of the computer.
When I re-ran my training process it ran without any errors.
Hope that helps.
Cheers,
Eric
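To check whether transparent huge pages are still enabled before a long run, the kernel's setting can be read from R. This is only a minimal sketch: it assumes a Linux host that exposes the standard sysfs file /sys/kernel/mm/transparent_hugepage/enabled, and the helper names thp_mode and current_thp_mode are made up for illustration.

```r
# Parse the contents of the sysfs file, e.g. "always madvise [never]".
# The active mode is the one shown in square brackets.
thp_mode <- function(setting) {
  m <- regmatches(setting, regexpr("\\[[a-z]+\\]", setting))
  if (length(m) == 0) return(NA_character_)
  gsub("\\[|\\]", "", m)
}

# Hypothetical helper: read the live setting on a Linux host.
# Returns NA if the file does not exist (e.g. on Windows workers).
current_thp_mode <- function(path = "/sys/kernel/mm/transparent_hugepage/enabled") {
  if (!file.exists(path)) return(NA_character_)
  thp_mode(readLines(path, n = 1, warn = FALSE))
}

if (!identical(current_thp_mode(), "never")) {
  message("Transparent huge pages are not set to 'never' (or could not be read).")
}
```

Since the hugeadm change is lost on reboot, a check like this could run on the server host before each training run.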

Related

Why does running the tutorial example fail with Rstan?

I installed RStan successfully and the library loads. I then tried to run a tutorial example (https://github.com/stan-dev/rstan/wiki/RStan-Getting-Started#example-1-eight-schools). After calling the stan function
fit1 <- stan(
  file = "schools.stan",  # Stan program
  data = schools_data,    # named list of data
  chains = 4,             # number of Markov chains
  warmup = 1000,          # number of warmup iterations per chain
  iter = 2000,            # total number of iterations per chain
  cores = 1,              # number of cores (could use one per chain)
  refresh = 0             # no progress shown
)
I get the following error:
Error in compileCode(f, code, language = language, verbose = verbose) :
C:\rtools42\x86_64-w64-mingw32.static.posix\bin/ld.exe: file1c34165764a4.o:file1c34165764a4.cpp:(.text$_ZN3tbb8internal26task_scheduler_observer_v3D0Ev[_ZN3tbb8internal26task_scheduler_observer_v3D0Ev]+0x1d): undefined reference to `tbb::internal::task_scheduler_observer_v3::observe(bool)'
C:\rtools42\x86_64-w64-mingw32.static.posix\bin/ld.exe: file1c34165764a4.o:file1c34165764a4.cpp:(.text$_ZN3tbb10interface623task_scheduler_observerD1Ev[_ZN3tbb10interface623task_scheduler_observerD1Ev]+0x1d): undefined reference to `tbb::internal::task_scheduler_observer_v3::observe(bool)'
C:\rtools42\x86_64-w64-mingw32.static.posix\bin/ld.exe: file1c34165764a4.o:file1c34165764a4.cpp:(.text$_ZN3tbb10interface623task_scheduler_observerD1Ev[_ZN3tbb10interface623task_scheduler_observerD1Ev]+0x3a): undefined reference to `tbb::internal::task_scheduler_observer_v3::observe(bool)'
C:\rtools42\x86_64-w64-mingw32.static.posix\bin/ld.exe: file1c34165764a4.o:file1c34165764a4.cpp:(.text$_ZN3tbb10interface
Error in sink(type = "output") : invalid connection
Simply running example(stan_model, run.dontrun=T) gives the same error.
What does this error mean?
Is Rtools wrongly installed? My PATH seems to contain the correct folder C:\\rtools42\\x86_64-w64-mingw32.static.posix\\bin;. Is something wrong with the version of my RStan package? I am struggling to interpret this error.
What should I try to solve it?

Apparently Rtools42 is incompatible with the current version of RStan on CRAN. The solution that worked for me is found here: https://github.com/stan-dev/rstan/wiki/Configuring-C---Toolchain-for-Windows#r-42

Using "snow" parallel operations in bootstrap_parameters/model on merMod object (R)

I've been using bootstrap_parameters (parameters package in R) on generalised linear mixed models produced using glmmTMB. These work fine without parallel processing (parallel = "no") and also work fine on my old and slow Mac using parallel = "multicore". I'm now working on a new PC (Windows OS), so I need to use parallel = "snow"; however, I get the following error:
system.time(b <- bootstrap_parameters(m1, iterations = 10, parallel = "snow", n_cpus = 6))
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 0, 1
In addition: Warning message:
In lme4::bootMer(model, boot_function, nsim = iterations, verbose = FALSE, :
some bootstrap runs failed (10/10)
Timing stopped at: 0.89 0.3 7.11
If I select n_cpus = 1, the function works, and if I feed bootstrap_parameters or bootstrap_model an lm object (where the underlying code uses boot::boot) it also works fine. I have narrowed the problem down to bootMer (lme4). I suspect the dataset exported using clusterExport is landing in an environment different from the one where the clustered bootMer function is looking. The following is a reproducible example:
library(glmmTMB)
library(parameters)
library(parallel)
library(lme4)
m1 <- glmmTMB(count ~ mined + (1|site), zi=~mined,
              family=poisson, data=Salamanders)
summary(m1)
cl <- makeCluster(6)
clusterEvalQ(cl, library("lme4"))
clusterExport(cl, varlist = c("Salamanders"))
system.time(b <- bootstrap_parameters(m1, iterations = 10, parallel = "snow", n_cpus = 6))
stopCluster(cl)
Any ideas on solving this problem?

You need to clusterEvalQ(cl, library("glmmTMB")). From https://github.com/glmmTMB/glmmTMB/issues/843:
This issue is more or less resolved by a documentation patch (we need to explicitly clusterEvalQ(cl, library("glmmTMB"))). The only question is whether we can make this any easier for users. There are two problems here: (1) when the user sets up their own cluster rather than leaving it to be done in bootMer, more explicit clusterEvalQ/clusterExport stuff is necessary in any case; (2) bootMer internally does parallel::clusterExport(cl, varlist=getNamespaceExports("lme4")) if it is setting up the cluster (not if the cluster is set up and passed to bootMer by the user), but we wouldn't expect it to extend the same courtesy to glmmTMB ...
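The underlying point is that each "snow" worker starts as a fresh R session, so a package is only available there once it has been loaded on that worker. The sketch below illustrates this with base R's parallel package only. The helper names are hypothetical, and "stats4" (which ships with R) stands in for glmmTMB so the example runs without extra installs.

```r
library(parallel)

# Worker-side functions, defined at top level so that only the function
# itself (not a local environment) is shipped to the workers.
attach_pkg  <- function(p) library(p, character.only = TRUE)
is_attached <- function(p) paste0("package:", p) %in% search()

# Hypothetical helper: start a PSOCK ("snow"-style) cluster and attach
# the given packages on every worker.
make_cluster_with <- function(n_workers, packages) {
  cl <- makeCluster(n_workers)
  for (pkg in packages) clusterCall(cl, attach_pkg, pkg)
  cl
}

# Report, per worker, whether a package is on the search path.
attached_on_workers <- function(cl, pkg) {
  unlist(clusterCall(cl, is_attached, pkg))
}

cl <- make_cluster_with(2, "stats4")       # stand-in for "glmmTMB"
print(attached_on_workers(cl, "stats4"))   # loaded explicitly above
print(attached_on_workers(cl, "splines"))  # never loaded on the workers
stopCluster(cl)
```

For the example in the question, the analogous setup would be make_cluster_with(6, c("lme4", "glmmTMB")) followed by clusterExport(cl, "Salamanders"). The cluster then has to reach bootMer through its cl argument; whether bootstrap_parameters forwards a user-created cluster is an assumption to check against the installed version of parameters.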

Goodness of fit test 'ncores' argument produces error if included or if left to default

I (a novice) am trying to run a McKenzie-Bailey goodness of fit test on occupancy models using package AICcmodavg. No matter what I do I get an error regarding argument ncores which specifies how many of my computational cores to use in running the bootstraps. The package info says that if left blank it will default to 1 less than available cores. My computer has 4 cores. If I specify any number of cores (I've tried 0-4) I get an error that the argument is unused. If I do not specify ncores I get an error that it is missing. Code following, any suggestions appreciated :)
mb.gof.test(CamMod1, nsim = 1000, ncores = 3)
Error in statistic(object, ...) : unused argument (ncores = ncores)
mb.gof.test(CamMod1, nsim = 1000)
Error in parboot(mod, statistic = function(i) mb.chisq(i)$chi.square, : argument "ncores" is missing, with no default

H2O: Deep learning object not found in function 'predict' for argument 'model'

I'm just testing out h2o, in particular its deep learning capabilities, since I've heard great things about it. So far I've been using the following code:
library(h2o)
library(caret)
data("iris")
# Initiate H2O --------------------
h2o.removeAll() # Clean up. Just in case H2O was already running
h2o.init(nthreads = -1, max_mem_size="22G") # Start an H2O cluster with all threads available
# Get training and tournament data -------------------
a <- createDataPartition(iris$Species, list=FALSE)
training <- iris[a,]
test <- iris[-a,]
# Convert target to factor -------------------
target <- as.factor(iris$Species)
feature_names <- names(train)[1:(ncol(train)-1)]
train_h2o <- as.h2o(train)
test_h2o <- as.h2o(test)
prob <- test[, "id", drop = FALSE]
model_dl <- h2o.deeplearning(x = feature_names, y = "target", training_frame = train_h2o, stopping_metric = "logloss")
h2o.logloss(model_dl)
pred_dl <- predict(model_dl, newdata = tourn_h2o)
prob <- cbind(prob, as.data.frame(pred_dl$p1, col.names = "dl"))
write.table(prob[, c("id", "dl")], paste0(model_dl@model_id, ".csv"), sep = ",", row.names = FALSE, col.names = c("id", "probability"))
The relevant part is really that last line, where I got the following error:
Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = page, :
ERROR MESSAGE:
Object 'DeepLearning_model_R_1494350691427_70' not found in function: predict for argument: model
Has anyone come across this before? Are there any easy solutions to this that I might be missing? Thanks in advance.
EDIT: With the updated code I get the error:
Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = page, :
ERROR MESSAGE:
Illegal argument(s) for DeepLearning model: DeepLearning_model_R_1494428751150_1. Details: ERRR on field: _train: Training data must have at least 2 features (incl. response).
ERRR on field: _stopping_metric: Stopping metric cannot be logloss for regression.
I assume this has to do with the way the Iris dataset is being read in.

Answer To First Question: Your original error message sounds like one you can get when things get out of sync. E.g. maybe you had two sessions running at once and removed the model in one session; the other session wouldn't know its variables are now out of date. H2O allows multiple connections, but they have to be co-operative. (Flow - see next paragraph - counts as a second session.)
Unless you can make a reproducible example, shrug and put it down to gremlins, and start a new session. Or, go and look at the data/models in Flow (a web server always running on 127.0.0.1:54321 ), and see if something is no longer there.
For your EDIT question, H2O is building a regression model, but you asked for logloss as the stopping metric, so you evidently thought you were doing classification. This is caused by not having set the target variable to be a factor. Your current as.factor() line is on the wrong data, in the wrong place. It should go after your as.h2o() lines:
train_h2o <- as.h2o(training) #Typo fix
test_h2o <- as.h2o(test)
feature_names <- names(training)[1:(ncol(training)-1)] #typo fix
y = "Species" #The column we want to predict
train_h2o[,y] <- as.factor(train_h2o[,y])
test_h2o[,y] <- as.factor(test_h2o[,y])
And then make the model with:
model_dl <- h2o.deeplearning(x = feature_names, y = y, training_frame = train_h2o, stopping_metric = "logloss")
Get predictions with:
pred_dl <- predict(model_dl, newdata = test_h2o) #Typo fix
And compare the predictions with the correct answers using:
cbind(test[, y], as.data.frame(pred_dl$predict))
(BTW, H2O always detects the Iris data set columns as numeric vs. factor perfectly, so the above as.factor() lines are not needed; your error message must've been on your original data.)
StackOverflow advice: test your reproducible example, in full, and copy and paste in that exact code, with the exact error message that code is giving you. Your code had numerous small typos. E.g. train in places, training in others. createDataPartition() was not given; I assumed a = sample(nrow(iris), 0.8*nrow(iris)). test has no "id" column.
Other H2O advice:
Run h2o.removeAll() after h2o.init(). It was giving you an error message if run before. (Personally I avoid that function - it is the kind of thing that gets left in a production script by mistake...)
Consider importing your data into h2o earlier, and using h2o.splitFrame() to split it. I.e. avoid doing things in R that H2O can easily handle.
Avoid having your data in R, at all, if you can. Prefer importFile() over as.h2o().
The thinking behind both of the last points is that H2O will scale beyond the memory of one machine, while R won't. It is also less confusing than trying to keep track of the same thing in two places.

I had the same issue but could resolve it quite easily.
My error occurred because I read in an h2o object before initialising the h2o cluster. That is, I trained an h2o model, saved it, shut down the cluster, loaded the model back in, and only then initialised the cluster again.
Before reading in the h2o-object, you should already initialize the cluster (h2o.init()).

how to debug errors like: "dim(x) must have a positive length" with caret

I'm running a predict over a fit similar to what is found in the caret guide:
Caret Measuring Performance
predictions <- predict(caretfit, testing, type = "prob")
But I get the error:
Error in apply(x, 1, paste, collapse = ",") :
dim(X) must have a positive length
I would like to know 1) the general way to diagnose these errors that are the result of bad inputs into functions like this or 2) why my code is failing.
1)
Looking at the error, it has something to do with 'X'. Which argument is X? Obviously the first one in 'apply', but which argument in predict is eventually passed to apply? Looking at traceback():
10: stop("dim(X) must have a positive length")
9: apply(x, 1, paste, collapse = ",")
8: paste(apply(x, 1, paste, collapse = ","), collapse = "\n")
7: makeDataFile(x = newdata, y = NULL)
6: predict.C5.0(modelFit, newdata, type = "prob")
5: predict(modelFit, newdata, type = "prob") at C5.0.R#59
4: method$prob(modelFit = modelFit, newdata = newdata, submodels = param)
3: probFunction(method = object$modelInfo, modelFit = object$finalModel,
       newdata = newdata, preProc = object$preProcess)
2: predict.train(caretfit, testing, type = "prob")
1: predict(caretfit, testing, type = "prob")
Now, this problem would be easy to solve if I could follow the code through and understand the problem as opposed to these general errors. I can trace the code using this traceback to the code at C5.0.R#59. (It looks like there's no way to get line numbers on every trace?) I can follow this code as far as this line 59 and then (I think) the predict function on line 44:
Github Caret C5.0 source
But after this I'm not sure where the logic flows. I don't see 'makeDataFile' anywhere in the caret source or, if it's in another package, how it got there. I've also tried Rstudio debugging, debug() and browser(). None provide the stacktrace I would expect from other languages. Any suggestion on how to follow the code when you don't know what an error msg means?
2) As for my particular inputs, 'caretfit' is simply the result of a caret fit and the testing data is 3million rows by 59 columns:
fitcontrol <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 1,
                           classProbs = TRUE,
                           summaryFunction = custom.summary,
                           allowParallel = TRUE)
fml <- as.formula(paste("OUTVAR ~", paste(colnames(training[, 1:(ncol(training) - 2)]), collapse = "+")))
caretfit <- train(fml,
                  data = training[1:200000, ],
                  method = "C5.0",
                  trControl = fitcontrol,
                  verbose = FALSE,
                  na.action = na.pass)

1 Debugging Procedure
You can pinpoint the problem using a couple of functions.
Although there still doesn't seem to be any way to get a full stacktrace with line numbers in code (Boo!), you can take the function names you do get from the traceback and use getAnywhere() to find the function you are looking for. So for example, you can do:
getAnywhere(makeDataFile)
to see the location and source. (This also works well on Windows, where libraries are often bundled up as binaries.) Then you have to use the source or GitHub to find the specific line numbers or to trace through the logic of the code.
In my particular problem if I run:
newdata <- testing
caseString <- C50:::makeDataFile(x = newdata, y = NULL)
(Note the three ":".) I can see that this step completes at this level, so it appears as if something is happening to my dataset along the way.
So, using getAnywhere() and GitHub over and over through my traceback, I can find the line numbers manually (Boo!):
in caret/R/predict.train.R, predict.train (defined on line 108) calls probFunction on line 153
in caret/R/probFunction, probFunction (defined on line 3) calls method$prob, a function stored in the fit object as caretfit$modelInfo$prob, which can be inspected by entering this into the console. This is the same function found in caret/models/files/C5.0.R on line 58, which calls 'predict' on line 59
something in caret knows to use C50/R/predict.C5.0.R, which you can see by searching with getAnywhere()
this function runs makeDataFile on line 25 (part of the C50 package)
which calls paste, which calls apply, which dies with stop
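Part of the manual walk above can be automated by recording the call stack at the moment the error is signalled, using base R's withCallingHandlers() and sys.calls(). The following is only a sketch: with_stack, inner and outer_fun are hypothetical names, and the toy functions merely reproduce the same apply() error rather than caret's actual code path.

```r
# Run an expression; on error, return the condition together with the
# call stack as it was when the error was signalled.
with_stack <- function(expr) {
  stack <- NULL
  result <- tryCatch(
    withCallingHandlers(
      expr,
      error = function(e) {
        # The calling handler runs before the stack unwinds,
        # so sys.calls() still contains the failing frames.
        stack <<- sys.calls()
      }
    ),
    error = function(e) e
  )
  list(result = result, stack = stack)
}

# Toy stand-ins (hypothetical) for the nested calls in the traceback.
inner     <- function(x) apply(x, 1, paste, collapse = ",")
outer_fun <- function(x) inner(x)

out <- with_stack(outer_fun(1:3))  # a plain vector has no dim()
print(conditionMessage(out$result))
# First line of each recorded call, outermost first.
print(vapply(as.list(out$stack), function(call) deparse(call)[1], character(1)))
```

Each recorded call can then be looked up with getAnywhere() as described above. For interactive sessions, options(error = recover) offers a related approach: it lets you pick a frame from the failing stack and inspect its variables.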
2 Particular Problem with caret's predict
As for my problem, I kept inspecting the code, and adding inputs at different levels and it would complete successfully. What happens is that some modification happens to my dataset in predict.train.R which causes it to fail. Well it turns out that I wasn't including my 'na.action' argument, which for my tree-based data, used 'na.pass'. If I include this argument:
prediction <- predict(caretfit, testing, type = "prob", na.action = na.pass)
it works as expected. Line 126 of predict.train uses this argument to decide whether to include incomplete cases in the prediction. My data has no complete cases, so nothing was left to predict on, and it failed complaining that it needed a matrix of some positive length.
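A quick pre-flight check would have exposed this: count how many rows of the new data are complete before calling predict. Here is a small sketch in base R, assuming (as described above) that incomplete rows are dropped unless na.action = na.pass is given. The data frame toy and the helper n_complete are made up for illustration.

```r
# Count rows that have no missing values in any column.
n_complete <- function(newdata) sum(complete.cases(newdata))

# Hypothetical toy data in which every row has at least one NA,
# mimicking a test set with no complete cases.
toy <- data.frame(a = c(1, NA, 3), b = c(NA, 2, NA))

print(n_complete(toy))      # 0: no row is complete
print(nrow(na.omit(toy)))   # 0: dropping incomplete rows leaves nothing
print(nrow(na.pass(toy)))   # 3: na.pass keeps every row
```

If n_complete(testing) is zero or much smaller than nrow(testing), the na.action argument is worth checking first.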
How anyone would know that this apply error is due to a missing na.action argument is not obvious at all, hence the need for a good debugging procedure. If anyone knows of other ways to debug (keeping in mind that on Windows, stepping through library source in RStudio doesn't work very well), please answer or comment.
