Requesting your help or expert opinion on a parallelization issue I am facing.
I regularly run an XGBoost classifier model on a rather large dataset (dim(train_data) = 357,401 x 281; dims after recipe prep() are 147,304 x 1,159) for a multiclass prediction. In base R the model runs in just over 4 hours using registerDoParallel() (all 24 cores of my server). I am now trying to run it in the tidymodels environment; however, I have yet to find a robust parallelization option for tuning the grid.
I attempted the following parallelization options within tidymodels. All of them seem to work on a smaller subsample (e.g. 20% of the data), but options 1-4 fail when I run the entire dataset, mostly due to memory allocation issues.
1. makePSOCKcluster(), library(doParallel)
2. registerDoFuture(), library(doFuture)
3. doMC::registerDoMC()
4. plan(cluster, workers), doFuture, parallel
5. registerDoParallel(), library(doParallel)
6. future::plan(multisession), library(furrr)
Option 5 (doParallel) has worked with 100% of the data in the tidymodels environment; however, it takes 4-6 hours to tune the grid.
I would draw your attention to option 6 (future/furrr), which appeared to be the most efficient of all the methods I tried. This method, however, worked only once (the successful code is included below; please note I have incorporated a racing method and a stopping grid into the tuning).
doParallel::registerDoParallel(cores = 24)
library(furrr)
future::plan(multisession, gc = TRUE)

tic()
race_rs <- future_map_dfr(
  tune_race_anova(
    xgb_earlystop_wf,
    resamples = cv_folds,
    metrics = xgb_metrics,
    grid = stopping_grid,
    control = control_race(
      verbose = TRUE,
      verbose_elim = TRUE,
      allow_par = TRUE,
      parallel_over = "everything"
    )
  ),
  .progress = TRUE,
  .options = furrr_options(packages = "parsnip")
)
toc()
Interestingly, after one success all subsequent attempts have failed. I always get the same error (below). Each time, the tuning progresses through all CV folds (n = 5) and runs until the racing method has eliminated all but one parameter combination, yet it eventually fails with the error below!
Error in future_map(.x = .x, .f = .f, ..., .options = .options, .env_globals = .env_globals, :
argument ".f" is missing, with no default
The OS & Version details I use are as follows:
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux
Matrix products: default
BLAS/LAPACK: /usr/lib64/libopenblasp-r0.3.3.so
I am intrigued by how furrr/future option worked once, but failed in all attempts since.
I have also tried using the development version of tune.
Any help or advice on parallelization options will be greatly appreciated.
Thanks
Rj
Apparently, in tidymodels the parallelization happens internally during tuning, and there is no need to use furrr/future to do manual parallel computation. Moreover, the code above is syntactically incorrect: future_map_dfr() is given the result of tune_race_anova() as its .x argument and no .f function at all, which is why the error complains that argument ".f" is missing. For a more detailed explanation of why this is, please see the post by mattwarkentin on the RStudio Community forum.
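For illustration, a minimal sketch of letting tune handle the parallel work itself, reusing the workflow, folds, metrics, and grid objects from the question (a sketch, not a verified drop-in fix):

```r
# Register a parallel backend once; tune_race_anova() will pick it up
# internally -- no furrr/future wrapper is needed.
library(doParallel)
library(finetune)

registerDoParallel(cores = 24)

race_rs <- tune_race_anova(
  xgb_earlystop_wf,
  resamples = cv_folds,
  metrics   = xgb_metrics,
  grid      = stopping_grid,
  control   = control_race(
    verbose       = TRUE,
    verbose_elim  = TRUE,
    allow_par     = TRUE,
    parallel_over = "everything"
  )
)
```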
Related
I'm new here. I've been struggling with analysing some data with the BaSTA package. The data works OK after running the "DataCheck" code, but right after running the following code this happens:
multiout <- multibasta(object = datosJ, studyStart = 1999, studyEnd = 2018,
                       model = "LO", shape = "simple",
                       niter = 20001, burnin = 2001, thinning = 100,
                       parallel = TRUE)
No problems were detected with the data.
Starting simulation to find jump sd's... done.
Multiple simulations started...
Error in setDefaultClusterOptions(type = .sfOption$type) :
  could not find function "setDefaultClusterOptions"
I believe this error has something to do with the use of parallel = TRUE, which relies on the snow package that comes bundled with BaSTA and makes the analysis run faster. If I don't use parallel, the analysis takes weeks to run, and I've been told that's not normal for the package I'm using.
Any help would be very helpful, thank you.
I came across this same behavior when using another R package that depends on snowfall. setDefaultClusterOptions lives in a dependency of BaSTA, so this error message appears because the package is not being loaded. Try calling library(snowfall) before running the BaSTA command to see if that fixes it for you.
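If it helps, the suggestion above amounts to something like this (datosJ and the multibasta() arguments are taken from the question):

```r
# Load snowfall explicitly so setDefaultClusterOptions() is on the
# search path before BaSTA's parallel code needs it.
library(snowfall)
library(BaSTA)

multiout <- multibasta(object = datosJ, studyStart = 1999, studyEnd = 2018,
                       model = "LO", shape = "simple",
                       niter = 20001, burnin = 2001, thinning = 100,
                       parallel = TRUE)
```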
I use the maxnet function (maxnet package) as one of the model algorithms in an ensemble model. Sometimes the code executes without an error; other times it gives me the error message you see below. I am working on Windows 10 Pro (R version 3.6.1, RStudio version 1.2.5042).
Code:
dm.Maxent <- maxnet(p = train$species, data = train[-train$species],
                    maxnet.formula(p = train$species,
                                   data = train[-train$species],
                                   classes = "default"))
Error:
Error in intI(j, n = x@Dim[2], dn[[2]], give.dn = FALSE) :
  index larger than maximal 185
train is a dataframe with 621 rows (one row for every occurrence/absence point), and 29 columns (28 columns containing variables and 1 column "species" that indicates presence or absence of the species (0/1)).
I am having the same issue. It is unpredictable, since for several species it ran fine, then out of a sudden it stopped.
I found a response at this link: https://github.com/jamiemkass/ENMeval/issues/62
In the new version of maxnet (check the GitHub repo, as it looks like the CRAN version has not been updated yet), there is a new argument, addsamplestobackground. When set to TRUE, it solves some of these errors. Currently you will have to use install_github to reinstall maxnet to get this argument. Once you do, use install_github to get the dev-branch version of ENMeval (v2), which will implement this by default. Hopefully that fixes these problems.
I reinstalled maxnet from GitHub:
install.packages("remotes")
remotes::install_github("mrmaxent/maxnet")
and set addsamplestobackground = TRUE. Maybe this will help you.
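Putting it together, the question's call with the new argument might look like the sketch below (addsamplestobackground is only available in the GitHub version of maxnet at the time of writing; selecting the predictor columns by name is my assumption about the intent of `train[-train$species]`):

```r
# After reinstalling maxnet from GitHub:
# remotes::install_github("mrmaxent/maxnet")
library(maxnet)

preds <- train[, names(train) != "species"]  # predictor columns only
dm.Maxent <- maxnet(p = train$species, data = preds,
                    f = maxnet.formula(p = train$species, data = preds,
                                       classes = "default"),
                    addsamplestobackground = TRUE)
```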
I am using the xgboost library in R. My model seems to run fine with the default objective reg:squarederror.
This runs fine within my code, e.g.
model_regression = map2(.x = dtrain_regression, .y = nrounds, ~xgboost(.x, nrounds = .y, objective = "reg:squarederror"))
Reading the docs, there is another potential objective listed, reg:squaredlogerror. I wanted to experiment with this objective:
model_regression = map2(.x = dtrain_regression, .y = nrounds, ~xgboost(.x, nrounds = .y, objective = "reg:squaredlogerror"))
However, when I run with this variation I get an error message that this objective is unknown.
Is it possible to use the objective reg:squaredlogerror within xgboost in r?
You want the latest xgboost. Install it with install_github; see the instructions here.
(Don't expect CRAN to have the latest version of a package, especially one under very active development like xgboost; it will lag by a release cycle. Generally the latest development build will be on GitHub.)
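As a sketch, after installing a newer xgboost per the linked instructions, you can verify the version and retry (object names are taken from the question; the version cutoff is approximate):

```r
library(xgboost)
library(purrr)

packageVersion("xgboost")  # reg:squaredlogerror needs a recent xgboost (roughly 0.90+)

# Same map2() pattern as in the question, with the newer objective.
model_regression <- map2(
  .x = dtrain_regression, .y = nrounds,
  ~ xgboost(.x, nrounds = .y, objective = "reg:squaredlogerror")
)
```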
Try using reg:linear as objective, it will work :)
I am running into problems when applying recursive feature selection to nnet models with caret::rfe; I get the following error message:
Error in { : task 1 failed - "undefined columns selected"
The actual task is more complex than the following example, but I am confident that this is a similar problem:
library(caret)
rfe(x = iris[, 1:3],
    y = iris[, 4] / max(iris[, 4]),
    sizes = c(2),
    method = "nnet",
    rfeControl = rfeControl(functions = caretFuncs))
I know this error can occur when trying to select more features than there are available in x (e.g. see https://stats.stackexchange.com/questions/18362/odd-error-with-caret-function-rfe), but this does not seem to be the problem here. I also ran very similar calls in earlier versions of caret, without this problem occurring.
I use R 3.3.1 and caret 6.0.71.
Thank you very much for your help.
EDIT: I went through the archived versions of caret and found that the example code is working in caret versions <= 6.0.62.
I went through the archived versions of caret and found that the example code is working in caret versions <= 6.0.62. This also solves the problems my original code had. I reported this issue on the caret github.
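If you need the workaround before a fix lands, rolling back to an archived caret release can be sketched as follows (the version string follows CRAN's archive naming and is an assumption):

```r
# Install a specific archived release of caret from the CRAN archive.
install.packages("remotes")
remotes::install_version("caret", version = "6.0-62",
                         repos = "https://cloud.r-project.org")
```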
EDIT: The problem is now fixed : https://github.com/topepo/caret/issues/485
I have run a long computation in WinBUGS (a million iterations) using the R2WinBUGS package from within R:
bugs.object <- bugs(...)
but R crashed. How do I reload the bugs.object into R without running WinBUGS again? I tried this (I have 3 chains):
out <- read.bugs(paste("coda", 1:3, ".txt", sep = ""))
but the out data structure is completely different from the bugs object (as it is, it is unusable). I tried to convert it with as.bugs.array:
bugs.object <- as.bugs.array(out, model.file = "ttest.txt", n.iter = 1000000, n.burnin = 300000, n.thin = 2, program = "WinBUGS")
but it doesn't work. Please help. Thanks.
It is likely that you are seeing an error message because R ran out of memory while creating the bugs object.
You can get around this problem by setting codaPkg = TRUE in the bugs() call. This saves the CODA files in your specified working directory rather than creating the R2WinBUGS object (before R crashes). Then you can read the CODA files back in using read.coda from the coda package and, if you really want, convert the mcmc object to a bugs object with as.bugs.array.
This might not work if your MCMC is too big or you do not have enough memory for R.
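A sketch of the workflow described above, reusing the placeholder bugs(...) call and the CODA file names from the question (the '...' stands for your existing arguments, as in the original post):

```r
library(R2WinBUGS)

# Write CODA files to disk instead of building the bugs object in memory.
bugs.sim <- bugs(..., codaPkg = TRUE)

# After a crash, read the chains back without re-running WinBUGS:
out <- read.bugs(paste("coda", 1:3, ".txt", sep = ""))
```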