I am using the xgboost library in R. My model runs fine with the default objective, reg:squarederror, e.g.:
model_regression = map2(.x = dtrain_regression, .y = nrounds, ~xgboost(.x, nrounds = .y, objective = "reg:squarederror"))
Reading the docs, there is another potential objective listed, reg:squaredlogerror. I wanted to experiment with this objective:
model_regression = map2(.x = dtrain_regression, .y = nrounds, ~xgboost(.x, nrounds = .y, objective = "reg:squaredlogerror"))
However, when I run with this variation I get an error message that this objective is unknown.
Is it possible to use the objective reg:squaredlogerror with xgboost in R?
You want the latest xgboost. Install it with install_github; see the instructions here.
(Don't expect CRAN to have the latest version of a package, especially one under very active development, as xgboost is; CRAN will lag by a release cycle. The latest development build is generally on GitHub.)
Try using reg:linear as objective, it will work :)
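For anyone comparing the two objectives, the difference lies in the per-example loss. The sketch below is a plain Python illustration rather than xgboost code, and the function names are made up for this example. It contrasts squared error with squared log error, using the 1/2 * [log(pred + 1) - log(label + 1)]^2 form given in the xgboost parameter documentation, and it assumes predictions and labels are greater than -1.

```python
import math


def squared_error(pred, label):
    """Per-example squared error (hypothetical helper, not an xgboost function)."""
    return (pred - label) ** 2


def squared_log_error(pred, label):
    """Per-example squared log error: 1/2 * (log(pred + 1) - log(label + 1))^2."""
    if pred <= -1 or label <= -1:
        raise ValueError("squared log error needs pred and label greater than -1")
    return 0.5 * (math.log1p(pred) - math.log1p(label)) ** 2


# The same relative miss (prediction is double the label) at two scales.
small = (2.0, 1.0)
large = (2000.0, 1000.0)

# Squared error grows with the absolute size of the miss.
print(squared_error(*small), squared_error(*large))

# Squared log error depends mostly on the ratio, so both values stay small.
print(squared_log_error(*small), squared_log_error(*large))
```

This is why the log-error objective tends to be tried when the target spans several orders of magnitude: large targets no longer dominate the loss.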
Requesting your help or expert opinion on a parallelization issue I am facing.
I regularly run an XGBoost classifier on a rather large dataset for a multiclass prediction: dim(train_data) is 357,401 x 281, and the dimensions after recipe prep() are 147,304 x 1159. In base R the model runs in just over 4 hours using registerDoParallel() with all 24 cores of my server. I am now trying to run it in the tidymodels environment, but I have yet to find a robust parallelization option for tuning the grid.
I attempted the following parallelization options within tidymodels. All of them seem to work on a smaller subsample (e.g. 20% of the data), but options 1-4 fail when I run the entire dataset, mostly due to memory allocation issues.
1. makePSOCKcluster(), library(doParallel)
2. registerDoFuture(), library(doFuture)
3. doMC::registerDoMC()
4. plan(cluster, workers), doFuture, parallel
5. registerDoParallel(), library(doParallel)
6. future::plan(multisession), library(furrr)
Option 5 (doParallel) has worked with 100% of the data in the tidymodels environment; however, it takes 4-6 hours to tune the grid.
I would like to draw your attention to option 6 (future/furrr), which appeared to be the most efficient of all the methods I tried. However, this method worked only once (the successful code is included below; note that I have incorporated a racing method and a stopping grid into the tuning).
doParallel::registerDoParallel(cores = 24)
library(furrr)
future::plan(multisession, gc = T)
tic()
race_rs <- future_map_dfr(
  tune_race_anova(
    xgb_earlystop_wf,
    resamples = cv_folds,
    metrics = xgb_metrics,
    grid = stopping_grid,
    control = control_race(
      verbose = TRUE,
      verbose_elim = TRUE,
      allow_par = TRUE,
      parallel_over = 'everything'
    )
  ),
  .progress = T,
  .options = furrr_options(packages = "parsnip"),
)
toc()
Interestingly, after that one success, all subsequent attempts have failed, always with the same error (below). Each time, the tuning progresses through all CV folds (n = 5) and runs until the racing method has eliminated all but one parameter combination, and only then does it fail:
Error in future_map(.x = .x, .f = .f, ..., .options = .options, .env_globals = .env_globals, :
argument ".f" is missing, with no default
The OS & Version details I use are as follows:
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux
Matrix products: default
BLAS/LAPACK: /usr/lib64/libopenblasp-r0.3.3.so
I am intrigued by how furrr/future option worked once, but failed in all attempts since.
I have also tried using the development version of tune.
Any help or advice on parallelization options will be greatly appreciated.
Thanks
Rj
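Apparently, in tidymodels code the parallelization happens internally, so there is no need to use furrr/future for manual parallel computation. Moreover, the code above looks syntactically incorrect: future_map_dfr() expects a data argument .x and a function .f, but the call passes the result of tune_race_anova() as .x and supplies no .f, which is consistent with the error appearing only after tuning has finished. For a more detailed explanation, please see this post by mattwarkentin on the RStudio Community forum.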
I'm new here. I've been struggling to analyse some data with the BaSTA package. The data passes the "Datacheck" step without problems, but right after running the following code this happens:
multiout <- multibasta(object = datosJ, studyStart = 1999, studyEnd = 2018, model = "LO",
shape = "simple", niter = 20001, burnin = 2001, thinning = 100,
parallel = TRUE)
No problems were detected with the data.
Starting simulation to find jump sd's... done.
Multiple simulations started...
**Error in setDefaultClusterOptions(type = .sfOption$type) :
could not find function "setDefaultClusterOptions"**
I believe this error has something to do with using "parallel = TRUE", an option that relies on the snow package bundled with BaSTA and makes the analysis run faster. Without parallel the analysis takes weeks to run, and I've been told that's not normal for the package I'm using.
Any help would be much appreciated, thank you.
I came across the same behavior when using another R package that depends on snowfall. setDefaultClusterOptions is housed within a dependency of BaSTA, so this error message appears because the required packages are not being loaded. Try calling library(snowfall) before running the BaSTA command to see if that fixes it for you.
I use code like the following to create an AutoMLConfig object and submit an experiment for classification training:
import logging

from azureml.core import Experiment, Workspace
from azureml.train.automl import AutoMLConfig

# compute_target, train_data and label are defined elsewhere
automl_settings = {
    "n_cross_validations": 2,
    "primary_metric": 'accuracy',
    "enable_early_stopping": True,
    "experiment_timeout_hours": 1.0,
    "max_concurrent_iterations": 4,
    "verbosity": logging.INFO,
}
automl_config = AutoMLConfig(task='classification',
                             compute_target=compute_target,
                             training_data=train_data,
                             label_column_name=label,
                             **automl_settings
                             )
ws = Workspace.from_config()
experiment = Experiment(ws, "your-experiment-name")
run = experiment.submit(automl_config, show_output=True)
I want to include my conda yml file (like below) in my experiment submission.
env = Environment.from_conda_specification(name='myenv', file_path='conda_dependencies.yml')
However, I don't see any environment parameter in the AutoMLConfig class documentation (similar to the environment parameter of ScriptRunConfig), nor can I find any example of how to do this.
I notice that after the experiment is submitted, I get a message like this:
Running on remote.
No run_configuration provided, running on aml-compute with default configuration
Is run_configuration used for specifying environment? If so, how do I provide run_configuration in my AutoML experiment run?
Thank you.
I figured out how to fix the issues associated with the SDK 1.19.0 upgrade in the AML environment I use, so there is no need for the workaround I was considering (i.e. passing an SDK 1.18.0 conda environment file to the AutoML experiment run). My original question no longer needs an answer; I just want to add this note in case someone else has the same question later on.
I still don't know why an AutoML experiment run has no option to pass in a conda environment file. It would be nice if the AML documentation gave a reason.
I am crossposting this from https://stats.stackexchange.com/questions/488201/alternatives-to-or-fix-for-lmerconveniencefunctions-for-use-with-lme4 as it was suggested that people might have more relevant knowledge here.
Recently I updated from R version 3.6 to version 4.0 for my analyses, and noticed that LMERConvenienceFunctions stopped working. Specifically, I use it in conjunction with lme4.
Whenever I try to use bfFixefLMER_F.fnc (backfitting of fixed effects in LMER models) or pamer.fnc (compute upper- and lower-bound p-values for the analysis of variance or deviance) on an LMER model fitted with lme4, regardless of dataset, I am met with the error "Error in pf(anova.table[term, "F value"], anova.table[term, "Df"], nrow(model@frame) - : Non-numeric argument to mathematical function". I have tried this on two separate computers with the same result. As far as I can tell, LMERConvenienceFunctions hasn't been updated since 2015, so I'm not holding out hope that a fix is forthcoming.
I tried reverting to R 3.6.2, but found the same error using the versions of lme4 that were out shortly before R 4.0 came out. I have finally found the previous version I was using, so this will (hopefully) fix it for my current analysis, but it doesn't help if I want to use the most recent versions of R and lme4 going forward.
Other functions of LMERConvenienceFunctions (namely fitLMER.fnc and mcp.fnc) seem to be working properly, so it doesn't seem to be a systematic issue, but it is definitely one that significantly impedes my work.
Does anyone have any suggestions for alternative packages, or could anyone offer any advice on editing the LMERConvenienceFunctions package so that I can get the broken functions working again? I don't have any experience with changing the coding within packages, so would be starting from bare basics there.
I am also aware that there are some workarounds through adding extra code in my R script, as I did find in my searching for an answer that it had previously been a problem with the same package in 2014 (https://stat.ethz.ch/pipermail/r-sig-mixed-models/2014q2/022264.html), but I am not familiar with writing that kind of code, so would appreciate any guidance there as well.
I managed to get some help from the lovely people working with LME4 on github; for anyone who has this issue in the future, there are a few things of note:
I have emailed the listed maintainer of LMERConvenienceFunctions asking for them to update the CRAN version so that it abides by the rules of CRAN. Hopefully he does this.
For those using lme4 and LMERConvenienceFunctions, but who are NOT also using lmerTest: the anova table headers "Chi Df" and "Df" were recently renamed to "Df" and "npar" respectively (reasons outlined here: https://github.com/lme4/lme4/issues/528). This is an issue for LMERConvenienceFunctions because pamer.fnc looks up the "Df" column, so it needs to be updated to look up "npar" instead. Further, bfFixefLMER_F.fnc calls pamer.fnc, so it is fixed once pamer.fnc is updated. For anyone unsure how to do this, I found the code using getAnywhere() and modified it, so just copy the code below and paste it once near the start of your script:
pamer.fnc <- function (model, ndigits = 4)
{
    if (length(rownames(anova(model))) == 0) {
        cat("nothing to evaluate: model has only an intercept.\n\n")
        cat("printing model fixed effects:\n")
        fixef(model)
    }
    else {
        dims <- NULL
        rank.X = qr(model@pp$X)$rank
        anova.table = anova(model)
        anova.table = cbind(anova.table, upper.den.df = nrow(model@frame) -
            rank.X)
        p.values.upper = as.numeric()
        p.values.lower = as.numeric()
        for (term in row.names(anova.table)) {
            p.values.upper = c(p.values.upper, round(1 - pf(anova.table[term,
                "F value"], anova.table[term, "npar"],
                nrow(model@frame) - rank.X), ndigits))
            model.ranef <- ranef(model)
            lower.bound <- 0
            for (i in 1:length(names(model.ranef))) {
                dims <- dim(model.ranef[[i]])
                lower.bound <- lower.bound + dims[1] * dims[2]
            }
            p.values.lower = c(p.values.lower, 1 - pf(anova.table[term,
                "F value"], anova.table[term, "npar"],
                nrow(model@frame) - rank.X - lower.bound))
        }
        dv <- gsub(" ", "", gsub("(.*)~.*",
            "\\1", as.character(model@call)[2]))
        ss.tot <- sum((model@frame[, dv] - mean(model@frame[,
            dv]))^2)
        aov.table <- as.data.frame(anova(model))
        expl.dev <- vector("numeric")
        for (i in rownames(aov.table)) {
            expl.dev <- c(expl.dev, aov.table[i, 2]/ss.tot)
        }
        names(expl.dev) <- rownames(aov.table)
        anova.table = round(cbind(anova.table, upper.p.val = p.values.upper,
            lower.den.df = nrow(model@frame) - rank.X - lower.bound,
            lower.p.val = p.values.lower, `expl.dev.(%)` = expl.dev *
                100), ndigits)
        return(anova.table)
    }
}
(You may also need to run the function script for bfFixefLMER_F.fnc separately again, to let R know that bfFixefLMER_F.fnc should be calling from the updated version of pamer.fnc)
For those using lme4 and LMERConvenienceFunctions, but who ARE also using lmerTest (version 3.0 onwards): you will need to either 1) use NumDF instead of npar, or 2) replace "anova.table = anova(model)" with "anova.table = anova(model, ddf = "lme4")". This appears to be because the anova.lmerModLmerTest method (added in lmerTest 3.0) takes over the call to anova(); its ddf options list the Type III Satterthwaite analysis before lme4, so Satterthwaite is used by default when ddf is not specified.
I am looking into using MXNet LSTM modelling for time-series analysis for a problem I am currently working on.
As a way of understanding how to implement this, I am following the example code given by MXNet at this link: https://mxnet.incubator.apache.org/tutorials/r/MultidimLstm.html
After downloading the necessary data locally, I am able to run the script fine until I get to the following section, which trains the model:
## train the network
system.time(model <- mx.model.buckets(symbol = symbol,
                                      train.data = train.data,
                                      eval.data = eval.data,
                                      num.round = 100,
                                      ctx = ctx,
                                      verbose = TRUE,
                                      metric = mx.metric.mse.seq,
                                      initializer = initializer,
                                      optimizer = optimizer,
                                      batch.end.callback = NULL,
                                      epoch.end.callback = epoch.end.callback))
When I run this section, the following error occurs once the connection to the API has been established:
Error in mx.nd.internal.as.array(nd) :
[14:22:53] c:\jenkins\workspace\mxnet\mxnet\src\operator\./rnn-inl.h:359:
Check failed: param_.p == 0 (0.2 vs. 0) Dropout is not supported at the moment.
Is there currently a problem within the MXNet R package that prevents this code from running? I can't imagine they would provide a tutorial example for the package that is not executable.
My other thought is that it is something to do with my local device execution and connection to the API. I haven't been able to find any information about this being a problem for other users though.
Any inputs or suggestions would be greatly appreciated thanks.
It looks like you're running an old version of the MXNet R package. I think following the instructions on this page to build a recent R package should resolve the issue.