I fail to run caret's nnet regression in R

I tried to run a regression using caret's nnet method, but I got an error.
library(tidyverse)
library(caret)
feature = rnorm(100, 0, 1) %>% as.matrix()
colnames(feature) = "x1"
outcome = rnorm(100, 0, 1) %>% as.matrix()
colnames(outcome) = "y1"
model = caret::train(
x = feature, y = outcome, method = "nnet",
tuneGrid = expand.grid(size=c(1:3), decay=seq(0.1, 1, 0.1)),
weights = NULL, linout = TRUE
)
Error: Metric RMSE not applicable for classification models
Of course, I want to regress, not classify, and to make that explicit I set the option linout = TRUE. What went wrong?
I also followed this question and tried removing as.matrix, but that raises a different error.
library(tidyverse)
library(caret)
feature = rnorm(100, 0, 1) %>% as.double()
outcome = rnorm(100, 0, 1) %>% as.double()
CATE_model = caret::train(
x = feature, y = outcome, method = "nnet",
tuneGrid = expand.grid(size=c(1:3), decay=seq(0.1, 1, 0.1)),
weights = NULL, linout = TRUE
)
Error: Please use column names for 'x'
Thanks a lot.

The input to the neural network seems to require column names.
This makes sense, as the trained network may later need to identify the correct columns in new data for prediction.
If you put your features into a data frame with column names (in my case x), it works.
library(tidyverse)
library(caret)
feature = data.frame(x = rnorm(100, 0, 1) %>% as.double()) # changed line
outcome = rnorm(100, 0, 1) %>% as.double()
CATE_model = caret::train(
x = feature, y = outcome, method = "nnet",
tuneGrid = expand.grid(size=c(1:3), decay=seq(0.1, 1, 0.1)),
weights = NULL, linout = TRUE
)
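Because the model is now trained on a feature column named x, new data passed to predict() has to carry the same column name. A minimal sketch (new_x is made up here purely for illustration):
new_x <- data.frame(x = rnorm(5))     # same column name as used in training
predict(CATE_model, newdata = new_x)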

Related

Error using glmulti with furrr::future_map

I am getting an error when I try to run glmulti on different datasets in parallel using furrr::future_map. It works when future_map is run sequentially.
#Load packages
library(furrr)
library(future)
library(tidyverse)
#This doesn't work
plan(multisession, workers = 2 ) #Set number of parallel sessions (can't do multi core on Windows)
mods <- list(tibble(exposure = rnorm(100, 0:100), outcome = rbinom(100, 1, 0.5)),
tibble(exposure = rnorm(100, 0:100), outcome = rbinom(100, 1, 0.5))) %>%
future_map(~glmulti(
y = "outcome",
xr = "exposure",
data = .x,
level = 1,
method = "g", #Genetic algorithm
fitfunction = "glm",
family = binomial,
confsetsize = 2, #Maximum number of possible models, so it doesn't run indefinitely
plotty = F, report = F #To simplify the outputs
))
Here is the error message this gives:
Error in get(as.character(FUN), mode = "function", envir = envir) : object 'aic' of mode 'function' was not found
It runs fine when done sequentially:
#This works
plan(multiprocess, workers = 1 ) #1 worker, so normal map behaviour.
mods <- list(tibble(exposure = rnorm(1000, 0:1000), outcome = rbinom(1000, 1, 0.5)),
tibble(exposure = rnorm(1000, 0:1000), outcome = rbinom(1000, 1, 0.5))) %>%
future_map(~glmulti(
y = "outcome",
xr = "exposure",
data = .x,
level = 1,
method = "g", #Genetic algorithm
fitfunction = "glm",
family = binomial,
confsetsize = 2,
plotty = F, report = F
))
Is there any way to fix this? Or is it just a problem with one of the two packages, and if so, is the issue more likely in furrr or in glmulti?
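One thing that may be worth trying (an assumption about the cause, not a confirmed fix): the error suggests the worker sessions cannot find glmulti's internal 'aic' helper, so explicitly attaching glmulti on every worker via furrr's options might help. A sketch, where datasets stands in for the list of tibbles from the question:
library(tidyverse)
library(furrr)
library(future)
library(glmulti)
plan(multisession, workers = 2)
mods <- datasets %>%
  future_map(~glmulti(
    y = "outcome",
    xr = "exposure",
    data = .x,
    level = 1,
    method = "g",
    fitfunction = "glm",
    family = binomial,
    confsetsize = 2,
    plotty = F, report = F
  ),
  .options = furrr_options(packages = "glmulti"))  # attach glmulti in each worker session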

get fitted values from tidymodel implementation of glmnet

I am performing elastic net linear regression in tidymodels using the glmnet engine.
If I were to run this directly in glmnet I could do something like this:
cv_fit <- cv.glmnet(
y = response_vec,
x = predictor_matrix,
nfolds = 10,
alpha = 0.95,
type.measure = "mse",
keep = TRUE)
I can then get the fitted values like this:
fitted_y <- cv_fit$fit.preval
However, I cannot work out how to get the fitted values/residuals for the glmnet model fitted using parsnip. Any help appreciated.
What I was looking for is the control argument. save_pred = TRUE ensures that fitted values are stored within the returned object:
tuning_mod <- wf %>%
  tune::tune_grid(
    resamples = rsample::vfold_cv(data = my_data, v = 10, repeats = 3),
    grid = dials::grid_regular(x = dials::penalty(), levels = 200),
    metrics = yardstick::metric_set(yardstick::rmse, yardstick::rsq),
    control = tune::control_grid(save_pred = TRUE)
  )
tune::collect_predictions(tuning_mod)
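From there, a hedged sketch of pulling out the fitted values for the best penalty and computing residuals (outcome is a placeholder for whatever the response column is actually called in my_data):
best_penalty <- tune::select_best(tuning_mod, metric = "rmse")
fitted_vals <- tune::collect_predictions(tuning_mod, parameters = best_penalty)
fitted_vals %>%
  dplyr::mutate(residual = outcome - .pred)   # .pred holds the out-of-sample predictions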

Xgboost Hyperparameter Tuning In R for binary classification

I am new to R and am trying to do hyperparameter tuning for xgboost (binary classification), but I am getting an error. I would appreciate it if someone could help me.
Error in as.matrix(cv.res)[, 3] : subscript out of bounds In addition: Warning message: 'early.stop.round' is deprecated. Use 'early_stopping_rounds' instead. See help("Deprecated") and help("xgboost-deprecated").
Please find the code snippet below.
I would also appreciate it if someone could suggest an alternative to this approach in R.
X_Train <- as(X_train, "dgCMatrix")
GS_LogLoss = data.frame("Rounds" = numeric(),
                        "Depth" = numeric(),
                        "r_sample" = numeric(),
                        "c_sample" = numeric(),
                        "minLogLoss" = numeric(),
                        "best_round" = numeric())
for (rounds in seq(50, 100, 25)) {
  for (depth in c(4, 6, 8, 10)) {
    for (r_sample in c(0.5, 0.75, 1)) {
      for (c_sample in c(0.4, 0.6, 0.8, 1)) {
        for (imb_scale_pos_weight in c(5, 10, 15, 20, 25)) {
          for (wt_gamma in c(5, 7, 10)) {
            for (wt_max_delta_step in c(5, 7, 10)) {
              for (wt_min_child_weight in c(5, 7, 10, 15)) {
                set.seed(1024)
                eta_val = 2 / rounds
                cv.res = xgb.cv(data = X_Train, nfold = 2, label = y_train,
                                nrounds = rounds,
                                eta = eta_val,
                                max_depth = depth,
                                subsample = r_sample,
                                colsample_bytree = c_sample,
                                early.stop.round = 0.5 * rounds,
                                scale_pos_weight = imb_scale_pos_weight,
                                max_delta_step = wt_max_delta_step,
                                gamma = wt_gamma,
                                objective = 'binary:logistic',
                                eval_metric = 'auc',
                                verbose = FALSE)
                print(paste(rounds, depth, r_sample, c_sample, min(as.matrix(cv.res)[, 3])))
                GS_LogLoss[nrow(GS_LogLoss) + 1, ] = c(rounds,
                                                       depth,
                                                       r_sample,
                                                       c_sample,
                                                       min(as.matrix(cv.res)[, 3]),
                                                       which.min(as.matrix(cv.res)[, 3]))
              }
            }
          }
        }
      }
    }
  }
}
To do your hyperparameter selection, you could use the metapackage tidymodels, especially the packages parsnip, rsample, yardstick, and tune.
A workflow like this would work:
library(tidyverse)
library(tidymodels)
# Specify the model and the parameters to tune (parsnip)
model <-
boost_tree(tree_depth = tune(), mtry = tune()) %>%
set_mode("classification") %>%
set_engine("xgboost")
# Specify the resampling method (rsample)
splits <- vfold_cv(X_train, v = 2)
# Specify the metrics to optimize (yardstick)
metrics <- metric_set(roc_auc)
# Specify the parameters grid (or use dials to automate the grid search; see the sketch after this answer)
grid <- expand_grid(tree_depth = c(4, 6, 8, 10),
mtry = c(2, 10, 50)) # You can add others
# Run each model (tune)
tuned <- tune_grid(model, Y ~ .,
                   resamples = splits,
                   grid = grid,
                   metrics = metrics,
                   control = control_grid(verbose = TRUE))
# Check results
show_best(tuned)
autoplot(tuned)
select_best(tuned)
# Update model
tuned_model <-
model %>%
finalize_model(select_best(tuned)) %>%
fit(Y ~ ., data = X_train)
# Make prediction
predict(tuned_model, X_train)
predict(tuned_model, X_test)
Please note that the argument names used in the model specification differ from the original names in xgboost, because parsnip is a unified interface with consistent names across several models; see the parsnip documentation for the mapping.
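As mentioned in the grid comment above, dials can also build the grid for you. A hedged sketch under the same assumptions (X_train holding the predictors plus the Y column): mtry() has a data-dependent upper bound, so it has to be finalized against the predictors first.
grid <- dials::grid_regular(
  dials::tree_depth(),
  dials::finalize(dials::mtry(), dplyr::select(X_train, -Y)),  # upper bound for mtry depends on the data
  levels = 3
)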

Set class weights in Keras of R when there are multiple outputs

I'm using the keras package in R to fit a neural network model. The model I'm working on has two outputs: output1 is continuous (for regression) and output2 is binary (for classification).
Since we have a very imbalanced dataset for the classification problem (output2), I want to assign class weights to deal with the imbalance, but apparently we don't need to do that for output1 (the regression).
Here is the sample code for the NN model that I'm working on:
input <- layer_input(shape = c(32,24))
output <- input %>%
layer_lstm(units = 64, dropout = 0.2, recurrent_dropout = 0.2)
pred1 <- output %>%
layer_dense(units = 1, name = "output1")
pred2 <- output %>%
layer_dense(units = 1, activation = "sigmoid", name = "output2")
model <- keras_model(
input,
list(pred1, pred2)
)
summary(model)
model %>% compile(
optimizer = "rmsprop",
loss = list(
output1 = "mse",
output2 = "binary_crossentropy"
),
loss_weights = list(
output1 = 0.25,
output2 = 10
)
)
history <- model %>% fit(
train_x, list(output1 = train_y1,output2 = train_y2),
epochs = 10,
batch_size = 5000,
class_weight = ???,
validation_data = list(valid_x, list(output1 = valid_y1,output2 = valid_y2))
)
If we just have one binary output, I know that the class weights can be assigned by:
class_weight = list("0"=1,"1"=100),
but it doesn't work anymore when we have two outputs and just want to assign the weights to one of them. I guess I may need to somehow specify the name of the binary output in "class_weight" so that it knows the weights only apply to output2, but I don't know how to do it in R.
Does anyone know how to assign class weights to the binary output only, when we have two outputs (one regression, one classification)? Thank you very much for the help!
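Not from the original post, and not verified against every keras version: one possible workaround is to translate the class weights for output2 into per-sample weights and pass them through fit()'s sample_weight argument as a named list, leaving output1 unweighted. A sketch under those assumptions:
w2 <- ifelse(train_y2 == 1, 100, 1)   # up-weight the rare class for the binary output
w1 <- rep(1, length(train_y1))        # leave the regression output unweighted
history <- model %>% fit(
  train_x, list(output1 = train_y1, output2 = train_y2),
  epochs = 10,
  batch_size = 5000,
  sample_weight = list(output1 = w1, output2 = w2),
  validation_data = list(valid_x, list(output1 = valid_y1, output2 = valid_y2))
)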

Running h2o Grid search on R

I am running an h2o grid search in R. The model is a GLM using a gamma distribution.
I have defined the grid using the following settings.
hyper_parameters = list(alpha = c(0, .5), missing_values_handling = c("Skip", "MeanImputation"))
h2o.grid(algorithm = "glm", # Setting algorithm type
grid_id = "grid.s", # Id so retrieving information on iterations will be easier later
x = predictors, # Setting predictive features
y = response, # Setting target variable
training_frame = data, # Setting training set
validation_frame = validate, # Setting validation frame
hyper_params = hyper_parameters, # Setting the hyperparameter grid
remove_collinear_columns = T, # Parameter to remove collinear columns
lambda_search = T, # Setting parameter to find optimal lambda value
seed = 1234, # Setting seed to ensure reproducible results
keep_cross_validation_predictions = F, # Whether to keep cross-validation predictions
compute_p_values = F, # Calculating p-values of the coefficients
family = 'gamma', # Distribution type used
standardize = T, # Standardizing continuous variables
nfolds = 2, # Number of cross-validations
fold_assignment = "Modulo", # Specifying fold assignment type to use for cross validations
link = "log")
When I run the above script, I get the following error:
Error in hyper_names[[index2]] : subscript out of bounds
Please can you help me find where the error is?
As discussed in the comments, it is difficult to tell what the cause of the error is without sample data and code. The out-of-bounds error suggests the code is trying to access a value that does not exist in the input, so it could come from either of the inputs to h2o.grid(). I would check the columns and rows in the train and validation data sets. The hyperparameters from the question run fine with family = "binomial".
The code below runs fine with glm(). I have made several assumptions: (1) family = "binomial" instead of family = "gamma", based on the sample data created; (2) the response y is binary; (3) an arbitrary train/test split ratio; (4) the predictors are limited to three independent variables (x1, x2, x3); (5) one binary response variable (y).
Import libraries
library(h2o)
library(h2oEnsemble)
Create sample data
x1 <- abs(100*rnorm(100))
x2 <- 10+abs(100*rnorm(100))
x3 <- 100+abs(100*rnorm(100))
#y <- rnorm(100)
y <- floor(runif(100,0,1.5))
df <- data.frame(x1, x2, x3,y)
df$y <- ifelse(df$y==1, 'yes', 'no')
df$y <- as.factor(df$y)
head(df)
Initialize h2o
h2o.init()
Prepare data in required h2o format
df <- as.h2o(df)
y <- "y"
x <- setdiff( names(df), y )
df<- df[ df$y %in% c("no", "yes"), ]
h2o.setLevels(df$y, c("no","yes") )
# Split data into train and validate sets
data <- h2o.splitFrame( df, ratios = c(.6, 0.15) )
names(data) <- c('train', 'valid', 'test')
data$train
Set parameters
grid_id <- 'glm_grid'
hyper_parameters <- list( alpha = c(0, .5, 1),
lambda = c(1, 0.5, 0.1, 0.01),
missing_values_handling = c("Skip", "MeanImputation"),
tweedie_variance_power = c(0, 1, 1.1,1.8,1.9,2,2.1,2.5,2.6,3, 5, 7),
seed = 1234
)
Fit h2o.grid()
h2o.grid(
algorithm = "glm",
#grid_id = grid_id,
hyper_params = hyper_parameters,
training_frame = data$train,
validation_frame = data$valid,
x = x,
y = y,
lambda_search = TRUE,
remove_collinear_columns = T,
keep_cross_validation_predictions = F,
compute_p_values = F,
standardize = T,
nfolds = 2,
fold_assignment = "Modulo",
family = "binomial"
)
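Not part of the original answer: once the grid has finished, the fitted models can be retrieved with h2o.getGrid(), assuming the grid_id = grid_id line above is uncommented so the grid is registered under "glm_grid":
glm_grid <- h2o.getGrid(grid_id = "glm_grid", sort_by = "auc", decreasing = TRUE)
glm_grid                                             # models ranked by validation AUC
best_glm <- h2o.getModel(glm_grid@model_ids[[1]])    # best model from the grid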