get fitted values from tidymodels implementation of glmnet - r

I am performing elastic net linear regression in tidymodels using the glmnet engine.
If I were to run this directly in glmnet I could do something like this:
cv_fit <- cv.glmnet(
  y = response_vec,
  x = predictor_matrix,
  nfolds = 10,
  alpha = 0.95,
  type.measure = "mse",
  keep = TRUE
)
I can then get the fitted values like this:
fitted_y <- cv_fit$fit.preval
However, I cannot find how to get fitted values / residuals for the glmnet model fitted using parsnip. Any help appreciated.

What I was looking for is the control argument: control_grid(save_pred = TRUE) ensures that the fitted values (out-of-fold predictions) are stored in the returned object:
tuning_mod <- wf %>%
  tune::tune_grid(
    resamples = rsample::vfold_cv(data = my_data, v = 10, repeats = 3),
    grid = dials::grid_regular(x = dials::penalty(), levels = 200),
    metrics = yardstick::metric_set(yardstick::rmse, yardstick::rsq),
    control = tune::control_grid(save_pred = TRUE)
  )
tune::collect_predictions(tuning_mod)
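With save_pred = TRUE in place, you can pull the out-of-fold fitted values for the selected penalty and compute residuals from them. A minimal sketch (assuming the outcome column in my_data is named response; substitute your own):

best_penalty <- tune::select_best(tuning_mod, metric = "rmse")

# summarize = TRUE averages the predictions over the 3 repeats per observation
fitted_vals <- tune::collect_predictions(tuning_mod,
                                         parameters = best_penalty,
                                         summarize = TRUE)
fitted_vals$residual <- fitted_vals$response - fitted_vals$.pred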

Related

How to get test data ROC plot from MLeval

I am trying to return the ROC curves for a test dataset using the MLeval package.
# Load data
train <- readRDS("Data/train.rds")
test <- readRDS("Data/test.rds")

# Create factor class
train$class <- ifelse(train$class == 1, 'yes', 'no')

# Set up control function for training
ctrl <- trainControl(method = "cv",
                     number = 5,
                     returnResamp = 'none',
                     summaryFunction = twoClassSummary,
                     classProbs = TRUE,
                     savePredictions = TRUE,
                     verboseIter = FALSE)

gbmGrid <- expand.grid(interaction.depth = 10,
                       n.trees = 18000,
                       shrinkage = 0.01,
                       n.minobsinnode = 4)

# Build using a gradient boosted machine
set.seed(5627)
gbm <- train(class ~ .,
             data = train,
             method = "gbm",
             metric = "ROC",
             tuneGrid = gbmGrid,
             verbose = FALSE,
             trControl = ctrl)

# Predict results
pred <- predict(gbm, newdata = test, type = "prob")[, "yes"]
roc <- evalm(data.frame(pred, test$class))
I have used the following post, ROC curve for the testing set using Caret package,
to try and plot the ROC from test data using MLeval and yet I get the following error message:
MLeval: Machine Learning Model Evaluation
Input: data frame of probabilities of observed labels
Error in names(x) <- value :
'names' attribute [3] must be the same length as the vector [2]
Can anyone please help? Thanks.
Please provide a reproducible example with sample data so we can replicate the error and test for solutions (i.e., we cannot access train.rds or test.rds).
Nevertheless, the below may fix your issue.
pred <- predict(gbm, newdata = test, type = "prob")
roc <- evalm(data.frame(pred, test$class))
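If the error persists, note that only train$class was recoded to 'yes'/'no' in the question; evalm also needs the test labels to match the levels the model was trained on. A small follow-up sketch, assuming test$class is coded 1/0 like the training data:

# Assumption: test$class is 1/0, like train$class before recoding
test$class <- ifelse(test$class == 1, 'yes', 'no')
pred <- predict(gbm, newdata = test, type = "prob")
roc <- evalm(data.frame(pred, test$class))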

Save Gradient Boosting Machine values obtained with Bootstrap

I am computing gradient boosting variable importance, and I am resampling to see how each variable's importance behaves.
But I can't correctly save each variable's name together with its importance as calculated in each bootstrap replicate.
I'm doing this with a function that is called from the boot package's boot() command.
Below is a minimal reproducible example adapted to the AmesHousing data:
library(gbm)
library(boot)
library(AmesHousing)

df <- make_ames()

imp_gbm <- function(data, indices) {
  d <- data[indices, ]
  gbm.fit <- gbm(
    formula = Sale_Price ~ .,
    distribution = "gaussian",
    data = d,
    n.trees = 100,
    interaction.depth = 5,
    shrinkage = 0.1,
    cv.folds = 5,
    n.cores = NULL,
    verbose = FALSE
  )
  return(summary(gbm.fit)[, 2])
}

results_GBM <- boot(data = df, statistic = imp_gbm, R = 100)
results_GBM$t0
I expect to save the bootstrap results with their variable names but I can only save the importance of variables without their names.
With summary.gbm, the default is to order the variables by importance, so you need to set order = FALSE (and plotit = FALSE to suppress the plot). The returned variable importances are then in the same order as the variables in the fit, and you can attach gbmfit$var.names to them:
imp_gbm <- function(data, indices) {
  d <- data[indices, ]
  # use gbmfit because gbm.fit is a function
  gbmfit <- gbm(
    formula = Sale_Price ~ .,
    distribution = "gaussian",
    data = d,
    n.trees = 100,
    interaction.depth = 5,
    shrinkage = 0.1,
    cv.folds = 5,
    n.cores = NULL,
    verbose = FALSE
  )
  o <- summary(gbmfit, plotit = FALSE, order = FALSE)[, 2]
  names(o) <- gbmfit$var.names
  return(o)
}
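The replicate matrix results_GBM$t drops those names, but results_GBM$t0 keeps the ones assigned inside imp_gbm, so you can copy them across. A short usage sketch:

results_GBM <- boot(data = df, statistic = imp_gbm, R = 100)

# $t0 retains the names set in imp_gbm; $t (one row per replicate) does not
colnames(results_GBM$t) <- names(results_GBM$t0)

# e.g. a bootstrap summary of each variable's importance
apply(results_GBM$t, 2, quantile, probs = c(0.025, 0.5, 0.975))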

L2 regularized MLR using caret, and how to make sure I am using best model while predicting

I am trying to do L2-regularized MLR on a data set using caret. Following is what I have done so far to achieve this:
r_squared <- function(pred, actual) {
  mean_actual <- mean(actual)
  ss_e <- sum((pred - actual)^2)
  ss_total <- sum((actual - mean_actual)^2)
  1 - (ss_e / ss_total)  # return R^2 explicitly rather than via assignment
}
library(caret)

df <- as.data.frame(matrix(rnorm(10000, 10, 3), 1000))
colnames(df)[1] <- "response"

set.seed(753)
inTraining <- createDataPartition(df[["response"]], p = 0.75, list = FALSE)
training <- df[inTraining, ]
testing <- df[-inTraining, ]
testing_response <- base::subset(testing, select = "response")

gridsearch_for_lambda <- data.frame(alpha = 0,
                                    lambda = c(2^(-15:15), 3^(-15:15)))
regression_formula <- as.formula(paste("response", "~", ".", sep = " "))

train_control <- trainControl(method = "cv", number = 10,
                              savePredictions = TRUE, allowParallel = FALSE)

model <- train(regression_formula,
               data = training,
               trControl = train_control,
               method = "glmnet",
               tuneGrid = gridsearch_for_lambda,
               preProcess = NULL)

prediction <- predict(model, newdata = testing)
testing_response[["predicted"]] <- prediction
r_sq <- round(r_squared(testing_response[["predicted"]],
                        testing_response[["response"]]), 3)
My concern is how to make sure that the model used for prediction is the best one (i.e., the one with the optimally tuned lambda value).
P.S.: The data is sampled from a random normal distribution, which does not give a good R^2 value, but I want to get the idea right.
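For reference, after resampling caret refits one final model on the full training set using the optimal tuning parameters, and predict.train uses that fit automatically, so the prediction above already comes from the best lambda. A quick check of what was selected:

model$bestTune   # the alpha/lambda combination with the best CV performance
plot(model)      # CV performance profile across the lambda grid
# predict(model, newdata = testing) uses the final model refit with bestTune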

R not valid variable name for caret function

I want to use caret's train function to investigate xgboost results.
# open file with train data
trainy <- read.csv('')
# open file with test data
test <- read.csv('')

##### Removing IDs (we don't need the ID column)
trainy$ID <- NULL
test.id <- test$ID
test$ID <- NULL

##### Extracting TARGET
trainy.y <- trainy$TARGET
trainy$TARGET <- NULL

# set up the cross-validated hyper-parameter search
xgb_grid_1 = expand.grid(
  nrounds = 1000,
  eta = c(0.01, 0.001, 0.0001),
  max_depth = c(2, 4, 6, 8, 10),
  gamma = 1
)

# pack the training control parameters
xgb_trcontrol_1 = trainControl(
  method = "cv",
  number = 5,
  verboseIter = TRUE,
  returnData = FALSE,
  returnResamp = "all",         # save losses across all models
  classProbs = TRUE,            # set to TRUE for AUC to be computed
  summaryFunction = twoClassSummary,
  allowParallel = TRUE
)

# train the model for each parameter combination in the grid,
# using CV to evaluate
xgb_train_1 = train(
  x = as.matrix(trainy),
  y = as.factor(trainy.y),
  trControl = xgb_trcontrol_1,
  tuneGrid = xgb_grid_1,
  method = "xgbTree"
)
I see this error:
Error in train.default(x = as.matrix(trainy), y = as.factor(trainy.y), trControl = xgb_trcontrol_1, :
At least one of the class levels is not a valid R variable name;
I have looked at other cases but still can't understand what I should change (R is quite different from Python for me, for now).
As far as I can see, I should do something with the y class variable, but what and how exactly? Why didn't the as.factor function work?
I solved this issue; hope it helps other novices.
I needed to transform all the data to factor type, like this:
trainy[] <- lapply(trainy, factor)
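For context on why as.factor alone didn't work: with classProbs = TRUE, caret uses the factor levels as column names for the class probabilities, so levels like "0" and "1" are rejected because they are not valid R variable names. A more targeted fix is to relabel just the outcome (a sketch, assuming TARGET is coded 0/1):

# Assumption: TARGET is coded 0/1; relabel to valid R names
trainy.y <- factor(trainy.y, levels = c(0, 1), labels = c("no", "yes"))
# or, for arbitrary levels, sanitize them generically:
# levels(trainy.y) <- make.names(levels(trainy.y))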

Why does caret's "parRF" lead to tuning and missing value errors not present with "rf"

I have a tidy dataset with no missing values and only numeric columns.
The dataset is both large and contains sensitive information, so I won't be able to provide a copy of it here, unfortunately.
I partition this data into training and testing sets with caret's createDataPartition:
idx <- createDataPartition(y = model_final$y, p = 0.6, list = FALSE)
training <- model_final[idx, ]
testing <- model_final[-idx, ]
x <- training[-ncol(training)]
y <- training$y
x1 <- testing[-ncol(testing)]
y1 <- testing$y
row.names(training) <- NULL
row.names(testing) <- NULL
row.names(x) <- NULL
row.names(y) <- NULL
row.names(x1) <- NULL
row.names(y1) <- NULL
I've been fitting and refitting Random Forest models on this data via randomForest on a regular basis:
rf <- randomForest(x = x, y = y, mtry = ncol(x), ntree = 1000,
                   corr.bias = TRUE, do.trace = TRUE, nPerm = 3)
I decided to see if I could get any better or faster results with train. The following model ran fine but took about 2 hours:
rf_train <- train(y = y, x = x,
                  method = 'rf', tuneLength = 3,
                  trControl = trainControl(method = 'cv', number = 10,
                                           classProbs = TRUE))
I need to take an HPC approach to make this logistically feasible, so I tried
require(doParallel)
registerDoParallel(cores = 8)
rf_train <- train(y = y, x = x,
                  method = 'parRF', tuneGrid = data.frame(mtry = 3),
                  na.action = na.omit,
                  trControl = trainControl(method = 'cv', number = 10,
                                           classProbs = TRUE, allowParallel = TRUE))
but regardless of whether I use tuneLength or tuneGrid, this leads to strange errors about missing values and tuning parameters:
Error in train.default(y = y, x = x, method = "parRF", tuneGrid = data.frame(mtry = 3), :
final tuning parameters could not be determined
In addition: Warning messages:
1: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
2: In train.default(y = y, x = x, method = "parRF", tuneGrid = data.frame(mtry = 3), :
missing values found in aggregated results
I say this is weird both because there were no errors with method = "rf" and because I triple-checked to ensure there are no missing values.
I even get the same errors when completely omitting the tuning options. I also tried toggling the na.action option on and off and changing "cv" to "repeatedcv".
I even get the same error with this ultra-simplified version:
rf_train <- train(y = y, x = x, method = 'parRF')
This seems to be caused by a bug in caret; see the answer to:
parRF on caret not working for more than one core
I just dealt with this same issue; manually loading foreach on each new cluster seems to work.
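A sketch of that workaround (assuming the 8 cores from the question): build the cluster yourself, load foreach on each worker, then register it before calling train.

library(doParallel)                  # also attaches parallel and foreach

cl <- makePSOCKcluster(8)            # one worker per core
clusterEvalQ(cl, library(foreach))   # manually load foreach on every worker
registerDoParallel(cl)

rf_train <- train(y = y, x = x, method = 'parRF')

stopCluster(cl)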
