How to predict the outcome of a regression holding a regressor constant in R?

Hi everyone, based on the Wage dataset (wage being the dependent variable) and on the workflow created below, I would like to find out the following:
1. What is the predicted wage of a person aged 30 under each piecewise model?
2. Considering the flexible pw6_wf_fit model configuration, and in particular its six breakpoints: above which (approximate) value of age is the association with wage strongest?
I tried several versions of extract_*() but so far I don't know how to apply them here. I'd be grateful for any comment.
The code I use is the following:
if (!require("pacman")) install.packages("pacman")
# load (or install if pacman cannot find an existing installation) the relevant packages
pacman::p_load(
tidyverse, tidymodels, ISLR, patchwork,
rpart, rpart.plot, randomForest, gbm, kernlab, parsnip, skimr
)
data(Wage, package = "ISLR")
Wage %>%
tibble::as_tibble() %>%
skimr::skim()
lin_rec <- recipe(wage ~ age, data = Wage)
# Specify as linear regression
lm_spec <-
linear_reg() %>%
set_mode("regression") %>%
set_engine("lm")
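# Helper: plot the observations with a model's mean prediction and confidence band across the observed age range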
plot_model <- function(wf_fit, data) {
predictions <-
tibble::tibble(age = seq(min(data$age), max(data$age))) %>%
dplyr::bind_cols(
predict(wf_fit, new_data = .),
predict(wf_fit, new_data = ., type = "conf_int")
)
p <- ggplot2::ggplot(aes(age, wage), data = data) +
geom_point(alpha = 0.05) +
geom_line(aes(y = .pred),
data = predictions, color = "darkgreen") +
geom_line(aes(y = .pred_lower),
data = predictions, linetype = "dashed", color = "blue") +
geom_line(aes(y = .pred_upper),
data = predictions, linetype = "dashed", color = "blue") +
scale_x_continuous(breaks = seq(20, 80, 5)) +
labs(title = substitute(wf_fit)) +
theme_classic()
return(p)
}
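# Piecewise-constant models: bin age into 3-6 equal-frequency groups, then regress wage on the bin indicators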
pw3_rec <- lin_rec %>% step_discretize(age, num_breaks = 3, min_unique = 5)
pw4_rec <- lin_rec %>% step_discretize(age, num_breaks = 4, min_unique = 5)
pw5_rec <- lin_rec %>% step_discretize(age, num_breaks = 5, min_unique = 5)
pw6_rec <- lin_rec %>% step_discretize(age, num_breaks = 6, min_unique = 5)
pw3_wf_fit <- workflow(pw3_rec, lm_spec) %>% fit(data = Wage)
pw4_wf_fit <- workflow(pw4_rec, lm_spec) %>% fit(data = Wage)
pw5_wf_fit <- workflow(pw5_rec, lm_spec) %>% fit(data = Wage)
pw6_wf_fit <- workflow(pw6_rec, lm_spec) %>% fit(data = Wage)
(plot_model(pw3_wf_fit, Wage) + plot_model(pw4_wf_fit, Wage)) /
(plot_model(pw5_wf_fit, Wage) + plot_model(pw6_wf_fit, Wage))

The answer to the first question is pretty straightforward:
map(list(pw3_wf_fit, pw4_wf_fit, pw5_wf_fit, pw6_wf_fit),
    ~ predict(.x, new_data = tibble(age = 30))) %>%
  bind_rows()
# # A tibble: 4 × 1
# .pred
# <dbl>
# 1 99.3
# 2 94.2
# 3 92.3
# 4 89.5
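For the second question, one approach (a sketch; it assumes the fitted workflows above, and that tidy() on a trained step_discretize() exposes the learned breaks) is to pull the breakpoints out of the prepped recipe and to compare the per-bin coefficients of the linear model, since the bin with the largest coefficient marks the age range most strongly associated with wage:
# breakpoints chosen by step_discretize() (recipe step number 1)
pw6_wf_fit %>%
  extract_recipe(estimated = TRUE) %>%
  tidy(number = 1)
# per-bin coefficients of the fitted lm; the intercept is the baseline bin
pw6_wf_fit %>%
  extract_fit_parsnip() %>%
  tidy() %>%
  arrange(desc(estimate))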

Related

How can I replicate plot.lda() with R `tidymodels`?

I would like to replicate the plot.lda() plot method using ggplot2 and tidymodels. Is there an elegant way to get this plot?
I think I can fake the augment() function, which does not have an lda method, by using predict() and binding the result onto the original data.
Here is an example with the base R and tidymodels code:
library(ISLR2)
library(MASS)
# First base R
train <- Smarket$Year < 2005
lda.fit <-
lda(
Direction ~ Lag1 + Lag2,
data = Smarket,
subset = train
)
plot(lda.fit)
# Next tidymodels
library(tidyverse)
library(tidymodels)
library(discrim)
lda_spec <- discrim_linear() %>%
set_mode("classification") %>%
set_engine("MASS")
the_rec <- recipe(
Direction ~ Lag1 + Lag2,
data = Smarket
)
the_workflow <- workflow() %>%
add_recipe(the_rec) %>%
add_model(lda_spec)
Smarket_train <- Smarket %>%
filter(Year != 2005)
the_workflow_fit_lda_fit <-
fit(the_workflow, data = Smarket_train) %>%
extract_fit_parsnip()
# now my attempt to do the plot
predictions <- predict(the_workflow_fit_lda_fit,
new_data = Smarket_train,
type = "raw"
)[[3]] %>%
as.vector()
bind_cols(Smarket_train, .fitted = predictions) %>%
ggplot(aes(x=.fitted)) +
geom_histogram(aes(y = stat(density)),binwidth = .5) +
scale_x_continuous(breaks = seq(-4, 4, by = 2))+
facet_grid(vars(Direction)) +
xlab("") +
ylab("Density")
There must be a better way to do this.... thoughts?
You can do this by combining extract_fit_*() with parsnip::repair_call(). The plot.lda() method uses the $call object in the LDA fit, which we need to adjust, since the call object produced through tidymodels differs from the one produced by calling lda() directly.
library(ISLR2)
library(MASS)
# First base R
train <- Smarket$Year < 2005
lda.fit <-
lda(
Direction ~ Lag1 + Lag2,
data = Smarket,
subset = train
)
# Next tidymodels
library(tidyverse)
library(tidymodels)
library(discrim)
lda_spec <- discrim_linear() %>%
set_mode("classification") %>%
set_engine("MASS")
the_rec <- recipe(
Direction ~ Lag1 + Lag2,
data = Smarket
)
the_workflow <- workflow() %>%
add_recipe(the_rec) %>%
add_model(lda_spec)
Smarket_train <- Smarket %>%
filter(Year != 2005)
the_workflow_fit_lda_fit <-
fit(the_workflow, data = Smarket_train)
After fitting both models, we can inspect the $call objects and we see that they are different.
lda.fit$call
#> lda(formula = Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
extract_fit_engine(the_workflow_fit_lda_fit)$call
#> lda(formula = ..y ~ ., data = data)
The parsnip::repair_call() function replaces data in the call with the data we pass in. Additionally, we rename the response column of the data to ..y to match the call.
the_workflow_fit_lda_fit %>%
extract_fit_parsnip() %>%
parsnip::repair_call(rename(Smarket_train, ..y = Direction)) %>%
extract_fit_engine() %>%
plot()
Created on 2021-11-12 by the reprex package (v2.0.1)
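As a quick check (a sketch reusing the objects above), printing the repaired call confirms that it now points at the renamed data:
the_workflow_fit_lda_fit %>%
  extract_fit_parsnip() %>%
  parsnip::repair_call(rename(Smarket_train, ..y = Direction)) %>%
  extract_fit_engine() %>%
  purrr::pluck("call")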

Caterpillar plot of posterior brms samples: Order factors in a ggdist plot (stat_slab)

I ran a Bayesian linear mixed model with brms and can plot the estimates nicely, but I can't figure out how to order the single-subject estimates by the mean of their posterior samples (so as to get a caterpillar plot). This is what I've done.
Toy data:
library(brms)
library(tidybayes)
library(tidyverse)
n = 20
n_condition = 6
ABC =
tibble(
condition = rep(c("A","B","C","D","E","F"), n),
response = rnorm(n * 6, c(0,1,2,1,-1,-2), 0.5),
treatment = rnorm(n * 6, c(0,1,2,1,-1,-2), 0.5),
subject = c(rep("X",(n_condition*n)/3),rep("Y",(n_condition*n)/3),rep("Z",(n_condition*n)/3))
)
Add a shift for some subjects:
ABC$response[ABC$subject == "X"] = 20 + ABC$response[ABC$subject == "X"]
ABC$response[ABC$subject == "Y"] = -20 + ABC$response[ABC$subject == "Y"]
Run the model:
m = brm(
response ~ treatment + (1|condition) + (1|subject),
data = ABC,
cores = 4, chains = 1,
iter = 500, warmup = 50
)
Plot:
m %>%
spread_draws(b_treatment, r_subject[subject,]) %>%
mutate(subject_estimate = b_treatment + r_subject) %>%
mutate(subject = reorder(subject, sort(subject_estimate))) %>%
ggplot(aes(y = subject, x = subject_estimate)) +
stat_slab()
This gives me a slab plot in which the subjects still appear in their default order (plot not shown).
The line mutate(subject = reorder(subject, sort(subject_estimate))) doesn't do anything. That might be fine, as I probably need to reorder based on the mean of the posteriors, but when I try mutate(subject_order = reorder(subject, sort(mean(subject_estimate)))) I get the error message:
Error: Problem with mutate() input subject_order.
x arguments must have same length
ℹ Input subject_order is reorder(subject, sort(mean(subject_estimate))).
ℹ The error occurred in group 1: subject = "X".
Any pointers welcome
Two points for consideration:
1. Ungroup the result from spread_draws(), otherwise you won't be able to reorder the levels of subject.
2. Use fct_reorder() from the forcats package (part of the tidyverse); it's designed for exactly this purpose.
m %>%
spread_draws(b_treatment, r_subject[subject,]) %>%
ungroup() %>%
mutate(subject_estimate = b_treatment + r_subject) %>%
mutate(subject = fct_reorder(subject, subject_estimate, mean)) %>%
ggplot(aes(y = subject, x = subject_estimate)) +
stat_slab()
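If you want the classic caterpillar look, points with intervals rather than full slabs, tidybayes also provides stat_pointinterval(); a minimal variation on the pipeline above:
m %>%
  spread_draws(b_treatment, r_subject[subject,]) %>%
  ungroup() %>%
  mutate(subject_estimate = b_treatment + r_subject) %>%
  mutate(subject = fct_reorder(subject, subject_estimate, mean)) %>%
  ggplot(aes(y = subject, x = subject_estimate)) +
  stat_pointinterval()  # median and credible intervals per subject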
Result (data generated with set.seed(123); plot not shown).

Prediction of single bagged tree models dependent on pre-processing using caret

I'm using the caret package to predict a time series with method = "treebag", which estimates bagged regression trees with 25 bootstrap replications.
What I'm struggling to understand is how the final prediction of that treebag model relates to the predictions made by each of the 25 individual trees, depending on whether or not I use caret::preProcess.
I am aware of this question and the linked resources therein (but could not draw the right conclusions from them).
Here is an example using the economics data. Let's say I want to predict unemploy_rate, which has to be created first.
# packages
library(caret)
library(tidyverse)
# data
data("economics")
economics$unemploy_rate <- economics$unemploy / economics$pop * 100
x <- economics[, -c(1, 7)]
y <- economics[["unemploy_rate"]]
I wrote a function that extracts the 25 individual trees from the train object, makes a prediction for each tree, averages these 25 predictions, and compares this average with the prediction from the train object. It returns a plot.
predict_from_treebag <- function(model) {
# extract 25 trees from train object
bagged_trees <- map(.x = model$finalModel$mtrees, .f = pluck, "btree")
# make a prediction for each tree
pred_trees <- map(bagged_trees, .f = predict, newdata = x)
names(pred_trees) <- paste0("tree_", seq_along(pred_trees))
# aggregate predictions
pred_trees <- as.data.frame(pred_trees) %>%
add_column(date = economics$date, .before = 1) %>%
gather(tree, value, matches("^tree")) %>%
group_by(date) %>%
mutate(mean_pred_from_trees = mean(value)) %>%
ungroup()
# add prediction from train object
pred_trees$bagging_model_prediction = predict(model, x)
pred_trees <- pred_trees %>%
gather(model, pred_value, 4:5)
# plot
p <- ggplot(data = pred_trees, aes(date)) +
geom_line(aes(y = value, group = tree), alpha = .2) +
geom_line(aes(y = pred_value, col = model)) +
theme_minimal() +
theme(
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position = "bottom"
)
p
}
Now I estimate two models, the first will be unscaled, the second will be centered and scaled.
preproc_opts <- list(unscaled = NULL,
scaled = c("center", "scale"))
# estimate the models
models <- map(preproc_opts, function(preproc)
train(
x = x,
y = y,
trControl = trainControl(method = "none"), # since there are no tuning parameters for this model
metric = "RMSE",
method = "treebag",
preProcess = preproc
))
# apply predict_from_treebag to each model
imap(.x = models,
.f = ~{predict_from_treebag(.x) + labs(title = .y)})
In the resulting plots (not shown here), the unscaled model's prediction is the average of the 25 trees, but why is the prediction from each of the 25 trees a constant when I use preProcess?
Thank you for any advice on where I might be wrong.
The problem is in this part of the code in the function predict_from_treebag:
pred_trees <- map(bagged_trees, .f = predict, newdata = x)
This predict() is in fact predict.rpart, as class(bagged_trees[[1]]) shows, and predict.rpart does not know that you pre-processed the data in caret: the individual trees were trained on centered and scaled predictors, so when they receive the raw x almost every observation falls on the same side of the splits, and each tree returns a (near-)constant prediction.
Here is a quick fix:
predict_from_treebag <- function(model) {
# extract 25 trees from train object
bagged_trees <- map(.x = model$finalModel$mtrees, .f = pluck, "btree")
x <- economics[, -c(1, 7)]
# make a prediction for each tree
newdata = if(is.null(model$preProcess)) x else predict(model$preProcess, x)
pred_trees <- map(bagged_trees, .f = predict, newdata = newdata)
names(pred_trees) <- paste0("tree_", seq_along(pred_trees))
# aggregate predictions
pred_trees <- as.data.frame(pred_trees) %>%
add_column(date = economics$date, .before = 1) %>%
gather(tree, value, matches("^tree")) %>%
group_by(date) %>%
mutate(mean_pred_from_trees = mean(value)) %>%
ungroup()
# add prediction from train object
pred_trees$bagging_model_prediction = predict(model, x)
pred_trees <- pred_trees %>%
gather(model, pred_value, 4:5)
# plot
p <- ggplot(data = pred_trees, aes(date)) +
geom_line(aes(y = value, group = tree), alpha = .2) +
geom_line(aes(y = pred_value, col = model)) +
theme_minimal() +
theme(
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position = "bottom"
)
p
}
Now after running:
preproc_opts <- list(unscaled = NULL,
scaled = c("center", "scale"))
models <- map(preproc_opts, function(preproc)
train(
x = x,
y = y,
trControl = trainControl(method = "none"), # since there are no tuning parameters for this model
metric = "RMSE",
method = "treebag",
preProcess = preproc
))
map2(.x = models,
.y = names(models),
.f = ~{predict_from_treebag(.x) + labs(title = .y)})
the result is in line with what we expect: in both models the bagged prediction is now the average of the 25 individual trees.
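As a sanity check (a sketch, assuming the objects defined above), we can confirm that the bagged prediction equals the average of the individual trees for both models:
check_average <- function(model) {
  bagged_trees <- map(model$finalModel$mtrees, pluck, "btree")
  # apply the same pre-processing the model was trained with, if any
  newdata <- if (is.null(model$preProcess)) x else predict(model$preProcess, x)
  tree_sum <- map(bagged_trees, predict, newdata = newdata) %>% reduce(`+`)
  tree_mean <- tree_sum / length(bagged_trees)
  all.equal(unname(tree_mean), unname(predict(model, x)))
}
map(models, check_average)  # should be TRUE for both models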

Plot beta distribution in R

Using the dataset Lahman::Batting, I've estimated parameters for a beta distribution. Now I want to plot this empirically derived beta distribution onto the histogram that I estimated it from.
library(dplyr)
library(tidyr)
library(Lahman)
career <- Batting %>%
filter(AB > 0) %>%
anti_join(Pitching, by = "playerID") %>%
group_by(playerID) %>%
summarize(H = sum(H), AB = sum(AB)) %>%
mutate(average = H / AB)
I can plot the distribution of batting averages as:
career %>%
filter(AB > 500) %>%
ggplot(aes(x = average)) +
geom_histogram() +
geom_freqpoly(color = "red")
And obtain a histogram with the frequency polygon overlaid in red (plot not shown). But instead of the jagged frequency polygon I want the smooth beta distribution. I can estimate the beta parameters by:
career_filtered <- career %>%
filter(AB >= 500)
m <- MASS::fitdistr(career_filtered$average, dbeta,
start = list(shape1 = 1, shape2 = 10))
alpha0 <- m$estimate[1] # parameter 1
beta0 <- m$estimate[2] # parameter 2
Now that I have the parameters alpha0 and beta0, how do I plot the beta density on top of the histogram (target plot not shown)?
This question is based on a post I'm reading here.
All code, including the code for the plots, can be found here. The following code is used to get the requested plot:
ggplot(career_filtered) +
geom_histogram(aes(average, y = ..density..), binwidth = .005) +
stat_function(fun = function(x) dbeta(x, alpha0, beta0), color = "red",
size = 1) +
xlab("Batting average")
Hope this helps.
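If you prefer to avoid stat_function() (a sketch, reusing career_filtered, alpha0, and beta0 from above), you can precompute the density on a grid and overlay it with geom_line():
beta_grid <- data.frame(x = seq(0, 0.4, length.out = 500))
beta_grid$density <- dbeta(beta_grid$x, alpha0, beta0)
ggplot(career_filtered) +
  geom_histogram(aes(average, y = ..density..), binwidth = .005) +
  geom_line(data = beta_grid, aes(x, density), color = "red", size = 1) +
  xlab("Batting average")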

How to graph regression coefficients (or other estimates of model parameters) from (nested) regression models by condition?

I'm trying to plot regression coefficients from a nested data frame: I ran a regression model (with multiple predictors) on the nested data within each of four conditions. Plotting the R-squared values per model per condition (see example) works, but now I'd like to plot the regression coefficients, first for x1 by condition (the b's for x1 in descending order) and then the same for x2 (or even faceted by predictor). Can someone help me out with the code?
Example of plotting R-squared values for multiple models:
# creating data example
library(modelr)
library(tidyverse)
set.seed(123)
data <- tibble(
condition = replicate(40, paste(sample(c("A", "B", "C", "D"), 1, replace=TRUE))),
x1 = rnorm(n = 40, mean = 10, sd = 2),
x2 = rnorm(n = 40, mean = 5, sd = 1.5),
y = x1*rnorm(n = 40, mean = 2, sd = 1) + x2*rnorm(n = 40, mean = 3, sd = 2))
by_condition <- data %>%
group_by(condition) %>%
nest()
# looking at data from first condition
by_condition$data[[1]]
# regression model function
reg.model <- function(df) {
lm(y ~ x1 + x2,
data = df)
}
# creating column with models per condition
by_condition <- by_condition %>%
mutate(model = map(data, reg.model))
# looking at reg. model for first group
by_condition$model[[1]]
summary(by_condition$model[[1]])
# graphing R-squared (ascending) per model by condition
glance <- by_condition %>%
mutate(glance = map(model, broom::glance)) %>%
unnest(glance)
glance %>%
ggplot(aes(x = reorder(condition, desc(r.squared)), y = r.squared)) +
geom_point() +
coord_flip() +
xlab("Condition") +
ggtitle("R Square of reg. model per Condition")
So this example works, but I don't know how to extract the coefficients separately and plot them in descending order by condition in similar graphs. Thanks!
I found the answer for plotting coefficients of (nested) regression models within different conditions (tidying kicks ass):
regression_output <- by_condition %>%
  mutate(regressions = map(model, broom::tidy))
regression_coefficients <- regression_output %>%
  unnest(regressions)
regression_coefficients %>%
ggplot(aes(x = term, y = estimate )) +
geom_point() +
coord_flip() +
facet_wrap(~ condition) +
xlab("predictor") +
ggtitle("Coefficients of reg. model per Condition")
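To get the descending order the question asks for, conditions sorted by coefficient size within each predictor's facet, one option (a sketch, assuming regression_coefficients from above) is reorder_within() from the tidytext package:
library(tidytext)
regression_coefficients %>%
  filter(term != "(Intercept)") %>%
  ggplot(aes(x = estimate, y = reorder_within(condition, estimate, term))) +
  geom_point() +
  scale_y_reordered() +  # strips the suffix that reorder_within() appends to the labels
  facet_wrap(~ term, scales = "free_y") +
  ylab("Condition") +
  ggtitle("Coefficients of reg. model per condition, sorted within predictor")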
