multiple models: how to select best model and make prediction - r

My task is to create many models, choose the model that predicts best, and pass new data to that model for prediction. The example is inspired by the R for Data Science book.
library(modelr)
library(tidyverse)
library(gapminder)
gapminder
country_model1 <- function(df) {lm(lifeExp ~ year, data = df)}
country_model2 <- function(df) {lm(lifeExp ~ year+gdpPercap, data = df)}
country_model3 <- function(df) {lm(lifeExp ~ year+gdpPercap+pop, data = df)}
by_country <- gapminder %>%
  group_by(country, continent) %>%
  nest() %>%
  mutate(model1 = map(data, country_model1),
         model2 = map(data, country_model2),
         model3 = map(data, country_model3))
So I have 3 models for each country.
I can find R-squared for each model, but got stuck here :(
r_sq <- by_country %>%
  mutate(glance1 = map(model1, broom::glance),
         glance2 = map(model2, broom::glance),
         glance3 = map(model3, broom::glance)) %>%
  unnest(glance1:glance3, .drop = TRUE) %>%
  select(country, continent, starts_with('r.sq'))
How can I, in a tidy way:
select which of the 3 models makes the best prediction for each particular country?
pass new data to the chosen model and get predictions back?

We can identify the model with the highest R-squared for each country like this:
best_fits <- r_sq %>%
  pivot_longer(-c(country, continent), names_to = "r_sq_version") %>%
  group_by(country, continent) %>%
  slice_max(value) %>%
  ungroup()
Not too surprisingly, the third model (called r.squared2 here, after its column name in r_sq) consistently provides the highest R-squared, since it has the most predictors, and adding a predictor can never decrease a model's in-sample R-squared.
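(An aside, not from the original answer: because raw R-squared always favors the largest model, a hedged alternative is to rank the candidate models on a complexity-penalized criterion such as AIC or adjusted R-squared, both of which broom::glance() already returns. A minimal sketch:)
best_by_aic <- by_country %>%
  ungroup() %>%
  pivot_longer(model1:model3, names_to = "model_name", values_to = "model") %>%
  mutate(stats = map(model, broom::glance)) %>%
  unnest(stats) %>%
  group_by(country, continent) %>%
  slice_min(AIC, n = 1) %>% # lower AIC = better fit after penalizing complexity
  ungroup() %>%
  select(country, continent, model_name, adj.r.squared, AIC)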
Let's make some new data, taking the original but adding 100 years to the dates.
by_country_new <- gapminder %>%
  group_by(country, continent) %>%
  mutate(year = year + 100,
         gdpPercap = gdpPercap,
         pop = pop) %>%
  select(-lifeExp) %>% # Presumably we don't know this and are trying to predict it using known data
  nest()
We could then apply the best model for each country to the new data (thanks to @mrflick for https://stackoverflow.com/a/63201855/6851825):
best_fits %>%
  left_join(by_country) %>%
  left_join(by_country_new, by = c("country", "continent")) %>%
  mutate(best_model = case_when(
    r_sq_version == "r.squared2" ~ model3,
    r_sq_version == "r.squared1" ~ model2,
    r_sq_version == "r.squared"  ~ model1
  )) %>%
  select(-c(model1:model3)) %>%
  mutate(prediction = map2(best_model, data.y,
                           ~broom::augment(.x, newdata = .y))) -> new_fits
We can then see that these predictions look like a continuation of the time trend established in the original data (with some additional variation due to changes in population and GDP in our new data).
new_predictions <- new_fits %>%
  filter(country == "Afghanistan") %>%
  select(prediction) %>%
  unnest_wider(prediction) %>%
  flatten_dfr() %>%
  rename(lifeExp = ".fitted")

gapminder %>%
  filter(country == "Afghanistan") %>%
  bind_rows(new_predictions) %>%
  ggplot(aes(year, lifeExp)) +
  geom_point() +
  labs(title = "Afghanistan extrapolated lifeExp")

Related

Predict in workflow throws that column doesn't exist

Given the following code
library(tidyverse)
library(lubridate)
library(tidymodels)
library(ranger)

df <- read_csv("https://raw.githubusercontent.com/norhther/datasets/main/bitcoin.csv")
df <- df %>%
  mutate(Date = dmy(Date),
         Change_Percent = str_replace(Change_Percent, "%", ""),
         Change_Percent = as.double(Change_Percent)) %>%
  filter(year(Date) > 2017)

int <- interval(ymd("2020-01-20"),
                ymd("2022-01-15"))
df <- df %>%
  mutate(covid = ifelse(Date %within% int, TRUE, FALSE))

df %>%
  ggplot(aes(x = Date, y = Price, color = covid)) +
  geom_line()

df <- df %>%
  arrange(Date) %>%
  mutate(lag1 = lag(Price),
         lag2 = lag(lag1),
         lag3 = lag(lag2),
         profit_next_day = lead(Profit))
# modeling
df_mod <- df %>%
  select(-covid, -Date, -Vol_K, -Profit) %>%
  mutate(profit_next_day = as.factor(profit_next_day))

set.seed(42)
data_split <- initial_split(df_mod) # 3/4 split
train_data <- training(data_split)
test_data <- testing(data_split)

bitcoin_rec <-
  recipe(profit_next_day ~ ., data = train_data) %>%
  step_naomit(all_outcomes(), all_predictors()) %>%
  step_normalize(all_numeric_predictors())

bitcoin_prep <-
  prep(bitcoin_rec)

bitcoin_train <- juice(bitcoin_prep)
bitcoin_test <- bake(bitcoin_prep, test_data)

rf_spec <-
  rand_forest(trees = 200) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

bitcoin_wflow <-
  workflow() %>%
  add_model(rf_spec) %>%
  add_recipe(bitcoin_prep)

bitcoin_fit <-
  bitcoin_wflow %>%
  fit(data = train_data)

final_model <- last_fit(bitcoin_wflow, data_split)
collect_metrics(final_model)

final_model %>%
  extract_workflow() %>%
  predict(test_data)
The last chunk of code, which extracts the workflow and predicts on test_data, throws the error:
Error in stop_subscript(): ! Can't subset columns that don't exist.
x Column profit_next_day doesn't exist.
but profit_next_day already exists in test_data, as I have checked multiple times, so I don't know what is happening. I have never had this error before when working with tidymodels.
The problem here comes from using step_naomit() on the outcome. In general, steps that change rows (such as removing them) can be pretty tricky when it comes time to resample or predict on new data. You can read more in detail in our book, but I would suggest that you remove step_naomit() altogether from your recipe and change your earlier code to:
df_mod <- df %>%
  select(-covid, -Date, -Vol_K, -Profit) %>%
  mutate(profit_next_day = as.factor(profit_next_day)) %>%
  na.omit()
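(A hedged aside, beyond the answer above: if you do want to keep a row-removing step inside the recipe, every recipes step accepts a skip argument; with skip = TRUE the step runs when the model is fitted but is skipped when bake() or predict() processes new data. A minimal sketch, restricting the step to predictors:)
bitcoin_rec <-
  recipe(profit_next_day ~ ., data = train_data) %>%
  # skip = TRUE: drop incomplete rows at fit time only, never at predict time
  step_naomit(all_predictors(), skip = TRUE) %>%
  step_normalize(all_numeric_predictors())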

Random forest only predicting one kind of class in tidymodels

I have the following code:
library(tidymodels)
library(tidyverse)

olympics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-27/olympics.csv')

olympics <- olympics %>%
  drop_na()

olympics %>%
  filter(medal == "Gold") %>%
  group_by(team, medal) %>%
  summarize(n = n()) %>%
  ungroup() %>%
  top_n(10, n)

olympics <- olympics %>%
  select(sex, age, height, weight, year, season, sport, medal)

split <- initial_split(olympics)
train_data <- training(split)
test_data <- testing(split)

prep_recipe <- recipe(medal ~ ., data = train_data) %>%
  step_normalize(all_numeric()) %>%
  step_zv(all_numeric()) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  prep()

training_preproc <- juice(prep_recipe)
testing_preproc <- bake(prep_recipe, test_data)

training_preproc %>%
  count(medal)
#-----
dt_spec <- rand_forest(trees = 1000) %>%
  set_engine("ranger") %>%
  set_mode("classification")

dt_fitted <- dt_spec %>%
  fit(medal ~ ., data = training_preproc) # fit the specification

predict(dt_fitted, training_preproc) %>%
  count(.pred_class)

predict(dt_fitted, training_preproc) %>%
  bind_cols(training_preproc$medal) %>%
  mutate(correct = ifelse(.pred_class == ...2, 1, 0)) %>%
  summarize(sum(correct))

nrow(training_preproc)
The random forest spec is giving me the following output:
1) root 22635 15003 Bronze (0.3371769 0.3354539 0.3273691)
Basically, it is assigning Bronze to everything. However, counting the number of medals in the training dataset, I get:
  medal      n
  <fct>  <int>
1 Bronze  7632
2 Gold    7593
3 Silver  7410
So I don't know why it is giving me these poor predictions; maybe I'm doing something wrong here? If the number of Bronze medals were much greater than in the other classes I could understand it, but with these counts I can't.
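(An editorial aside, not part of the original thread: a tidier way to inspect the per-class predictions than the bind_cols() trick above is parsnip's augment() together with yardstick. A minimal sketch, assuming the dt_fitted object from the question:)
library(yardstick)
# augment() attaches .pred_class (and class probabilities) to the data;
# conf_mat() then cross-tabulates predictions against the true medals
augment(dt_fitted, training_preproc) %>%
  conf_mat(truth = medal, estimate = .pred_class)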

Getting summaries from multiple simple linear regressions in R

I am performing simple linear regressions on multiple groups from my data set. However, I want to extract the summaries from each of these regressions and put them into a master table organized by group. I can run it like this (and it works):
library(moderndive) # provides get_regression_table()

fit_basic <- rs2_anova %>% # run multiple simple linear regressions
  group_by(quant_method) %>%
  nest() %>%
  mutate(model = map(data, ~lm(recoveries ~ treatment, data = .)))

fit_basic_A <- fit_basic[[1, "model"]] # extract the first group's model from fit_basic
fit_basic_B <- fit_basic[[2, "model"]] # extract the second group's model from fit_basic

fit_basic_table_A <- get_regression_table(fit_basic_A) %>%
  select("term", "estimate") %>%
  pivot_wider(names_from = "term", values_from = "estimate") %>%
  mutate(quant_method = "A")

fit_basic_table_B <- get_regression_table(fit_basic_B) %>%
  select("term", "estimate") %>%
  pivot_wider(names_from = "term", values_from = "estimate") %>%
  mutate(quant_method = "B")

fit_basic_table <- rbind(fit_basic_table_A, fit_basic_table_B)
To save myself some lines of code (because I have many more groups than presented here) I thought I could use the map function, but I keep getting stuck at mapping the summary table, which throws an error:
fit_basic <- rs2_anova %>%
  group_by(quant_method) %>%
  nest() %>%
  mutate(model = map(data, ~lm(recoveries ~ treatment, data = .))) %>%
  mutate(summaries = map(data, get_regression_table(.$model)))
Error in input_checks(model, digits, print) :
Only simple linear regression models are supported. Try again using only `lm()` models as appropriate.
I also tried something along this line:
fit_basic_table <- map(fit_basic$model,
                       function(x) {
                         p <- get_regression_table(x)
                         cbind(par = rownames(p), p)
                       })
But I get a list of dataframes that I can't break down into a single dataframe, and I have lost my group designations. I have tried:
fit_basic_table <- map(fit_basic$model,
                       function(x) {
                         p <- get_regression_table(x)
                         cbind(par = rownames(p), p)
                       }) %>%
  map_df(as_tibble, .id = "id")
and
fit_basic_table <- map(fit_basic$model,
                       function(x) {
                         p <- get_regression_table(x)
                         cbind(par = rownames(p), p)
                       }) %>%
  unnest(cols = "id")
Any thoughts on how to automate this?
*Random test dataframe:
quant_method <- c("A","A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B")
treatment <- c("x","x","x","x","x","y","y","y","y","y","x","x","x","x","x","y","y","y","y","y")
recoveries <- c("88","86","87","82","85","76","65","55","72","71","98","96","97","92","99","66","55","55","62","61")
rs2_anova <- data.frame(quant_method, treatment, recoveries)
Here is one solution using the tidyverse and broom packages. It is slightly different from the purrr method you were attempting, but I think the result shows the terms you were interested in extracting from the lm object (i.e., term and estimate).
library(tidyverse)
library(broom)

#Added stringsAsFactors = FALSE to avoid an error in the lm model, and
#converted recoveries (defined above as quoted strings) to numeric so that
#lm() gets a continuous response
rs2_anova <- data.frame(quant_method,
                        treatment,
                        recoveries,
                        stringsAsFactors = FALSE) %>%
  mutate(recoveries = as.numeric(recoveries))

fit_basic <- rs2_anova %>%
  #Group by the quant_method column
  group_by(quant_method) %>%
  #do the linear models by grouping var
  do(model = lm(recoveries ~ treatment, data = .)) %>%
  #tidy the lm object and return it as a tibble
  tidy(model)
I found the answer here: unnest a list column after modeling after group_by in r
and modified it to:
fit_cec <- rs2_anova %>%
  group_by(quant_method) %>%
  nest() %>%
  mutate(data = map(data, ~ .x %>%
                      summarise(model = list(broom::tidy(lm(recoveries ~ treatment)))))) %>%
  unnest(data) %>%
  unnest(model)
or, to get all estimates, predictions, and summaries (https://drsimonj.svbtle.com/running-a-model-on-separate-groups), this also works nicely:
fit_cec <- rs2_anova %>%
  group_by(quant_method) %>%
  nest() %>%
  mutate(fit = map(data, ~ lm(recoveries ~ treatment, data = .)),
         parameters = map(fit, tidy),          # estimate table for slope and intercept, with std.errors and p-values
         summary = map(fit, broom::glance),    # R2, adj. R2, sigma, and model p-value
         predictions = map(fit, augment)) %>%  # fitted values with residuals and errors
  unnest(parameters) %>%
  pivot_wider(names_from = term, values_from = c(estimate, std.error, statistic, p.value)) %>%
  unnest(summary) %>%
  unnest(predictions)

Can group by and tidy be used to fit multilevel models (lme) to repeated measures/longitudinal data?

Consider the following example longitudinal/repeated measures dataset
library(tidyverse)
library(broom)
library(nlme)
data <- read.csv("https://stats.idre.ucla.edu/stat/data/study2.csv")
data <- data %>% mutate(dbp = rnorm(120, 30:150),
                        sbp = rnorm(120, 50:200),
                        bmi = rnorm(120, 15:40),
                        chol = rnorm(120, 50:350),
                        insulin = rnorm(120, 2:40),
                        educ = rnorm(120, 5:10))
I can use group_by %>% do(tidy(*)) to run several unadjusted and adjusted single-level regression models (looping through lists of outcomes and exposures) and extract the model results to a data frame as follows:
out <-c("pulse","insulin","chol")
exp <- c("factor(exertype)","sbp","dbp")
conf <- c("bmi","factor(diet)")
#Unadjusted models - single level regression (lm)
#################################################
Unadjusted <- expand.grid(out, exp) %>%
group_by(Var1) %>% rowwise() %>%
summarise(frm = paste0(Var1, "~", Var2)) %>%
group_by(model_id = row_number(),frm) %>%
do(tidy(lm(.$frm, data = data))) %>%
mutate(lci = estimate-(1.96*std.error)) %>%
mutate(uci = estimate+(1.96*std.error))
#Adjusted models - single level regression (lm)
###############################################
Adjusted <- expand.grid(out, exp, conf) %>%
group_by(Var1, Var2) %>%
summarise(Var3 = paste0(Var3, collapse = "+")) %>%
rowwise() %>%
summarise(frm = paste0(Var1, "~", Var2, "+", Var3)) %>%
group_by(model_id = row_number(), frm) %>%
do(tidy(lm(.$frm, data = data))) %>%
mutate(lci = estimate-(1.96*std.error)) %>%
mutate(uci = estimate+(1.96*std.error))
I would like to use the same process to fit multilevel models that account for the repeated measures, using this example code:
lme(sbp ~ pulse+factor(diet)+time, data=data, random= ~time|id, method ="ML")
HOWEVER, when I try to do this, e.g. using:
#Unadjusted models - multi-level regression (lme)
#################################################
Unadjusted <- expand.grid(out, exp) %>%
  group_by(Var1) %>% rowwise() %>%
  summarise(frm = paste0(Var1, "~", Var2)) %>%
  group_by(model_id = row_number(), frm) %>%
  do(tidy(lme(.$frm, data = data, random = ~time|id, method = "ML"))) %>%
  mutate(lci = estimate - (1.96 * std.error)) %>%
  mutate(uci = estimate + (1.96 * std.error))
I get the following error message:
Error in UseMethod("lme") : no applicable method for 'lme' applied to an object of class "character"
Any ideas on how to get this working for lme type models?
Unlike lm, lme will not accept a formula given as a character value; you need to explicitly cast it to a formula. Try just adding as.formula():
do(tidy(lme(as.formula(.$frm), ...))) %>%
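(For completeness, a hedged sketch of the unadjusted pipeline with that cast applied. Note one assumption: in current broom releases the tidy() methods for nlme::lme objects live in the broom.mixed package, so load it first:)
library(broom.mixed) # tidy() methods for lme objects

Unadjusted <- expand.grid(out, exp) %>%
  rowwise() %>%
  summarise(frm = paste0(Var1, "~", Var2)) %>%
  group_by(model_id = row_number(), frm) %>%
  # as.formula() turns the character string into a formula lme will accept
  do(tidy(lme(as.formula(.$frm), data = data, random = ~time|id, method = "ML"))) %>%
  mutate(lci = estimate - (1.96 * std.error),
         uci = estimate + (1.96 * std.error))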

many models grouped modelr::add_predictions

I would like to use the data labeled as train to fit a model, then use the data labeled as test to predict new values. I would like to do this in a "many models" scenario.
The following is my current setup. My problem is that I am training on, and adding predictions to, all of the data, and I don't know how to discriminate between the two sets using modelr:
library(modelr)
library(tidyverse)
library(gapminder)

# nest data by continent and label test/train data
nested_gap <- gapminder %>%
  mutate(test_train = ifelse(year < 1992, "train", "test")) %>%
  group_by(continent) %>%
  nest()

# make a linear model function
cont_model <- function(df) {
  lm(lifeExp ~ year, data = df)
}

# fit a model and add predictions to all data
fitted_gap <- nested_gap %>%
  mutate(model = map(data, cont_model)) %>%
  mutate(pred = map2(data, model, add_predictions))
This was the solution provided by @shuckle:
library(modelr)
library(tidyverse)
library(gapminder)

# nest data by continent and label test/train data
nested_gap <- gapminder %>%
  mutate(test_train = ifelse(year < 1992, "train", "test")) %>%
  group_by(continent) %>%
  nest()

# make a linear model function that only trains on the training set
cont_model <- function(df) {
  lm(lifeExp ~ year, data = df %>% filter(test_train == "train"))
}

# fit a model and add predictions to all data
fitted_gap <- nested_gap %>%
  mutate(model = map(data, cont_model)) %>%
  mutate(pred = map2(data, model, add_predictions))

# unnest predictions and keep only the test rows
fitted_gap %>%
  unnest(pred) %>%
  filter(test_train == "test")
