many models grouped modelr::add_predictions - r

I would like to use data labeled as train to fit a model then use data labeled as test to predict new values. I would like to do this in a "many models" scenario.
The following is my current set up. My problem is that I am training and adding predictions to all of the data. I don't know how to discriminate using modelr
library(modelr)
library(tidyverse)
library(gapminder)
# nest data by continent and label test/train data
nested_gap <- gapminder %>%
mutate(test_train = ifelse(year < 1992, "train", "test")) %>%
group_by(continent) %>%
nest()
# make a linear model function
cont_model <- function(df) {
lm(lifeExp ~ year, data = df)
}
# fit a model and add predictions to all data
fitted_gap <- nested_gap %>%
mutate(model = map(data, cont_model)) %>%
mutate(pred = map2(data, model, add_predictions))

This was the solution provided by #shuckle
library(modelr)
library(tidyverse)
library(gapminder)
# nest data by continent and label test/train data
nested_gap <- gapminder %>%
mutate(test_train = ifelse(year < 1992, "train", "test")) %>%
group_by(continent) %>%
nest()
# make a linear model function than only trains on training set
cont_model <- function(df) {
lm(lifeExp ~ year, data = df %>% filter(test_train == "train"))
}
# fit a model and add predictions to all data
fitted_gap <- nested_gap %>%
mutate(model = map(data, cont_model)) %>%
mutate(pred = map2(data, model, add_predictions))
# unnest predictions and filter only the test rows
fitted_gap %>%
unnest(pred) %>%
filter(test_train == "test")

Related

performing linear regression with interaction term (purrr package)

I am analyzing the data using ANCOVA method and also trying to run several multiple linear regression from the purrr package.
The model should be:
m1 <- lm(PD~SR*Type, data = df (with interaction term)
m2 <- lm(PD~SR+Type, data = df (without interaction term)
My example code
library(broom)
library(purrr)
df <- data.frame(PD=c(10,20,30,40,50,10,20,33,12,52,21,43),
SR=c(5,10,20,24,6,21,59,22,1,11,12,3),
n=c("tree1", "tree1", "tree1",
"tree2", "tree2","tree2",
"tree3", "tree3", "tree3",
"tree4", "tree4","tree4"),
Type=c("a", "b",'c',
"a", "b",'c',
"a", "b",'c',
"a", "b",'c'))
x <- df %>%
nest(data = -n) %>%
mutate(fit = map(data, ~lm(PD~SR * Type, data= .x)),
tidied = map(fit, tidy),
glanced = map(fit, glance),
augmented = map(fit, augment))
tidy <- x %>% unnest(data) %>%
select(n, Type, PD, SR, tidied) %>%
unnest(tidied) %>%
select(n, Type, PD, SR, term, estimate) %>%
rename(, species_richness = SR)%>%
spread(term, estimate)
My problem is that I am expecting the intercept/slope is different to each Type.
However, the intercept/slope given by the code is the same for each Type.
You can check from tidy dataframe.
Any suggestions for this point? I am learning how to use the purrr package since it is quite powerfull.
or can u have other methods to do that?

Using PDP with nested GBM's with map function

I have a nested GBM, and am looking to extract the partial depndence, tryingto use the following query:
library(rsample) # data splitting
library(gbm) # basic implementation
library(xgboost) # a faster implementation of gbm
library(caret) # an aggregator package for performing many machine learning models
library(h2o) # a java-based platform
library(pdp) # model visualization
basic_gbm <- function(data) {
mymodel <- gbm(formula = mpg ~ . ,
distribution = "gaussian",
data = data ,
n.minobsinnode = 1,
bag.fraction = 1
)
return(mymodel)
}
blah_model <- mtcars %>%
group_by() %>%
nest() %>%
mutate(model = map(data, basic_gbm))
blah_summary <- mtcars %>%
group_by() %>%
nest() %>%
mutate(model = map(data, basic_gbm)) %>%
mutate(summary = map(model, summary)) %>%
mutate(all_data = pmap(list(data, summary), .f =left_join, by = character())) %>%
select(cols=c(all_data)) %>%
unnest(cols = c(cols)) %>%
ungroup()
blah_model %>%
left_join(blah_summary, by = character()) %>%
mutate(pred = map(model, partial, pred.var = var, n.trees = model$n.trees, train = data)) -- this does not work
This does work and is what I would want as a nested df for each var:
coeffs <- blah_model$model[[1]] %>%
partial(pred.var = 'disp', n.trees = blah_model$model[[1]]$n.trees, train = blah_model$data[[1]])
However, it is saying it is not finding the variables in the training data - the data I am passing through is the training data. The var in the map is from the summary functions - these are prediction variables.
I gave a better example

multiple models: how to select best model and make prediction

My task is to create many models, choose model that predict best and pass data to this model for prediction. Example inspired from R for data science book
library(modelr)
library(tidyverse)
library(gapminder)
gapminder
country_model1 <- function(df) {lm(lifeExp ~ year, data = df)}
country_model2 <- function(df) {lm(lifeExp ~ year+gdpPercap, data = df)}
country_model3 <- function(df) {lm(lifeExp ~ year+gdpPercap+pop, data = df)}
by_country <- gapminder %>%
group_by(country, continent) %>%
nest() %>%
mutate(model1 = map(data, country_model1),
model2 = map(data, country_model2),
model3 = map(data, country_model3))
So I have 3 models for each country.
I can find r squared for each model, but stopped here :(
r_sq <- by_country %>%
mutate(glance1 = map(model1, broom::glance),
glance2 = map(model2, broom::glance),
glance3 = map(model3, broom::glance)) %>%
unnest(glance1:glance3, .drop = TRUE) %>%
select(country, continent, starts_with('r.sq'))
How to in tidy way:
select which of 3 make better prediction for each particular country?
pass new data to chosen model and have prediction back?
We can identify the model with the highest r^2 for for each country like this:
best_fits <- r_sq %>%
pivot_longer(-c(country, continent), names_to = "r_sq_version") %>%
group_by(country, continent) %>%
slice_max(value) %>%
ungroup()
Not too surprisingly, the third model (called here r.squared2 from its name in r_sq) consistently provides the highest correlation, since that model takes more inputs and has more degrees of freedom.
Let's make some new data, taking the original but adding 100 years to the dates.
by_country_new <- gapminder %>%
group_by(country, continent) %>%
mutate(year = year + 100,
gdpPercap = gdpPercap,
pop = pop) %>%
select(-lifeExp) %>% # Presumably we don't know this and are trying to predict using known data
nest()
We could then apply the best model for each country to the new data: (Thanks to #mrflick for https://stackoverflow.com/a/63201855/6851825)
best_fits %>%
left_join(by_country) %>%
left_join(by_country_new, by = c("country", "continent")) %>%
mutate(best_model = case_when(
r_sq_version == "r.squared2" ~ model3,
r_sq_version == "r.squared1" ~ model2,
r_sq_version == "r.squared" ~ model1,
)) %>%
select(-c(model1:model3)) %>%
mutate(prediction = map2(best_model, data.y,
~broom::augment(.x, newdata = .y))) -> new_fits
We can then see how these predictions look like a continuation of the time trend established in the original data (with some other variation due to changes in population and gdp in our new data).
new_predictions <- new_fits %>%
filter(country == "Afghanistan") %>%
select(prediction) %>%
unnest_wider(prediction) %>%
flatten_dfr() %>%
rename(lifeExp = ".fitted")
gapminder %>%
filter(country == "Afghanistan") %>%
bind_rows(new_predictions) %>%
ggplot(aes(year, lifeExp)) +
geom_point() +
labs(title = "Afghanistan extrapolated lifeExp")

Getting summaries from multiple simple linear regressions in R

I am performing simple linear regressions on multiple groups from my data set. However, I want extract the summaries from each of these regressions and put them into a master table by organized by group. I can run it like this (and it works):
fit_basic <- rs2_anova %>% #Run multiple simple linear regressions
group_by(quant_method) %>%
nest() %>%
mutate(model = map(data, ~lm(recoveries ~ treatment, data = .)))
fit_basic_A <- fit_basic[[1,"model"]] #Remove the model from fit_basic
fit_basic_B <- fit_basic[[1,"model"]] #Remove the model from fit_basic
fit_basic_table_A <- get_regression_table(fit_basic_A) %>%
select("term", "estimate") %>%
pivot_wider(names_from = "term", values_from = "estimate") %>%
mutate(quant_method = "A")
fit_basic_table_B <- get_regression_table(fit_basic_A) %>%
select("term", "estimate") %>%
pivot_wider(names_from = "term", values_from = "estimate") %>%
mutate(quant_method = "B")
fit_basic_table <- rbind(fit_basic_table_A, fit_basic_table_B)
To save myself some lines of code (because I have many more groups than presented here) I thought I could use the map function, but I keep getting stuck at mapping the summary table, which throws an error:
fit_basic <- rs2_anova %>%
group_by(quant_method) %>%
nest() %>%
mutate(model = map(data, ~lm(recoveries ~ treatment, data = .))) %>%
mutate(summaries = map(data, get_regression_table(.$model)))
Error in input_checks(model, digits, print) :
Only simple linear regression models are supported. Try again using only `lm()` models as appropriate.
I also tried something along this line:
fit_basic_table <- map(fit_basic$model,
function(x) {
p <- get_regression_table(x)
cbind(par=rownames(p), p)
})
But I get a list of dataframes that I can't breakdown into a single dataframe and I have lost my group designations. I have tried:
fit_basic_table <- map(fit_basic$model,
function(x) {
p <- get_regression_table(x)
cbind(par=rownames(p), p)
}) %>%
map_df(as_tibble, .id = "id")
and
fit_basic_table <- map(fit_basic$model,
function(x) {
p <- get_regression_table(x)
cbind(par=rownames(p), p)
}) %>%
unnest(cols = "id")
Any thoughts on how to automate this?
*Random test dataframe:
quant_method <- c("A", "A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B")
treatment <- c("x","x","x","x","x","y","y","y","y","y","x","x","x","x","x","y","y","y","y","y")
recoveries <-c("88","86","87","82","85","76","65","55","72","71","98","96","97","92","99","66",
"55","55","62","61")
rs2_anova <- data.frame(quant_method, treatment, recoveries)
Here is one solution using tidyverse and broom packages. It is slightly different from the purrr method you were attempting, but I think the result shows the terms that you were interested in extracting from the lm object (i.e., term and estimate).
library(tidyverse)
library(broom)
#Added the stringsAsFactors argument = F to avoid an error in the lm model
rs2_anova <- data.frame(quant_method,
treatment,
recoveries,
stringsAsFactors = F)
fit_basic <- rs2_anova %>%
#Group by quant_method column
group_by(quant_method) %>%
#do the linear models by grouping var
do(model = lm(recoveries ~ treatment, data = .)) %>%
#tidy lm object and order it as tibble
tidy(model)
I found the answer here: unnest a list column after modeling after group_by in r
and modified it to:
fit_cec <- rs2_anova %>%
group_by(quant_method) %>%
nest %>%
mutate(data = map(data, ~ .x %>%
summarise(model = list(broom::tidy(lm(recoveries ~ treatment)))))) %>%
unnest(data) %>%
unnest(model)
or to get all estimates, predictions and summaries (https://drsimonj.svbtle.com/running-a-model-on-separate-groups). This also works nicely:
fit_cec <- rs2_anova %>%
group_by(quant_method) %>%
nest %>%
mutate(fit = map(data, ~ lm(loss_abs_BC_2 ~ cec, data = .)),
parameters = map(fit, tidy), #provides estimate table for slope and y-intercept with std.error and estimate p-values
summary = map(fit, broom::glance), #provides R2, adj.R2, sigma, and model p-value
predictions = map(fit, augment)) %>% #provides fitted values with residuals and errors
unnest(parameters) %>%
pivot_wider(names_from = term, values_from = c(estimate, std.error, statistic, p.value)) %>%
unnest(summary) %>%
unnest(predictions)

Can group by and tidy be used to fit multilevel models (lme) to repeated measures/longitudinal data?

Consider the following example longitudinal/repeated measures dataset
library(tidyverse)
library(broom)
library(nlme)
data <- read.csv("https://stats.idre.ucla.edu/stat/data/study2.csv")
data <- data %>% mutate(dbp =rnorm(120, 30:150), sbp = rnorm(120, 50:200),bmi
= rnorm(120,15:40), chol = rnorm(120,50:350), insulin = rnorm(120,2:40), educ = rnorm(120,5:10))
I can use group_by %>% do(tidy(*)) to run several unadjusted and adjusted single-level regression models (looping through list of outcomes & exposures) and extract model results to the data frame as follows
out <-c("pulse","insulin","chol")
exp <- c("factor(exertype)","sbp","dbp")
conf <- c("bmi","factor(diet)")
#Unadjusted models - single level regression (lm)
#################################################
Unadjusted <- expand.grid(out, exp) %>%
group_by(Var1) %>% rowwise() %>%
summarise(frm = paste0(Var1, "~", Var2)) %>%
group_by(model_id = row_number(),frm) %>%
do(tidy(lm(.$frm, data = data))) %>%
mutate(lci = estimate-(1.96*std.error)) %>%
mutate(uci = estimate+(1.96*std.error))
#Adjusted models - single level regression (lm)
###############################################
Adjusted <- expand.grid(out, exp, conf) %>%
group_by(Var1, Var2) %>%
summarise(Var3 = paste0(Var3, collapse = "+")) %>%
rowwise() %>%
summarise(frm = paste0(Var1, "~", Var2, "+", Var3)) %>%
group_by(model_id = row_number(), frm) %>%
do(tidy(lm(.$frm, data = data))) %>%
mutate(lci = estimate-(1.96*std.error)) %>%
mutate(uci = estimate+(1.96*std.error))
I would like to use the same process to fit multilevel models to account for the repeated data. using the example code:
lme(sbp ~ pulse+factor(diet)+time, data=data, random= ~time|id, method ="ML")
HOWEVER, when I try to do this, e.g. using:
#Unadjusted models - multi-level regression (lme)
#################################################
Unadjusted <- expand.grid(out, exp) %>%
group_by(Var1) %>% rowwise() %>%
summarise(frm = paste0(Var1, "~", Var2)) %>%
group_by(model_id = row_number(),frm) %>%
do(tidy(lme(.$frm, data = data, random= ~time|id, method = "ML"))) %>%
mutate(lci = estimate-(1.96*std.error)) %>%
mutate(uci = estimate+(1.96*std.error))
I get the following error message:
Error in UseMethod("lme") : no applicable method for 'lme' applied to an object of class "character"
Any ideas on how to get this working for lme type models?
Unlike lm, lme will not accept a formula as a character value. You need to explicitly cast it to a formula. Try just adding as.formula()
do(tidy(lme(as.formula(.$frm), ...))) %>%

Resources