Getting summaries from multiple simple linear regressions in R - r

I am performing simple linear regressions on multiple groups from my data set. However, I want extract the summaries from each of these regressions and put them into a master table by organized by group. I can run it like this (and it works):
fit_basic <- rs2_anova %>% #Run multiple simple linear regressions
group_by(quant_method) %>%
nest() %>%
mutate(model = map(data, ~lm(recoveries ~ treatment, data = .)))
fit_basic_A <- fit_basic[[1,"model"]] #Remove the model from fit_basic
fit_basic_B <- fit_basic[[1,"model"]] #Remove the model from fit_basic
fit_basic_table_A <- get_regression_table(fit_basic_A) %>%
select("term", "estimate") %>%
pivot_wider(names_from = "term", values_from = "estimate") %>%
mutate(quant_method = "A")
fit_basic_table_B <- get_regression_table(fit_basic_A) %>%
select("term", "estimate") %>%
pivot_wider(names_from = "term", values_from = "estimate") %>%
mutate(quant_method = "B")
fit_basic_table <- rbind(fit_basic_table_A, fit_basic_table_B)
To save myself some lines of code (because I have many more groups than presented here) I thought I could use the map function, but I keep getting stuck at mapping the summary table, which throws an error:
fit_basic <- rs2_anova %>%
group_by(quant_method) %>%
nest() %>%
mutate(model = map(data, ~lm(recoveries ~ treatment, data = .))) %>%
mutate(summaries = map(data, get_regression_table(.$model)))
Error in input_checks(model, digits, print) :
Only simple linear regression models are supported. Try again using only `lm()` models as appropriate.
I also tried something along this line:
fit_basic_table <- map(fit_basic$model,
function(x) {
p <- get_regression_table(x)
cbind(par=rownames(p), p)
})
But I get a list of dataframes that I can't breakdown into a single dataframe and I have lost my group designations. I have tried:
fit_basic_table <- map(fit_basic$model,
function(x) {
p <- get_regression_table(x)
cbind(par=rownames(p), p)
}) %>%
map_df(as_tibble, .id = "id")
and
fit_basic_table <- map(fit_basic$model,
function(x) {
p <- get_regression_table(x)
cbind(par=rownames(p), p)
}) %>%
unnest(cols = "id")
Any thoughts on how to automate this?
*Random test dataframe:
quant_method <- c("A", "A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B")
treatment <- c("x","x","x","x","x","y","y","y","y","y","x","x","x","x","x","y","y","y","y","y")
recoveries <-c("88","86","87","82","85","76","65","55","72","71","98","96","97","92","99","66",
"55","55","62","61")
rs2_anova <- data.frame(quant_method, treatment, recoveries)

Here is one solution using tidyverse and broom packages. It is slightly different from the purrr method you were attempting, but I think the result shows the terms that you were interested in extracting from the lm object (i.e., term and estimate).
library(tidyverse)
library(broom)
#Added the stringsAsFactors argument = F to avoid an error in the lm model
rs2_anova <- data.frame(quant_method,
treatment,
recoveries,
stringsAsFactors = F)
fit_basic <- rs2_anova %>%
#Group by quant_method column
group_by(quant_method) %>%
#do the linear models by grouping var
do(model = lm(recoveries ~ treatment, data = .)) %>%
#tidy lm object and order it as tibble
tidy(model)

I found the answer here: unnest a list column after modeling after group_by in r
and modified it to:
fit_cec <- rs2_anova %>%
group_by(quant_method) %>%
nest %>%
mutate(data = map(data, ~ .x %>%
summarise(model = list(broom::tidy(lm(recoveries ~ treatment)))))) %>%
unnest(data) %>%
unnest(model)
or to get all estimates, predictions and summaries (https://drsimonj.svbtle.com/running-a-model-on-separate-groups). This also works nicely:
fit_cec <- rs2_anova %>%
group_by(quant_method) %>%
nest %>%
mutate(fit = map(data, ~ lm(loss_abs_BC_2 ~ cec, data = .)),
parameters = map(fit, tidy), #provides estimate table for slope and y-intercept with std.error and estimate p-values
summary = map(fit, broom::glance), #provides R2, adj.R2, sigma, and model p-value
predictions = map(fit, augment)) %>% #provides fitted values with residuals and errors
unnest(parameters) %>%
pivot_wider(names_from = term, values_from = c(estimate, std.error, statistic, p.value)) %>%
unnest(summary) %>%
unnest(predictions)

Related

How to use purrr to pluck/keep some elements from a list of linear regression fit objects?

I have a list of linear regression fit objects. Let's create it in this example by:
c('hp', 'wt', 'disp') %>%
paste('mpg ~', .) %>%
map(as.formula) %>%
map(lm, data = mtcars)
What I would like is to keep just the residuals and fitted.values from each of the regression fit objects, within this same pipeline. I was trying to use the keep function, but it doesn't work:
c('hp', 'wt', 'disp') %>%
paste('mpg ~', .) %>%
map(as.formula) %>%
map(lm, data = mtcars) %>%
map(keep, names(.) %in% c("residuals", "fitted.values"))
Error:
Error in probe(.x, .p, ...) : length(.p) == length(.x) is not TRUE
How can I perform this action?
If a data frame is wanted as output then use the code below or if a list is wanted omit the bind_rows line.
library(dplyr)
library(purrr)
nms <- c('hp', 'wt', 'disp')
out <- nms %>%
set_names(x = map(paste('mpg ~', .), as.formula)) %>%
map(lm, data = mtcars) %>%
map(~ data.frame(fit = fitted(.), resid = resid(.))) %>%
bind_rows(.id = "id")
We can simplify this slightly using sapply as it will add names and use reformulate to generate the formula.
out <- nms %>%
sapply(reformulate, response = "mpg") %>%
map(lm, data = mtcars) %>%
map(~ data.frame(fit = fitted(.), resid = resid(.))) %>%
bind_rows(.id = "id")

performing linear regression with interaction term (purrr package)

I am analyzing the data using ANCOVA method and also trying to run several multiple linear regression from the purrr package.
The model should be:
m1 <- lm(PD~SR*Type, data = df (with interaction term)
m2 <- lm(PD~SR+Type, data = df (without interaction term)
My example code
library(broom)
library(purrr)
df <- data.frame(PD=c(10,20,30,40,50,10,20,33,12,52,21,43),
SR=c(5,10,20,24,6,21,59,22,1,11,12,3),
n=c("tree1", "tree1", "tree1",
"tree2", "tree2","tree2",
"tree3", "tree3", "tree3",
"tree4", "tree4","tree4"),
Type=c("a", "b",'c',
"a", "b",'c',
"a", "b",'c',
"a", "b",'c'))
x <- df %>%
nest(data = -n) %>%
mutate(fit = map(data, ~lm(PD~SR * Type, data= .x)),
tidied = map(fit, tidy),
glanced = map(fit, glance),
augmented = map(fit, augment))
tidy <- x %>% unnest(data) %>%
select(n, Type, PD, SR, tidied) %>%
unnest(tidied) %>%
select(n, Type, PD, SR, term, estimate) %>%
rename(, species_richness = SR)%>%
spread(term, estimate)
My problem is that I am expecting the intercept/slope is different to each Type.
However, the intercept/slope given by the code is the same for each Type.
You can check from tidy dataframe.
Any suggestions for this point? I am learning how to use the purrr package since it is quite powerfull.
or can u have other methods to do that?

multiple models: how to select best model and make prediction

My task is to create many models, choose model that predict best and pass data to this model for prediction. Example inspired from R for data science book
library(modelr)
library(tidyverse)
library(gapminder)
gapminder
country_model1 <- function(df) {lm(lifeExp ~ year, data = df)}
country_model2 <- function(df) {lm(lifeExp ~ year+gdpPercap, data = df)}
country_model3 <- function(df) {lm(lifeExp ~ year+gdpPercap+pop, data = df)}
by_country <- gapminder %>%
group_by(country, continent) %>%
nest() %>%
mutate(model1 = map(data, country_model1),
model2 = map(data, country_model2),
model3 = map(data, country_model3))
So I have 3 models for each country.
I can find r squared for each model, but stopped here :(
r_sq <- by_country %>%
mutate(glance1 = map(model1, broom::glance),
glance2 = map(model2, broom::glance),
glance3 = map(model3, broom::glance)) %>%
unnest(glance1:glance3, .drop = TRUE) %>%
select(country, continent, starts_with('r.sq'))
How to in tidy way:
select which of 3 make better prediction for each particular country?
pass new data to chosen model and have prediction back?
We can identify the model with the highest r^2 for for each country like this:
best_fits <- r_sq %>%
pivot_longer(-c(country, continent), names_to = "r_sq_version") %>%
group_by(country, continent) %>%
slice_max(value) %>%
ungroup()
Not too surprisingly, the third model (called here r.squared2 from its name in r_sq) consistently provides the highest correlation, since that model takes more inputs and has more degrees of freedom.
Let's make some new data, taking the original but adding 100 years to the dates.
by_country_new <- gapminder %>%
group_by(country, continent) %>%
mutate(year = year + 100,
gdpPercap = gdpPercap,
pop = pop) %>%
select(-lifeExp) %>% # Presumably we don't know this and are trying to predict using known data
nest()
We could then apply the best model for each country to the new data: (Thanks to #mrflick for https://stackoverflow.com/a/63201855/6851825)
best_fits %>%
left_join(by_country) %>%
left_join(by_country_new, by = c("country", "continent")) %>%
mutate(best_model = case_when(
r_sq_version == "r.squared2" ~ model3,
r_sq_version == "r.squared1" ~ model2,
r_sq_version == "r.squared" ~ model1,
)) %>%
select(-c(model1:model3)) %>%
mutate(prediction = map2(best_model, data.y,
~broom::augment(.x, newdata = .y))) -> new_fits
We can then see how these predictions look like a continuation of the time trend established in the original data (with some other variation due to changes in population and gdp in our new data).
new_predictions <- new_fits %>%
filter(country == "Afghanistan") %>%
select(prediction) %>%
unnest_wider(prediction) %>%
flatten_dfr() %>%
rename(lifeExp = ".fitted")
gapminder %>%
filter(country == "Afghanistan") %>%
bind_rows(new_predictions) %>%
ggplot(aes(year, lifeExp)) +
geom_point() +
labs(title = "Afghanistan extrapolated lifeExp")

Using map() in a split-apply-combine to run multiple regressions with regression weights

Say I have some data that looks like this:
N <- 200
X <- sample(letters[1:5],N, replace = T)
Y <- rnorm(N)
W <- abs(rnorm(N))
my_data <- tibble(X, Y, W)
I want to run an intercept-only regression on each subset of my data defined by X. To do so, I use nest(), mutate() and map() like so:
my_data %>%
group_by(X) %>%
nest() %>%
mutate(fit = map(data, ~lm(Y ~ 1, data = .x)))
While this code works, when I try to incorporate regression weights, like this:
my_data %>%
group_by(X) %>%
nest() %>%
mutate(fit = map(data, ~lm(Y ~ 1, data = .x, weights = W)))
I get the following error:
Error: Problem with `mutate()` input `fit`.
x missing or negative weights not allowed
ℹ Input `fit` is `map(data, ~lm(Y ~ 1, data = .x, weights = W))`.
ℹ The error occurred in group 1: X = "a".
Run `rlang::last_error()` to see where the error occurred.
Where am I going wrong?
(Disclaimer: I am new to the tidyverse so am likely doing something dumb)
Write it this way:
my_data %>%
group_by(X) %>%
nest() %>%
mutate(fit = map(data, ~lm(Y ~ 1, data = .x, weights = .x$W)))
The point is that at that point W is not an existing column anymore.
It exists only inside data. With map you are looping over data which at that point is a list of dataframes.
Therefore to call W, you need to call it through $
Also, define W as:
W <- abs(rnorm(N))
Because you can't have negative weights.
Alternatively, you can do it as follow:
my_data %>%
group_by(X) %>%
summarise(fit = list(lm(Y ~ 1, weights = W)))
It will give you the same result. [almost: because X will be sorted]
Check this out:
fit1 <- my_data %>%
arrange(X) %>%
group_by(X) %>%
nest() %>%
mutate(fit = map(data, ~lm(Y ~ 1, data = .x, weights = .x$W))) %>%
pull(fit)
fit2 <- my_data %>%
group_by(X) %>%
summarise(fit = list(lm(Y ~ 1, weights = W))) %>%
pull(fit)
identical(map(fit1, coef), map(fit2, coef))
#> TRUE
If you just need the coefficients, you can do it this way:
my_data %>%
group_by(X) %>%
summarise(fit = coef(lm(Y ~ 1, weights = W)))

Can group by and tidy be used to fit multilevel models (lme) to repeated measures/longitudinal data?

Consider the following example longitudinal/repeated measures dataset
library(tidyverse)
library(broom)
library(nlme)
data <- read.csv("https://stats.idre.ucla.edu/stat/data/study2.csv")
data <- data %>% mutate(dbp =rnorm(120, 30:150), sbp = rnorm(120, 50:200),bmi
= rnorm(120,15:40), chol = rnorm(120,50:350), insulin = rnorm(120,2:40), educ = rnorm(120,5:10))
I can use group_by %>% do(tidy(*)) to run several unadjusted and adjusted single-level regression models (looping through list of outcomes & exposures) and extract model results to the data frame as follows
out <-c("pulse","insulin","chol")
exp <- c("factor(exertype)","sbp","dbp")
conf <- c("bmi","factor(diet)")
#Unadjusted models - single level regression (lm)
#################################################
Unadjusted <- expand.grid(out, exp) %>%
group_by(Var1) %>% rowwise() %>%
summarise(frm = paste0(Var1, "~", Var2)) %>%
group_by(model_id = row_number(),frm) %>%
do(tidy(lm(.$frm, data = data))) %>%
mutate(lci = estimate-(1.96*std.error)) %>%
mutate(uci = estimate+(1.96*std.error))
#Adjusted models - single level regression (lm)
###############################################
Adjusted <- expand.grid(out, exp, conf) %>%
group_by(Var1, Var2) %>%
summarise(Var3 = paste0(Var3, collapse = "+")) %>%
rowwise() %>%
summarise(frm = paste0(Var1, "~", Var2, "+", Var3)) %>%
group_by(model_id = row_number(), frm) %>%
do(tidy(lm(.$frm, data = data))) %>%
mutate(lci = estimate-(1.96*std.error)) %>%
mutate(uci = estimate+(1.96*std.error))
I would like to use the same process to fit multilevel models to account for the repeated data. using the example code:
lme(sbp ~ pulse+factor(diet)+time, data=data, random= ~time|id, method ="ML")
HOWEVER, when I try to do this, e.g. using:
#Unadjusted models - multi-level regression (lme)
#################################################
Unadjusted <- expand.grid(out, exp) %>%
group_by(Var1) %>% rowwise() %>%
summarise(frm = paste0(Var1, "~", Var2)) %>%
group_by(model_id = row_number(),frm) %>%
do(tidy(lme(.$frm, data = data, random= ~time|id, method = "ML"))) %>%
mutate(lci = estimate-(1.96*std.error)) %>%
mutate(uci = estimate+(1.96*std.error))
I get the following error message:
Error in UseMethod("lme") : no applicable method for 'lme' applied to an object of class "character"
Any ideas on how to get this working for lme type models?
Unlike lm, lme will not accept a formula as a character value. You need to explicitly cast it to a formula. Try just adding as.formula()
do(tidy(lme(as.formula(.$frm), ...))) %>%

Resources