I have made a graph that displays the R², p-value, and regression equation from linear regressions in the top-left corner using stat_poly_eq.
Now I would like to extract the stats from those linear regressions into a table.
For example, with the mtcars dataset: if I want to run a linear regression of hp against disp for each cylinder group (4, 6, 8) and then extract the regression stats into a table, how could I do that?
Thanks!
Here's the graph I have:
library(ggplot2)
library(ggpmisc)

formula <- y ~ x

ggplot(mtcars, aes(disp, hp)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  theme_bw() +
  facet_wrap(~cyl, scales = "free") +
  stat_poly_eq(
    aes(label = paste(stat(adj.rr.label), stat(eq.label), stat(p.value.label),
                      sep = "*\", \"*")),
    formula = formula, parse = TRUE, size = 3
  )
Do you mean something like this?
1. With nest_by(), split the remaining columns into separate tibbles, one per cyl.
2. With summarise(), fit each lm(). You need to wrap the model in a list().
3. Operate on it like a normal list with map() and compute what you need: the coefficients (extractable with broom::tidy) and the adjusted R-squared (with summary(.)$adj.r.squared).
4. unnest() the broom::tidy result to get a single tibble.
library(dplyr)
library(tidyr)
library(purrr)

mtcars %>%
  nest_by(cyl) %>%
  summarise(mdl = list(lm(hp ~ disp, data)), .groups = "drop") %>%
  mutate(adjrsquared = map_dbl(mdl, ~ summary(.)$adj.r.squared),
         mdl = map(mdl, broom::tidy)) %>%
  unnest(mdl)
#> # A tibble: 6 x 7
#>     cyl term        estimate std.error statistic p.value adjrsquared
#>   <dbl> <chr>          <dbl>     <dbl>     <dbl>   <dbl>       <dbl>
#> 1     4 (Intercept)  47.0       25.3       1.86  0.0960       0.0988
#> 2     4 disp          0.339      0.234     1.45  0.182        0.0988
#> 3     6 (Intercept) 177.        42.0       4.22  0.00829      0.117
#> 4     6 disp         -0.300      0.224    -1.34  0.238        0.117
#> 5     8 (Intercept) 178.        77.4       2.30  0.0405      -0.0682
#> 6     8 disp          0.0890     0.216     0.413 0.687       -0.0682
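If you only need model-level statistics rather than per-term coefficients, broom::glance() returns one row per model (r.squared, adj.r.squared, sigma, the overall F statistic and its p-value, AIC, and so on). A minimal sketch along the same lines, with the same packages loaded:

mtcars %>%
  nest_by(cyl) %>%
  summarise(mdl = list(lm(hp ~ disp, data)), .groups = "drop") %>%
  mutate(stats = map(mdl, broom::glance)) %>%
  select(-mdl) %>%
  unnest(stats)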
After doing propensity score (PS) matching, I'm running a Poisson model like so:
model <- glm(outcome ~ x1 + x2 + x3 ...,
             data = d,
             weights = psweights$weights,
             family = "poisson")
I then want to create a new data frame with the variable names, coefficients, and upper and lower confidence limits. Just doing:
d2 <- summary(model)$coef
gets me the variable names, coefficients, standard errors, and z-values. What is the easiest way to compute the confidence intervals, convert them into columns, and bind it all into one data frame?
How about this, using the broom package:
library(broom)

mod <- glm(hp ~ disp + drat + cyl, data = mtcars, family = poisson)
tidy(mod, conf.int = TRUE)
#> # A tibble: 4 × 7
#>   term        estimate std.error statistic  p.value conf.low conf.high
#>   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
#> 1 (Intercept) 2.40      0.196        12.3  1.30e-34 2.02       2.79
#> 2 disp        0.000766  0.000259      2.96 3.07e- 3 0.000258   0.00127
#> 3 drat        0.240     0.0386        6.22 4.89e-10 0.164      0.315
#> 4 cyl         0.236     0.0195       12.1  1.21e-33 0.198      0.274
Created on 2022-06-30 by the reprex package (v2.0.1)
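For completeness, a base-R sketch of the same result without broom (my addition, not part of the original answer). confint() on a glm profiles the likelihood, which is also what tidy(conf.int = TRUE) does under the hood, so the intervals match:

mod <- glm(hp ~ disp + drat + cyl, data = mtcars, family = poisson)
ci <- confint(mod)  # prints "Waiting for profiling to be done..."
d2 <- data.frame(term      = rownames(ci),
                 estimate  = coef(mod),
                 conf.low  = ci[, 1],
                 conf.high = ci[, 2],
                 row.names = NULL)
d2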
Using the mtcars dataset as an example: the goal is to write a function that runs multiple regression models with changing independent variables and changing dependent variables.
In the code I wrote (below), var holds the independent variables and mpg is the dependent variable. I used map to run the regression repeatedly, with vs and am as the changing independent variable each time.
library(dplyr)
library(purrr)
library(broom)

var <- c("vs", "am")

mtcars %>%
  select(all_of(var)) %>%
  map(~ glm(mpg ~ .x + cyl + disp + splines::ns(wt, 2) + hp,
            family = gaussian(link = "identity"),
            data = mtcars)) %>%
  map_dfr(tidy, conf.int = TRUE, .id = "source") %>%
  select(source, term, estimate, std.error, conf.low, conf.high, p.value)
I would like to run the same regression with a different set of independent variables, and also with a dependent variable I can specify (e.g., I used mpg above, and I would like to change it to qsec or some other variable). So I envision a function like this:
function_name <- function(x, y, dataset) {
  dataset %>%
    select(all_of(x)) %>%
    map(~ glm(y ~ .x + cyl + disp + splines::ns(wt, 2) + hp,
              family = gaussian(link = "identity"),
              data = dataset)) %>%
    map_dfr(tidy, conf.int = TRUE, .id = "source") %>%
    select(source, term, estimate, std.error, conf.low, conf.high, p.value)
}
But this function didn't work. Any suggestions?
You could achieve your desired result like so:
The issue with your code is that y ~ ... will not work: inside a formula, y is taken literally as a variable name instead of being substituted. Use reformulate() (or as.formula()) to build the formula for your regression model dynamically.
To make this work, loop directly over the character vector x, or more precisely setNames(x, x), instead of looping over dataset %>% select(all_of(x)).
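For intuition, reformulate() simply assembles a formula object from character pieces; a quick illustrative call (not part of the original answer):

reformulate(termlabels = c("vs", "cyl", "disp", "splines::ns(wt, 2)", "hp"),
            response = "mpg")
#> mpg ~ vs + cyl + disp + splines::ns(wt, 2) + hp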
library(dplyr)
library(purrr)
library(broom)
function_name <- function(x, y, dataset) {
  map(setNames(x, x),
      ~ glm(reformulate(termlabels = c(.x, "cyl", "disp",
                                       "splines::ns(wt, 2)", "hp"),
                        response = y),
            family = gaussian(link = "identity"),
            data = dataset)) %>%
    map_dfr(tidy, conf.int = TRUE, .id = "source") %>%
    select(source, term, estimate, std.error, conf.low, conf.high, p.value)
}
var <- c("vs", "am")
function_name(x = var, y = "mpg", mtcars)
#> # A tibble: 14 × 7
#>    source term                 estimate std.error conf.low conf.high  p.value
#>    <chr>  <chr>                   <dbl>     <dbl>    <dbl>     <dbl>    <dbl>
#>  1 vs     (Intercept)         32.7         3.49    25.8      39.5    1.24e- 9
#>  2 vs     vs                   1.03        1.52    -1.95      4.01   5.05e- 1
#>  3 vs     cyl                 -0.187       0.821   -1.80      1.42   8.21e- 1
#>  4 vs     disp                 0.000545    0.0119  -0.0228    0.0239 9.64e- 1
#>  5 vs     splines::ns(wt, 2)1 -22.4        4.82   -31.9     -13.0    9.02e- 5
#>  6 vs     splines::ns(wt, 2)2  -9.48       3.16   -15.7      -3.28   6.09e- 3
#>  7 vs     hp                   -0.0202     0.0115  -0.0427    0.00226 9.02e- 2
#>  8 am     (Intercept)         34.6         2.65    29.4      39.8    1.15e-12
#>  9 am     am                   0.0113      1.57    -3.06      3.08   9.94e- 1
#> 10 am     cyl                 -0.470       0.714   -1.87      0.931  5.17e- 1
#> 11 am     disp                 0.000796    0.0125  -0.0236    0.0252 9.50e- 1
#> 12 am     splines::ns(wt, 2)1 -21.5        5.86   -33.0     -10.0    1.14e- 3
#> 13 am     splines::ns(wt, 2)2  -9.21       3.34   -15.8      -2.66   1.07e- 2
#> 14 am     hp                   -0.0214     0.0136  -0.0480    0.00527 1.28e- 1
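Because the response is now passed as a string, swapping in another dependent variable (e.g. qsec, as mentioned in the question) is a one-liner:

function_name(x = var, y = "qsec", mtcars)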
I am trying to run 1104 linear regressions with the same model. My independent variables do not change, but my dependent variable does: I have 1104 dependent variables. I do not know how to extract all the coefficients (intercepts included) and p-values in order to compute the mean of each (coefficients and p-values). Is there an easy way to do that? This is my model:
testMCFG1 <- lapply(101:1204, function(i)
  lm(recexp[, i] ~ recexp[, "rm"] + recexp[, "zdy"] + recexp[, "ztbl"] +
       recexp[, "ztms"] + recexp[, "zdfy"] + recexp[, "rm_zdy"] +
       recexp[, "rm_ztbl"] + recexp[, "rm_ztms"] + recexp[, "rm_zdfy"] +
       recexp[, "contexte"] + recexp[, "rm_contexte"]))
However, someone here has already shown me how to do this for the case of a single, unchanging independent variable, and that works. Here is the code for that case:
y <- 'rm'
x <- names(recexp[101:1204])

models <- map(setNames(x, x),
              ~ lm(as.formula(paste(.x, y, sep = "~")),
                   data = recexp))

pvalues <-
  data.frame(rsquared = unlist(map(models, ~ summary(.)$r.squared)),
             RSE = unlist(map(models, ~ summary(.)$sigma))) %>%
  rownames_to_column(var = "which_dependent")

results <- full_join(basic_information, pvalues)

results %>% group_by(term) %>% summarise(mean_estimate = mean(estimate))
results %>% group_by(term) %>% summarise(mean_p = mean(p.value))
Here is a solution using several tidyverse packages. You don't provide your data, so I'll use mtcars as an example. Put your independent variables into a fixed string called independents, and grab your dependents with a slice (as your own code does), producing a character vector:
independents <- 'mpg + vs + am + gear'
dependent <- names(mtcars[2:7])
Load the libraries
library(dplyr)
library(purrr)
library(broom)
library(tidyr)
library(tibble)
Make a list of all the models using purrr::map
models <- map(setNames(dependent, dependent),
              ~ lm(as.formula(paste(.x, independents, sep = "~")),
                   data = mtcars))
Take that list of lm models and feed it to broom::tidy to extract the basic information: beta estimates, p-values, and so on. To keep it neat, take the name of each list item (which is the dependent variable) and add it as a column. Remove the parens from "(Intercept)" and prepend a zero so it always sorts first and you know it's beta0:
basics <-
  map(models, ~ broom::tidy(.)) %>%
  map2_df(.,
          names(.),
          ~ mutate(.x, which_dependent = .y)) %>%
  select(which_dependent, everything()) %>%
  mutate(term = gsub("\\(Intercept\\)", "0Intercept", term))
Feed the list in again, this time extracting the R-squared and sigma, a.k.a. the "residual standard error":
model_summary <-
  data.frame(rsquared = unlist(map(models, ~ summary(.)$r.squared)),
             RSE = unlist(map(models, ~ summary(.)$sigma))) %>%
  rownames_to_column(var = "which_dependent")
Join the two based on which dependent variable
results <- full_join(basics, model_summary)
#> Joining, by = "which_dependent"
results
#> # A tibble: 30 x 8
#>    which_dependent term    estimate std.error statistic  p.value rsquared    RSE
#>    <chr>           <chr>      <dbl>     <dbl>     <dbl>    <dbl>    <dbl>  <dbl>
#>  1 cyl             0Inter…  10.4       1.14       9.13  9.58e-10    0.861  0.714
#>  2 cyl             mpg      -0.117     0.0382    -3.06  4.98e- 3    0.861  0.714
#>  3 cyl             vs       -1.80      0.374     -4.81  5.09e- 5    0.861  0.714
#>  4 cyl             am       -0.414     0.502     -0.826 4.16e- 1    0.861  0.714
#>  5 cyl             gear     -0.258     0.290     -0.891 3.81e- 1    0.861  0.714
#>  6 disp            0Inter… 571.       94.1        6.07  1.76e- 6    0.804 58.8
#>  7 disp            mpg      -9.50      3.14      -3.02  5.47e- 3    0.804 58.8
#>  8 disp            vs      -85.9      30.8       -2.79  9.49e- 3    0.804 58.8
#>  9 disp            am      -31.9      41.3       -0.774 4.45e- 1    0.804 58.8
#> 10 disp            gear    -26.8      23.9       -1.12  2.71e- 1    0.804 58.8
#> # … with 20 more rows
It's in long format, so you can do things like summarise grouped by term:
results %>%
  group_by(term) %>%
  summarise(mean_p = mean(p.value)) %>%
  arrange(term)
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 5 x 2
#>   term         mean_p
#>   <chr>         <dbl>
#> 1 0Intercept 0.000168
#> 2 am         0.359
#> 3 gear       0.287
#> 4 mpg        0.0538
#> 5 vs         0.159
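The same pattern gives the mean coefficient estimates the question also asked for:

results %>%
  group_by(term) %>%
  summarise(mean_estimate = mean(estimate)) %>%
  arrange(term)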
Or you can make it wider if you prefer...
wide_results <-
  results %>%
  pivot_wider(names_from = term,
              values_from = estimate:p.value)
wide_results
#> # A tibble: 6 x 23
#>   which_dependent rsquared    RSE estimate_0Inter… estimate_mpg estimate_vs
#>   <chr>              <dbl>  <dbl>            <dbl>        <dbl>       <dbl>
#> 1 cyl                0.861  0.714            10.4       -0.117       -1.80
#> 2 disp               0.804 58.8             571.        -9.50       -85.9
#> 3 hp                 0.736 37.7             241.        -8.17       -41.4
#> 4 drat               0.667  0.331             2.07       0.0228       0.166
#> 5 wt                 0.804  0.464             5.90      -0.104       -0.146
#> 6 qsec               0.734  0.988            17.5        0.0894       2.29
#> # … with 17 more variables: estimate_am <dbl>, estimate_gear <dbl>,
#> #   std.error_0Intercept <dbl>, std.error_mpg <dbl>, std.error_vs <dbl>,
#> #   std.error_am <dbl>, std.error_gear <dbl>, statistic_0Intercept <dbl>,
#> #   statistic_mpg <dbl>, statistic_vs <dbl>, statistic_am <dbl>,
#> #   statistic_gear <dbl>, p.value_0Intercept <dbl>, p.value_mpg <dbl>,
#> #   p.value_vs <dbl>, p.value_am <dbl>, p.value_gear <dbl>
names(wide_results)
#>  [1] "which_dependent"      "rsquared"             "RSE"
#>  [4] "estimate_0Intercept"  "estimate_mpg"         "estimate_vs"
#>  [7] "estimate_am"          "estimate_gear"        "std.error_0Intercept"
#> [10] "std.error_mpg"        "std.error_vs"         "std.error_am"
#> [13] "std.error_gear"       "statistic_0Intercept" "statistic_mpg"
#> [16] "statistic_vs"         "statistic_am"         "statistic_gear"
#> [19] "p.value_0Intercept"   "p.value_mpg"          "p.value_vs"
#> [22] "p.value_am"           "p.value_gear"
I am currently struggling to run weighted regression models on multiple variables in R.
When using (unweighted) glm, I was successful with the following:
mtcars_1 <- mtcars %>%
  nest(-gear) %>%
  mutate(model_0 = map(data, ~ glm(vs ~ drat, family = "binomial", data = .))) %>%
  mutate(model_0_tidy = map(model_0, tidy)) %>%
  select(gear, model_0_tidy) %>%
  ungroup() %>%
  unnest(model_0_tidy)
That is, I receive the following:
# A tibble: 6 x 6
   gear term        estimate std.error statistic p.value
  <dbl> <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1     4 (Intercept)  -15.3      22.6      -0.677   0.499
2     4 drat           4.26      5.76      0.740   0.459
3     3 (Intercept)   -3.91      7.39     -0.529   0.597
4     3 drat           0.801     2.32      0.345   0.730
5     5 (Intercept)    5.20     14.4       0.362   0.718
6     5 drat          -1.71      3.77     -0.453   0.651
However, when I want to weight my observations and thus use svyglm from the survey package, nesting does not work.
This was my approach:
design_0 <- svydesign(ids = ~0, data = mtcars, weights = mtcars$wt)
mtcars_2 <- mtcars %>%
  nest(-gear) %>%
  mutate(model_1 = map(data, ~ svyglm(vs ~ drat, family = quasibinomial(logit),
                                      design = design_0, data = .))) %>%
  mutate(model_1_tidy = map(model_1, tidy)) %>%
  select(gear, model_1_tidy) %>%
  ungroup() %>%
  unnest(model_1_tidy)
# (assuming that wt serves as a frequency weight)
# Outcome:
   gear term        estimate std.error statistic p.value
  <dbl> <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1     4 (Intercept)    -8.12      3.88     -2.09  0.0451
2     4 drat            2.12      1.07      1.99  0.0554
3     3 (Intercept)    -8.12      3.88     -2.09  0.0451
4     3 drat            2.12      1.07      1.99  0.0554
5     5 (Intercept)    -8.12      3.88     -2.09  0.0451
6     5 drat            2.12      1.07      1.99  0.0554
The estimates for each gear group (that is, 3, 4, 5) turn out to be the same.
It appears as if the nesting was essentially ignored here.
Are there any solutions for combining svyglm with nest-map-unnest? Or will I have to look for other, less comfortable ways?
Thank you!
Try it this way: build the survey design inside each nested group, so that each svyglm() call sees only that group's rows and weights. (In your code, design_0 was built once from the full dataset, which is why every group returned the same fit.)
library(dplyr)
library(tidyr)
library(purrr)
library(broom)
library(survey)

mtcars %>%
  nest(-gear) %>%
  mutate(design = map(data, ~ svydesign(ids = ~0, data = .x, weights = ~wt)),
         model = map(.x = design,
                     .f = ~ svyglm(vs ~ drat,
                                   family = quasibinomial(logit),
                                   design = .x))) %>%
  mutate(model_tidy = map(model, tidy)) %>%
  select(gear, model_tidy) %>%
  ungroup() %>%
  unnest(model_tidy)
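As a side note (my addition, not part of the original answer): the survey package's usual way to handle subpopulations is to subset the design object itself, which keeps the full design information while restricting estimation to one group. A sketch of that alternative:

library(survey)
library(purrr)
library(broom)

design_0 <- svydesign(ids = ~0, data = mtcars, weights = ~wt)

map_dfr(set_names(sort(unique(mtcars$gear))),
        ~ tidy(svyglm(vs ~ drat,
                      family = quasibinomial(logit),
                      design = subset(design_0, gear == .x))),
        .id = "gear")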
I have the following code that computes, for every year between 1961 and 2018, the effects of two predictor variables, base on balls per game (BB) and home runs per game (HR), on the response variable runs per game (R):
rm(list = ls())
library(dplyr)
library(tidyverse)
library(broom)
library(Lahman)

fit <- Teams %>%
  filter(yearID %in% 1961:2018) %>%
  mutate(BB = BB / G,
         HR = HR / G,
         R = R / G) %>%
  group_by(yearID) %>%
  do(tidy(lm(R ~ BB + HR, data = .), conf.int = TRUE)) %>%
  filter(term == "BB")
fit
# A tibble: 58 x 8
# Groups:   yearID [58]
   yearID term  estimate std.error statistic p.value conf.low conf.high
    <int> <chr>    <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
 1   1961 BB      0.0845     0.168     0.502   0.623  -0.274      0.443
 2   1962 BB      0.142      0.273     0.520   0.610  -0.434      0.718
 3   1963 BB      0.339      0.242     1.40    0.178  -0.171      0.849
 4   1964 BB     -0.105      0.302    -0.349   0.731  -0.742      0.532
 5   1965 BB      0.235      0.253     0.928   0.366  -0.299      0.768
 6   1966 BB      0.104      0.216     0.482   0.636  -0.351      0.559
 7   1967 BB      0.0660     0.223     0.296   0.771  -0.405      0.537
 8   1968 BB     -0.199      0.203    -0.983   0.340  -0.627      0.229
 9   1969 BB      0.153      0.163     0.942   0.357  -0.185      0.492
10   1970 BB      0.239      0.157     1.52    0.143  -0.0874     0.566
# ... with 48 more rows
I would now like to feed this fit, which is a tibble (a modernized data frame), into ggplot to show the estimates per year as points, but also the regression line along with the CIs computed by the lm model, not simply recomputed with geom_smooth(method = "lm").
I have tried the following without success. I know that broom's augment() should operate on the lm model output directly, so the following code is wrong, but it illustrates what I'm trying to achieve:
augment(fit) %>%
  ggplot() +
  geom_point(aes(yearID, estimate)) +
  geom_line(aes(yearID, .fitted), col = "blue")
How can I do that without "cheating", i.e. without fitting the lm twice (once myself and once more inside ggplot), as in:
fit %>% ggplot(aes(yearID, estimate)) + geom_point() + geom_smooth(method = "lm")
I took a similar route to Patrick, using map() and nest():
library(tidyverse)
library(broom)
library(Lahman)
library(magrittr)

fit <- Teams %>%
  filter(yearID %in% 1961:2018) %>%
  mutate(
    BB = BB / G,
    HR = HR / G,
    R = R / G
  ) %>%
  nest(data = -yearID) %>%
  mutate(
    model = map(data, ~ lm(R ~ BB + HR, .x)), # apply the model to all nested groups
    m_tidy = map(model, tidy),                # tidy up
    est = map_dbl(m_tidy, ~ .x %>%            # pull the BB estimate from each group
                    filter(term == "BB") %>%
                    pull(estimate))
  )
At this point you could pipe straight into the next block with %$% (magrittr's exposition pipe), but I've kept them separate here so we can talk about mimicking the confidence interval properly. The geom_smooth() confidence interval is based on the t-distribution, not the normal distribution, so we have to do a bit of extra work to get our intervals to match:
fit %$%
  lm(est ~ yearID) %>%
  augment() %>%
  mutate(m.se.fit = .se.fit * qt(1 - (1 - 0.95) / 2, nrow(fit))) %>% # 95% conf int calc
  ggplot(aes(yearID, est)) +
  geom_point() +
  geom_line(aes(y = .fitted), col = "blue") +
  geom_ribbon(aes(ymin = .fitted - m.se.fit, ymax = .fitted + m.se.fit), alpha = .2)
This plot essentially mirrors the desired plot:
fit %>%
  ggplot(aes(yearID, est)) +
  geom_point() +
  geom_smooth(method = "lm")
Created on 2019-10-23 by the reprex package (v0.3.0)
You can try the map functions from the purrr package, which is included in the tidyverse. Possible code for your described problem is listed below. It should also be possible with lapply if you are not that familiar with purrr.
library(tidyverse)
library(broom)
library(Lahman)

fit <- Teams %>%
  filter(yearID %in% 1961:2018) %>%
  mutate(BB = BB / G,
         HR = HR / G,
         R = R / G) %>%
  group_by(yearID) %>%
  # consolidate your data
  nest() %>%
  # create a new nested column with your regression data
  mutate(model = map(data, function(df)
           tidy(lm(R ~ BB + HR, data = df), conf.int = TRUE) %>%
             filter(term == "BB")),
         # extract the column estimate
         model_est = map_dbl(model, function(df)
           df %>% pull(estimate)),
         # extract the column conf.low
         model_conf.low = map_dbl(model, function(df)
           df %>% pull(conf.low)),
         # extract the column conf.high
         model_conf.high = map_dbl(model, function(df)
           df %>% pull(conf.high)))
fit %>%
  ggplot(aes(yearID, model_est)) +
  geom_point() +
  geom_line(aes(yearID, model_conf.low)) +
  geom_line(aes(yearID, model_conf.high))
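If you prefer a shaded band to two separate lines (a variation on the answer above, not in the original), geom_ribbon() works on the same columns:

fit %>%
  ggplot(aes(yearID, model_est)) +
  geom_point() +
  geom_ribbon(aes(ymin = model_conf.low, ymax = model_conf.high), alpha = 0.2)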