R: print the equation of a linear regression on the plot itself

How do we print the equation of a line on a plot?
I have 2 independent variables and would like an equation like this:
y=mx1+bx2+c
where x1=cost, x2 =targeting
I can plot the best-fit line, but how do I print the equation on the plot?
Maybe I can't print the 2 independent variables in one equation, but how do I do it for, say,
y=mx1+c at least?
Here is my code:
fit=lm(Signups ~ cost + targeting)
plot(cost, Signups, xlab="cost", ylab="Signups", main="Signups")
abline(lm(Signups ~ cost))

I tried to automate the output a bit:
fit <- lm(mpg ~ cyl + hp, data = mtcars)
summary(fit)
##Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.90833 2.19080 16.847 < 2e-16 ***
## cyl -2.26469 0.57589 -3.933 0.00048 ***
## hp -0.01912 0.01500 -1.275 0.21253
plot(mpg ~ cyl, data = mtcars, xlab = "Cylinders", ylab = "Miles per gallon")
abline(coef(fit)[1:2])
## rounded coefficients for better output
cf <- round(coef(fit), 2)
## sign check to avoid having plus followed by minus for negative coefficients
eq <- paste0("mpg = ", cf[1],
ifelse(sign(cf[2])==1, " + ", " - "), abs(cf[2]), " cyl ",
ifelse(sign(cf[3])==1, " + ", " - "), abs(cf[3]), " hp")
## printing of the equation
mtext(eq, 3, line=-2)
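The sign handling above can be generalized to any number of predictors. The following is a minimal sketch in base R; `equation_label` is a hypothetical helper name (not from any package), and it assumes the first coefficient of the model is the intercept.

```r
# Hypothetical helper: build "y = b0 + b1 x1 - b2 x2 ..." from a fitted lm.
# Assumes the model has an intercept as its first coefficient.
equation_label <- function(fit, digits = 2) {
  cf <- round(coef(fit), digits)
  response <- all.vars(formula(fit))[1]   # name of the response variable
  slopes <- cf[-1]                        # every coefficient except the intercept
  # Put the sign between terms and print the absolute value of each slope
  slope_terms <- paste0(ifelse(slopes < 0, " - ", " + "),
                        abs(slopes), " ", names(slopes))
  paste0(response, " = ", cf[1], paste(slope_terms, collapse = ""))
}

fit <- lm(mpg ~ cyl + hp, data = mtcars)
plot(mpg ~ cyl, data = mtcars, xlab = "Cylinders", ylab = "Miles per gallon")
mtext(equation_label(fit), side = 3, line = -2)
```

Because the helper loops over `names(slopes)`, the same call works for one predictor or several without editing the `paste0()` chain by hand.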
Hope it helps,
alex

You can use text() (see ?text). In addition, you should not use abline(lm(Signups ~ cost)), as this is a different model (see my answer on CV here: Is there a difference between 'controling for' and 'ignoring' other variables in multiple regression). At any rate, consider:
set.seed(1)
Signups <- rnorm(20)
cost <- rnorm(20)
targeting <- rnorm(20)
fit <- lm(Signups ~ cost + targeting)
summary(fit)
# ...
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.1494 0.2072 0.721 0.481
# cost -0.1516 0.2504 -0.605 0.553
# targeting 0.2894 0.2695 1.074 0.298
# ...
dev.new();{  # portable replacement for windows(), which exists only on Windows
plot(cost, Signups, xlab="cost", ylab="Signups", main="Signups")
abline(coef(fit)[1:2])
text(-2, -2, adj=c(0,0), labels="Signups = .15 -.15cost + .29targeting")
}

Here's a solution using tidyverse packages.
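The key is the broom package, which simplifies the process of extracting model data. For example:
library(dplyr)    # %>%, select(), mutate(), etc.
library(broom)    # tidy()
library(ggplot2)  # used for the plots further down
fit1 <- lm(mpg ~ cyl, data = mtcars)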
summary(fit1)
fit1 %>%
tidy() %>%
select(estimate, term)
Result
# A tibble: 2 x 2
estimate term
<dbl> <chr>
1 37.9 (Intercept)
2 -2.88 cyl
I wrote a function to extract and format the information using dplyr:
get_formula <- function(object) {
object %>%
tidy() %>%
mutate(
term = if_else(term == "(Intercept)", "", term),
sign = case_when(
term == "" ~ "",
estimate < 0 ~ "-",
estimate >= 0 ~ "+"
),
estimate = as.character(round(abs(estimate), digits = 2)),
term = if_else(term == "", paste(sign, estimate), paste(sign, estimate, term))
) %>%
summarize(terms = paste(term, collapse = " ")) %>%
pull(terms)
}
get_formula(fit1)
Result
[1] " 37.88 - 2.88 cyl"
Then use ggplot2 to plot the line and add a caption
mtcars %>%
ggplot(mapping = aes(x = cyl, y = mpg)) +
geom_point() +
geom_smooth(formula = y ~ x, method = "lm", se = FALSE) +
labs(
x = "Cylinders", y = "Miles per Gallon",
caption = paste("mpg =", get_formula(fit1))
)
Plot using geom_smooth()
This approach of plotting a line really only makes sense for visualizing the relationship between two variables. As @Glen_b pointed out in a comment, the slope we get from modelling mpg as a function of cyl (-2.88) doesn't match the slope we get from modelling mpg as a function of cyl and other variables (-1.29). For example:
fit2 <- lm(mpg ~ cyl + disp + wt + hp, data = mtcars)
summary(fit2)
fit2 %>%
tidy() %>%
select(estimate, term)
Result
# A tibble: 5 x 2
estimate term
<dbl> <chr>
1 40.8 (Intercept)
2 -1.29 cyl
3 0.0116 disp
4 -3.85 wt
5 -0.0205 hp
That said, if you want to accurately plot the regression line for a model that includes variables that don't appear in the plot, use geom_abline() instead and get the slope and intercept using broom package functions. As far as I know, geom_smooth() formulas can't reference variables that aren't already mapped as aesthetics.
mtcars %>%
ggplot(mapping = aes(x = cyl, y = mpg)) +
geom_point() +
geom_abline(
slope = fit2 %>% tidy() %>% filter(term == "cyl") %>% pull(estimate),
intercept = fit2 %>% tidy() %>% filter(term == "(Intercept)") %>% pull(estimate),
color = "blue"
) +
labs(
x = "Cylinders", y = "Miles per Gallon",
caption = paste("mpg =", get_formula(fit2))
)
Plot using geom_abline()
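One caveat with the raw intercept: for fit2 it is the predicted mpg when disp, wt and hp are all zero, so a line drawn with it can sit far from the points. A common alternative, sketched here in base R as an assumption rather than something the answer proposes, is to hold the other predictors at their sample means and fold them into the intercept. The names `others` and `adj_intercept` are illustrative only.

```r
fit2 <- lm(mpg ~ cyl + disp + wt + hp, data = mtcars)
cf <- coef(fit2)

# Predictors that are in the model but not on the x axis
others <- c("disp", "wt", "hp")

# Intercept of the cyl line when the other predictors are fixed at their means
adj_intercept <- unname(cf["(Intercept)"] +
                        sum(cf[others] * colMeans(mtcars[others])))
slope <- unname(cf["cyl"])

plot(mpg ~ cyl, data = mtcars, xlab = "Cylinders", ylab = "Miles per gallon")
abline(a = adj_intercept, b = slope, col = "blue")
```

With this intercept, the line passes through the point (mean cyl, mean mpg), which is a property of least-squares fits that include an intercept. The slope is still the partial slope for cyl from the full model.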

Related

How to modify variable labels in gtsummary table

As recommended in the tutorial for gtsummary's tbl_regression function, I am using the labelled package to assign attribute labels to my regression variables. However, when my regression formula includes a quadratic term, the resulting table includes the same variable label twice:
library(gtsummary)
library(labelled)
library(tidyverse)
df <- as_tibble(mtcars)
var_label(df) <- list( disp = "Displacement", vs = "Engine type")
c("disp", "disp + I(disp^2)") %>%
map(
~ paste("vs", .x, sep = " ~ ") %>%
as.formula() %>%
glm(data = df,
family = binomial(link = "logit")) %>%
tbl_regression(exponentiate = TRUE)) %>%
tbl_merge()
Is there a way to modify the label for the quadratic term in this case?
If you assign the label inside the tbl_regression() function, you'll see what you want to get.
library(gtsummary)
c("disp", "disp + I(disp^2)") %>%
purrr::map(
~ paste("vs", .x, sep = " ~ ") %>%
as.formula() %>%
glm(data = mtcars, family = binomial(link = "logit")) %>%
tbl_regression(
exponentiate = TRUE,
label = list(
disp = "Displacement",
`I(disp^2)` = "Displacement^2"
)
)
) %>%
tbl_merge() %>%
as_kable()
#> ✖ `I(disp^2)` terms have not been found in `x`.
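|Characteristic | OR   | 95% CI     | p-value | OR   | 95% CI     | p-value |
|:--------------|:----:|:----------:|:-------:|:----:|:----------:|:-------:|
|Displacement   | 0.98 | 0.96, 0.99 | 0.002   | 0.99 | 0.92, 1.07 | 0.8     |
|Displacement^2 |      |            |         | 1.00 | 1.00, 1.00 | 0.8     |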
Created on 2022-09-19 with reprex v2.0.2

Extracting the T Statistic from a function in R

I have this function that I got from a textbook that runs a couple of linear regressions and then saves the P-Value for each regression.
I would also like to save the T-Statistic as well but I am having a hard time finding the right syntax to enter for the select function.
Here is the current function.
models <- lapply(paste(factors, ' ~ a + b + c + d + e + f + g + h+ j -',factors),
function(f){ lm(as.formula(f), data = df) %>% # Call lm(.)
summary() %>% # Gather the output
"$"(coef) %>% # Keep only the coefs
data.frame() %>% # Convert to dataframe
filter(rownames(.) == "(Intercept)") %>% # Keep only the Intercept
dplyr::select(Estimate,`Pr...t..`)}) # Keep the coef & p-value
I know that I have to change the very last part of the function: dplyr::select(Estimate,`Pr...t..`) but after all my research and trial and error I am still stuck.
Here is a reproducible example using the mtcars data.
library(dplyr)
df <- mtcars
df <- df %>%
select(1,2,3,4,5,6,7)
factors <- c("mpg", "cyl", "disp", "hp", "drat", "wt")
models <- lapply(paste(factors, ' ~ mpg + cyl + disp + hp + drat + wt -',factors),
function(f){ lm(as.formula(f), data = df) %>% # Call lm(.)
summary() %>% # Gather the output
"$"(coef) %>% # Keep only the coefs
data.frame() %>% # Convert to dataframe
filter(rownames(.) == "(Intercept)") %>% # Keep only the Intercept
dplyr::select(Estimate,`Pr...t..`)} # Keep the coef & p-value
)
final <- matrix(unlist(models), ncol = 2, byrow = T) %>% # Switch from list to dataframe
data.frame(row.names = factors)
Your example works for me. You can make this a little bit more "tidy" as follows:
library(broom)
sumfun <- function(f) {
lm(as.formula(f), data = df) %>%
tidy() %>%
filter(term == "(Intercept)") %>%
dplyr::select(estimate, p.value)
}
pp <- paste(factors, ' ~ mpg + cyl + disp + hp + drat + wt -',factors)
names(pp) <- factors
final <- purrr::map_dfr(pp, sumfun, .id = "factor")
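For the t-statistic specifically: when the coefficient matrix goes through data.frame(), R's name checking renames its columns, so `t value` becomes `t.value` in the same way that `Pr(>|t|)` becomes `Pr...t..`. A minimal base R sketch (the model below is only an illustration, not the asker's data):

```r
# Coefficient matrix of an example model, converted the same way as in the question
coefs <- data.frame(summary(lm(mpg ~ wt + hp, data = mtcars))$coefficients)

# Column names after data.frame() has sanitized them
names(coefs)

# Keep the estimate, t-statistic and p-value of the intercept row
intercept_row <- coefs["(Intercept)", c("Estimate", "t.value", "Pr...t..")]
intercept_row
```

In the pipeline from the question this would amount to adding `t.value` to the dplyr::select() call (and widening the result matrix to three columns). In broom's tidy() output the same quantity is the `statistic` column.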

Add geom_smooth to ggplot facets conditionally based on p-value

I'm using ggplot to visualize many linear regressions and facet them by groups. I'd like geom_smooth() to show the trend line as one color if P < 0.05, a different color if P < 0.10, and not show it at all if P ≥ 0.10.
I managed to do this using a loop to extract P-values from lm() for each regression, then join them with the data used for plotting. Then I add another column of color names to pass to aes(), determined conditionally from the P-values, and use scale_color_identity() to achieve my goal.
Here's an example:
library(tidyverse)
#make mtcars a tibble and cyl a factor, for convenience
mtcars1 <- as_tibble(mtcars) %>% dplyr::mutate(cyl = as.factor(cyl))
#initialize a list to store p-values from lm() for each level of factor
p.list <- vector(mode = "list", length = length(levels(mtcars1$cyl)))
names(p.list) <- levels(mtcars1$cyl)
#loop to calculate p-values for each level of mtcars$cyl
for(i in seq_along(levels(mtcars1$cyl))){
mtcars.sub <- mtcars1 %>% dplyr::filter(cyl == levels(.$cyl)[i])
lm.pval <- mtcars.sub %>%
dplyr::distinct(cyl) %>%
dplyr::mutate(P =
summary(lm(mpg ~ disp, data = mtcars.sub))$coefficients[2,4] ##extract P-value
)
p.list[[i]] <- lm.pval
}
#join p-values to dataset and add column to use with scale_color_identity()
mtcars.p <- mtcars1 %>% dplyr::left_join(dplyr::bind_rows(p.list, .id = "cyl"), by = "cyl") %>%
dplyr::mutate(p.color = ifelse(P < 0.05, "black",
ifelse(P < 0.10, "lightblue", NA)))
#plot
ggplot(data = mtcars.p, aes(x = disp, y = mpg)) +
geom_smooth(method = "lm",
se = FALSE,
aes(color = p.color)) +
geom_point() +
scale_color_identity(name = NULL,
na.translate = FALSE,
labels = c("P < 0.05", "P < 0.10"),
guide = "legend") +
facet_wrap(~cyl, scales = "free")
This seems like too many initial steps for something that should be relatively easy. Are these steps necessary, or is there a more efficient way of doing this? Can ggplot or any other packages out there do this on their own, without having to first extract p-values from lm()?
After fitting the regression, you can pass its coefficients to geom_abline() within ggplot:
myline<-lm(mpg ~ disp, data = mtcars)
ggplot(data = mtcars, aes(x = disp, y = mpg)) +
geom_abline(slope = coef(myline)[[2]], intercept = coef(myline)[[1]], color='blue')+
geom_point(color='red') +
scale_color_identity(name = NULL,
na.translate = FALSE,
labels = c("P < 0.05", "P < 0.10"),
guide = "legend") +
facet_wrap(~cyl, scales = "free")
Alternatively, geom_smooth() can fit and draw the line itself. It does not take slope or intercept arguments, so only the method is needed:
geom_smooth(method = "lm", se = FALSE, color = "blue") +
We can simplify the steps with a group-by operation. Instead of extracting each component by hand, tidy() from broom returns the output as a tibble:
library(broom)
library(dplyr)
library(tidyr)
mtcars1 %>%
group_by(cyl) %>%
summarise(out = list(tidy(lm(mpg ~ disp, data = cur_data())))) %>%
unnest(out)
-output
# A tibble: 6 x 6
cyl term estimate std.error statistic p.value
<fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 4 (Intercept) 40.9 3.59 11.4 0.00000120
2 4 disp -0.135 0.0332 -4.07 0.00278
3 6 (Intercept) 19.1 2.91 6.55 0.00124
4 6 disp 0.00361 0.0156 0.232 0.826
5 8 (Intercept) 22.0 3.35 6.59 0.0000259
6 8 disp -0.0196 0.00932 -2.11 0.0568
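The per-group p-values can then be mapped to the line colors from the question. The sketch below does the same thing in base R only, as an illustration under the question's thresholds; `p_by_cyl` and `line_colour` are made-up names.

```r
# p-value of the disp slope, fitted separately within each cyl group
p_by_cyl <- sapply(split(mtcars, mtcars$cyl), function(d) {
  summary(lm(mpg ~ disp, data = d))$coefficients["disp", "Pr(>|t|)"]
})

# Thresholds from the question: black below 0.05, lightblue below 0.10,
# NA (no line drawn) otherwise
line_colour <- ifelse(p_by_cyl < 0.05, "black",
                      ifelse(p_by_cyl < 0.10, "lightblue", NA_character_))

data.frame(cyl = names(p_by_cyl), P = p_by_cyl, p.color = line_colour)
```

The resulting data frame has one row per group and can be joined to the plotting data by cyl, in place of the loop and list from the question.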

ANOVA problems with RevoScaleR::rxGlm() in R

I build lots of GLMs, usually on large data sets with many model parameters. Base R's glm() function can't cope with that size and complexity, so I usually use RevoScaleR::rxGlm() instead.
However, I'd like to be able to do ANOVA tests on pairs of nested models, and I haven't found a way to do this with the model objects that rxGlm() creates, because R's anova() function won't work with them. RevoScaleR provides an as.glm() function which converts an rxGlm() object to a glm() object (sort of), but it doesn't work here.
For example:
library(dplyr)
data(mtcars)
# don't like having named rows
mtcars <- mtcars %>%
mutate(veh_name = rownames(.)) %>%
select(veh_name, everything())
# fit a GLM: mpg ~ everything else
glm_a1 <- glm(mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb,
data = mtcars,
family = gaussian(link = "identity"),
trace = TRUE)
summary(glm_a1)
# fit another GLM where gear is removed
glm_a2 <- glm(mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + carb,
data = mtcars,
family = gaussian(link = "identity"),
trace = TRUE)
summary(glm_a2)
# F test on difference
anova(glm_a1, glm_a2, test = "F")
works fine, but if instead I do:
library(dplyr)
data(mtcars)
# don't like having named rows
mtcars <- mtcars %>%
mutate(veh_name = rownames(.)) %>%
select(veh_name, everything())
glm_b1 <- rxGlm(mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb,
data = mtcars,
family = gaussian(link = "identity"),
verbose = 1)
summary(glm_b1)
# fit another GLM where gear is removed
glm_b2 <- rxGlm(mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + carb,
data = mtcars,
family = gaussian(link = "identity"),
verbose = 1)
summary(glm_b2)
# F test on difference
anova(as.glm(glm_b1), as.glm(glm_b2), test = "F")
I see the error message:
Error in qr.lm(object) : lm object does not have a proper 'qr'
component. Rank zero or should not have used lm(.., qr=FALSE)
The same problem cropped up on a previous SO posting: Error converting rxGlm to GLM but doesn't seem to have been solved.
Can anyone help please? If as.glm() isn't going to help here, is there some other way? Could I write a custom function to do this (stretching my coding abilities to their limit, I suspect!)?
Also, is SO the best forum, or would one of the other StackExchange forums be a better place to look for guidance?
Thank you.
Partial solution...
my_anova <- function (model_1, model_2, test_type)
{
# only applies for nested GLMs. How do I test for this?
cat("\n")
if(test_type != "F")
{
cat("Invalid function call")
}
else
{
# display model formulae
cat("Model 1:", format(model_1$formula), "\n")  # use the arguments, not the globals glm_b1/glm_b2
cat("Model 2:", format(model_2$formula), "\n")
if(test_type == "F")
{
if (model_1$df[2] < model_2$df[2]) # model 1 is big, model 2 is small
{
dev_s <- model_2$deviance
df_s <- model_2$df[2]
dev_b <- model_1$deviance
df_b <- model_1$df[2]
}
else # model 2 is big, model 1 is small
{
dev_s <- model_1$deviance
df_s <- model_1$df[2]
dev_b <- model_2$deviance
df_b <- model_2$df[2]
}
F <- (dev_s - dev_b) / ((df_s - df_b) * dev_b / df_b)
}
# still need to calculate the F tail probability however
# df of F: numerator: df_s - df_b
# df of F: denominator: df_b
F_test <- pf(F, df_s - df_b, df_b, lower.tail = FALSE)
cat("\n")
cat("F: ", round(F, 4), "\n")
cat("Pr(>F):", round(F_test, 4))
}
}
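The F statistic used in the function above can be sanity-checked on ordinary glm() fits, where anova() is available for comparison. This is a sketch with base R only; it does not touch rxGlm objects, and the two models are arbitrary nested examples.

```r
# Two nested Gaussian GLMs: `small` drops qsec from `big`
big   <- glm(mpg ~ wt + hp + qsec, data = mtcars)
small <- glm(mpg ~ wt + hp, data = mtcars)

dev_b <- deviance(big);   df_b <- df.residual(big)
dev_s <- deviance(small); df_s <- df.residual(small)

# Same formula as in my_anova(): change in deviance per extra parameter,
# divided by the dispersion estimate from the bigger model
f_stat <- ((dev_s - dev_b) / (df_s - df_b)) / (dev_b / df_b)
p_val  <- pf(f_stat, df_s - df_b, df_b, lower.tail = FALSE)

# Reference values from R's own nested-model test
ref <- anova(small, big, test = "F")
all.equal(f_stat, ref$F[2])
all.equal(p_val, ref$`Pr(>F)`[2])
```

If both comparisons return TRUE, the hand calculation agrees with anova() for base glm objects. Whether `$deviance` and `$df[2]` hold the equivalent quantities on an rxGlm object is taken from the function above, not verified here.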

dplyr: Evaluation error: object '.' not found with gamlss but all good with lm, gam, glm methods

Context: tidyverse and dplyr environment/work-flow.
I'd appreciate insights into how to resolve the following issue, which I have encountered while trying to work with collections of regression results.
This minimal reproducible example shows the issue:
mtcars %>%
gamlss(mpg ~ hp + wt + disp, data = .) %>%
model.frame()
The example below illustrates a broader context and works as expected (producing the images shown). It also works if all I do is change ~lm(...) to be ~glm(...) or ~gam(...):
library(tidyverse)
library(broom)
library(gamlss)
library(datasets)
mtcars %>%
nest(-am) %>%
mutate(am = factor(am, levels = c(0, 1), labels = c("automatic", "manual")),
fit = map(data, ~lm(mpg ~ hp + wt + disp, data = .)),
results = map(fit, augment)) %>%
unnest(results) %>%
ggplot(aes(x = mpg, y = .fitted)) +
geom_abline(intercept = 0, slope = 1, alpha = .2) + # Line of perfect fit
geom_point() +
facet_grid(am ~ .) +
labs(x = "Miles Per Gallon", y = "Predicted Value") +
theme_bw()
However, if I try to use ~gamlss(...) as follows:
mtcars %>%
nest(-am) %>%
mutate(am = factor(am, levels = c(0, 1), labels = c("automatic", "manual")),
fit = map(data, ~gamlss(mpg ~ hp + wt + disp, data = .)),
results = map(fit, augment)) %>%
unnest(results) %>%
ggplot(aes(x = mpg, y = .fitted)) +
geom_abline(intercept = 0, slope = 1, alpha = .2) + # Line of perfect fit
geom_point() +
facet_grid(am ~ .) +
labs(x = "Miles Per Gallon", y = "Predicted Value") +
theme_bw()
I observe the following error:
GAMLSS-RS iteration 1: Global Deviance = 58.7658
GAMLSS-RS iteration 2: Global Deviance = 58.7658
GAMLSS-RS iteration 1: Global Deviance = 76.2281
GAMLSS-RS iteration 2: Global Deviance = 76.2281
******************************************************************
Family: c("NO", "Normal")
Call: gamlss(formula = mpg ~ hp + wt + disp, data = .)
Fitting method: RS()
------------------------------------------------------------------
Mu link function: identity
Mu Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 43.811721 3.387118 12.935 4.05e-07 ***
hp 0.001768 0.021357 0.083 0.93584
wt -6.982534 1.998827 -3.493 0.00679 **
disp -0.019569 0.021460 -0.912 0.38559
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
------------------------------------------------------------------
Sigma link function: log
Sigma Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.8413 0.1961 4.29 0.00105 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
------------------------------------------------------------------
No. of observations in the fit: 13
Degrees of Freedom for the fit: 5
Residual Deg. of Freedom: 8
at cycle: 2
Global Deviance: 58.76579
AIC: 68.76579
SBC: 71.59054
******************************************************************
Error in mutate_impl(.data, dots) :
Evaluation error: object '.' not found.
In addition: Warning messages:
1: Deprecated: please use `purrr::possibly()` instead
2: Deprecated: please use `purrr::possibly()` instead
3: Deprecated: please use `purrr::possibly()` instead
4: Deprecated: please use `purrr::possibly()` instead
5: Deprecated: please use `purrr::possibly()` instead
6: In summary.gamlss(model) :
summary: vcov has failed, option qr is used instead
15: stop(list(message = "Evaluation error: object '.' not found.",
call = mutate_impl(.data, dots), cppstack = NULL))
14: .Call(`_dplyr_mutate_impl`, df, dots)
13: mutate_impl(.data, dots)
12: mutate.tbl_df(tbl_df(.data), ...)
11: mutate(tbl_df(.data), ...)
10: as.data.frame(mutate(tbl_df(.data), ...))
9: mutate.data.frame(., am = factor(am, levels = c(0, 1), labels = c("automatic",
"manual")), fit = map(data, ~gamlss(mpg ~ hp + wt + disp,
data = .)), results = map(fit, augment))
8: mutate(., am = factor(am, levels = c(0, 1), labels = c("automatic",
"manual")), fit = map(data, ~gamlss(mpg ~ hp + wt + disp,
data = .)), results = map(fit, augment))
7: function_list[[i]](value)
6: freduce(value, `_function_list`)
5: `_fseq`(`_lhs`)
4: eval(quote(`_fseq`(`_lhs`)), env, env)
3: eval(quote(`_fseq`(`_lhs`)), env, env)
2: withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
1: mtcars %>% nest(-am) %>% mutate(am = factor(am, levels = c(0,
1), labels = c("automatic", "manual")), fit = map(data, ~gamlss(mpg ~
hp + wt + disp, data = .)), results = map(fit, augment)) %>%
unnest(results) %>% ggplot(aes(x = mpg, y = .fitted))
Does anyone have a suggestion as to what needs to change to make this example work as expected?
I'd appreciate any insights into what is going wrong, why it does not work, and how to diagnose this sort of issue.
model.frame.gamlss does not consider the original environment of the data argument properly.
See my comments in the code below:
model.frame.gamlss <- function(formula, what = c("mu", "sigma", "nu", "tau"), parameter = NULL, ...)
{
object <- formula
dots <- list(...)
what <- if (!is.null(parameter)) {
match.arg(parameter, choices = c("mu", "sigma", "nu", "tau"))
} else match.arg(what)
Call <- object$call
parform <- formula(object, what)
data <- if (!is.null(Call$data)) {
# problem here, as Call$data is .
eval(Call$data)
# instead, this would work:
# eval(Call$data, environment(formula$mu.terms))
# (there is no formula$terms, just mu.terms and sigma.terms)
} else {
environment(formula$terms)
}
Terms <- terms(parform)
mf <- model.frame(
Terms,
data,
xlev = object[[paste(what, "xlevels", sep = ".")]]
)
mf
}
I guess an issue should be filed with the gamlss maintainer(s), unless this has been done already.
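The environment problem can be reproduced without gamlss. The sketch below uses lm() purely as a stand-in: the call stored in the fitted object refers to the data by a name that only exists where the model was fitted, so evaluating that name elsewhere fails unless the right environment is supplied. `fit_inside` and `local_data` are illustrative names.

```r
# Fit a model inside a function, so the data is only known by a local name
fit_inside <- function(local_data) {
  lm(mpg ~ wt, data = local_data)
}
fit <- fit_inside(mtcars)

# The stored call refers to the data by that local name
data_expr <- fit$call$data
data_expr

# Evaluating the name from here fails, because `local_data` does not exist here;
# tryCatch() turns the error into its message so the script keeps running
from_here <- tryCatch(eval(data_expr), error = function(e) conditionMessage(e))
from_here

# Evaluating it in the environment attached to the model terms works
from_formula_env <- eval(data_expr, environment(terms(fit)))
identical(from_formula_env, mtcars)
```

This is the same pattern as eval(Call$data) versus eval(Call$data, environment(formula$mu.terms)) in the function above, with `.` playing the role of `local_data`.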
Building on RolandASc's partial answer...
The intermediate datasets and the chart output of the original problem are reproduced by the following. An improvement from using this approach is that we do not require the intermediate step creating fit.
library(tidyverse)
library(broom)
library(gamlss)
library(datasets)
model.frame.gamlss <- function(formula, what = c("mu", "sigma", "nu", "tau"), parameter = NULL, ...)
{
object <- formula
dots <- list(...)
what <- if (!is.null(parameter)) {
match.arg(parameter, choices = c("mu", "sigma", "nu", "tau"))
} else match.arg(what)
Call <- object$call
parform <- formula(object, what)
data <- if (!is.null(Call$data)) {
# problem here, as Call$data is .
# eval(Call$data)
# instead, this would work:
eval(Call$data, environment(formula$mu.terms))
# (there is no formula$terms, just mu.terms and sigma.terms)
} else {
environment(formula$terms)
}
Terms <- terms(parform)
mf <- model.frame(
Terms,
data,
xlev = object[[paste(what, "xlevels", sep = ".")]]
)
mf
}
aug_func <- function(df){
augment(gamlss(mpg ~ hp + wt + disp, data=df))
}
mtcars %>%
mutate(am = factor(am, levels = c(0, 1), labels = c("automatic", "manual"))) %>%
group_by(am) %>%
do(aug=aug_func(df=.)) %>%
unnest(aug) %>%
ggplot(aes(x = mpg, y = .fitted)) +
geom_abline(intercept = 0, slope = 1, alpha = .2) + # Line of perfect fit
geom_point() +
facet_grid(am ~ .) +
labs(x = "Miles Per Gallon gamlss", y = "Predicted Value gamlss") +
theme_bw()
