Suppose I have the following code to fit a hyperbolic paraboloid:
# attach(mtcars)
hp_fit <- lm(mpg ~ poly(wt, disp, degree = 2), data = mtcars)
Where wt is the x variable, disp is the y variable, and mpg is the z variable. (summary(hp_fit))$coefficients outputs the following:
>(summary(hp_fit))$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.866173 3.389734 6.7457122 3.700396e-07
poly(wt, disp, degree = 2)1.0 -13.620499 8.033068 -1.6955539 1.019151e-01
poly(wt, disp, degree = 2)2.0 15.331818 17.210260 0.8908534 3.811778e-01
poly(wt, disp, degree = 2)0.1 -9.865903 5.870741 -1.6805208 1.048332e-01
poly(wt, disp, degree = 2)1.1 -100.022013 121.159039 -0.8255431 4.165742e-01
poly(wt, disp, degree = 2)0.2 14.719928 9.874970 1.4906301 1.480918e-01
I do not understand how to interpret the varying numbers to the right of poly() in the row names below (Intercept). What is the significance of these numbers, and how would I construct an equation for the hyperbolic paraboloid fit from this summary?
When you compare
with(mtcars, poly(wt, disp, degree=2))
with(mtcars, poly(wt, degree=2))
with(mtcars, poly(disp, degree=2))
the 1.0 2.0 refer to the first and second degree of wt, and the 0.1 0.2 refer to the first and second degree of disp. The 1.1 is an interaction term. You may check this by comparing:
summary(lm(mpg ~ poly(wt, disp, degree=2, raw=T), data=mtcars))$coe
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 4.692786e+01 7.008139762 6.6961935 4.188891e-07
# poly(wt, disp, degree=2, raw=T)1.0 -1.062827e+01 8.311169003 -1.2787937 2.122666e-01
# poly(wt, disp, degree=2, raw=T)2.0 2.079131e+00 2.333864211 0.8908534 3.811778e-01
# poly(wt, disp, degree=2, raw=T)0.1 -3.172401e-02 0.060528241 -0.5241191 6.046355e-01
# poly(wt, disp, degree=2, raw=T)1.1 -2.660633e-02 0.032228884 -0.8255431 4.165742e-01
# poly(wt, disp, degree=2, raw=T)0.2 2.019044e-04 0.000135449 1.4906301 1.480918e-01
summary(lm(mpg ~ wt*disp + I(wt^2) + I(disp^2) , data=mtcars))$coe[c(1:2, 4:3, 6:5), ]
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 4.692786e+01 7.008139762 6.6961935 4.188891e-07
# wt -1.062827e+01 8.311169003 -1.2787937 2.122666e-01
# I(wt^2) 2.079131e+00 2.333864211 0.8908534 3.811778e-01
# disp -3.172401e-02 0.060528241 -0.5241191 6.046355e-01
# wt:disp -2.660633e-02 0.032228884 -0.8255431 4.165742e-01
# I(disp^2) 2.019044e-04 0.000135449 1.4906301 1.480918e-01
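Both calls yield the same values. Note that I used raw=TRUE for comparison purposes: by default poly() builds orthogonal polynomials, so the coefficients of the original fit are not on the scale of wt and disp, even though the fitted surface is the same.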
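To answer the second half of the question, the raw coefficients can be read directly as the surface z = b0 + b10*x + b20*x^2 + b01*y + b11*x*y + b02*y^2, with x = wt, y = disp and z = mpg. The following is a minimal sketch; the names fit_raw, b and paraboloid are made up for illustration, and it assumes the coefficient order shown in the output above.

```r
# Refit with raw polynomials so the coefficients apply to wt and disp directly
fit_raw <- lm(mpg ~ poly(wt, disp, degree = 2, raw = TRUE), data = mtcars)
b <- unname(coef(fit_raw))

# Coefficient order (from the row names): intercept, 1.0, 2.0, 0.1, 1.1, 0.2
paraboloid <- function(wt, disp) {
  b[1] + b[2] * wt + b[3] * wt^2 + b[4] * disp + b[5] * wt * disp + b[6] * disp^2
}

# The hand-built equation should reproduce the model's fitted values
manual <- with(mtcars, paraboloid(wt, disp))
all.equal(manual, unname(fitted(fit_raw)))
```

The same check does not work with the default orthogonal basis, because those coefficients multiply transformed columns rather than wt and disp themselves.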
library(tidyverse)
formulas <- list(
mpg ~ disp,
mpg ~ I(1 / disp),
mpg ~ disp + wt,
mpg ~ I(1 / disp) + wt
)
# this works
map(formulas, ~ {lm(.x, mtcars)})
# this doesn't
map(formulas, ~ {with(mtcars, lm(.x))})
Error in eval(predvars, data, env) : object 'disp' not found
Working through the exercises in https://adv-r.hadley.nz/functionals.html#exercises-28, I tried to solve exercise number 6, by trying to evaluate lm() inside mtcars environment with with(), but it throws an error.
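Why doesn't the last call work?
It is an environment issue: each formula remembers the environment in which it was created, and lm() looks there for disp rather than in the environment set up by with(). One option is to quote the formulas so that they are not evaluated until they are inside with():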
formulas <- list(
quote(mpg ~ disp),
quote(mpg ~ I(1 / disp)),
quote(mpg ~ disp + wt),
quote(mpg ~ I(1 / disp) + wt)
)
out1 <- map(formulas, ~ with(mtcars, lm(eval(.x))))
out1
#[[1]]
#Call:
#lm(formula = eval(.x))
#Coefficients:
#(Intercept) disp
# 29.59985 -0.04122
#[[2]]
#Call:
#lm(formula = eval(.x))
#Coefficients:
#(Intercept) I(1/disp)
# 10.75 1557.67
#[[3]]
#Call:
#lm(formula = eval(.x))
#Coefficients:
#(Intercept) disp wt
# 34.96055 -0.01772 -3.35083
#[[4]]
#Call:
#lm(formula = eval(.x))
#Coefficients:
#(Intercept) I(1/disp) wt
# 19.024 1142.560 -1.798
It should also work with the first method
out2 <- map(formulas, ~ lm(.x, mtcars))
The attributes and the stored call differ slightly, but if those are ignored the fits are the same:
out1[[1]]$call <- out2[[1]]$call
all.equal(out1[[1]], out2[[1]], check.attributes = FALSE)
#[1] TRUE
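Another way to see the environment issue is to change where the formula looks up its variables. This is only a sketch, not part of the original answer: fit_in_data is a hypothetical helper, and it assumes it is acceptable to replace the formula's environment with one built from the data frame's columns.

```r
formulas <- list(mpg ~ disp, mpg ~ I(1 / disp))

# Hypothetical helper: point the formula at an environment holding the columns
fit_in_data <- function(f, data) {
  # Keep the original environment as parent so other objects are still found
  environment(f) <- list2env(data, parent = environment(f))
  lm(f)  # no data argument: variables are resolved through environment(f)
}

fits <- lapply(formulas, fit_in_data, data = mtcars)
```

In practice passing data = mtcars to lm() is simpler; the sketch only illustrates that lm() resolves variables through the formula's environment when no data argument is given.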
I want to loop over the inclusion / exclusion of a certain variable, but I ran into an error. Here's the problem with some sample data.
data(mtcars)  # data() returns the dataset's name, so its result must not be assigned to mtcars
for(i in 0:1) {
fitlm = lm(mpg ~ cyl + i * drat, data = mtcars)
}
Error in model.frame.default(formula = mpg ~ cyl + i * drat, data = mtcars, : variable lengths differ (found for 'i')
But then this will run without a problem:
fitlm = lm(mpg ~ cyl + 0 * drat, data = mtcars)
fitlm = lm(mpg ~ cyl + 1 * drat, data = mtcars)
Why do the regressions work if there's a number multiplier of the variable, but fail if it's i?
Try using as.formula as follows:
# create an empty list to store the results
fitlm <- list()
# loop, fit the model and assign the result to a new list in fitlm
for(i in 0:1) {
fitlm[[i+1]] <- lm(as.formula(paste("mpg ~ cyl +", i, "* drat")), data = mtcars)
}
You can also use purrr::map instead of loops as follows:
fitlm <- purrr::map(c(0,1), ~lm(as.formula(paste("mpg ~ cyl +", .x, "* drat")), data = mtcars))
And the result will be:
> fitlm
[[1]]
Call:
lm(formula = as.formula(paste("mpg ~ cyl +", i, "* drat")),
data = mtcars)
Coefficients:
cyl
2.79
[[2]]
Call:
lm(formula = as.formula(paste("mpg ~ cyl +", i, "* drat")),
data = mtcars)
Coefficients:
(Intercept) cyl
37.885 -2.876
It's a bit of a hack, but you could try something of the form
fitlm = list()
for(i in 0:1) {
idrat = i*mtcars$drat
fitlm[[i+1]] = lm(mpg ~ cyl + idrat, data = mtcars)
}
which gives the result
fitlm
## [[1]]
##
## Call:
## lm(formula = mpg ~ cyl + idrat, data = mtcars)
##
## Coefficients:
## (Intercept) cyl idrat
## 37.885 -2.876 NA
##
##
## [[2]]
##
## Call:
## lm(formula = mpg ~ cyl + idrat, data = mtcars)
##
## Coefficients:
## (Intercept) cyl idrat
## 28.725 -2.484 1.872
This gets around the lm() function looking for interactions when it sees the * character, as you found when using a number.
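The reason the original loop fails is that, inside a formula, * is the crossing operator and i is treated as a model variable of length 1, hence "variable lengths differ". A less hacky variant of the workaround above wraps the product in I(), which makes * ordinary multiplication. This is a sketch under the assumption that i is visible from the environment where the formula is written:

```r
data(mtcars)
fitlm <- list()
for (i in 0:1) {
  # I() protects the arithmetic; i is looked up in the formula's environment
  fitlm[[i + 1]] <- lm(mpg ~ cyl + I(i * drat), data = mtcars)
}

# With i = 0 the extra column is all zeros, so its coefficient is NA;
# with i = 1 the fit matches mpg ~ cyl + drat
lapply(fitlm, coef)
```

One caveat: the stored formula still refers to i, so functions that re-evaluate it later (for example predict() with newdata) will use whatever value i has at that point.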
How do we print the equation of a line on a plot?
I have 2 independent variables and would like an equation like this:
y=mx1+bx2+c
where x1=cost, x2 =targeting
I can plot the best-fit line, but how do I print the equation on the plot?
Maybe I can't print the 2 independent variables in one equation, but how do I do it for, say,
y=mx1+c at least?
Here is my code:
fit=lm(Signups ~ cost + targeting)
plot(cost, Signups, xlab="cost", ylab="Signups", main="Signups")
abline(lm(Signups ~ cost))
I tried to automate the output a bit:
fit <- lm(mpg ~ cyl + hp, data = mtcars)
summary(fit)
##Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.90833 2.19080 16.847 < 2e-16 ***
## cyl -2.26469 0.57589 -3.933 0.00048 ***
## hp -0.01912 0.01500 -1.275 0.21253
plot(mpg ~ cyl, data = mtcars, xlab = "Cylinders", ylab = "Miles per gallon")
abline(coef(fit)[1:2])
## rounded coefficients for better output
cf <- round(coef(fit), 2)
## sign check to avoid having plus followed by minus for negative coefficients
eq <- paste0("mpg = ", cf[1],
ifelse(sign(cf[2])==1, " + ", " - "), abs(cf[2]), " cyl ",
ifelse(sign(cf[3])==1, " + ", " - "), abs(cf[3]), " hp")
## printing of the equation
mtext(eq, 3, line=-2)
Hope it helps,
alex
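The paste0() call above hard-codes two predictors. A possible generalisation to any number of terms is sketched below; lm_equation is a hypothetical helper name, and it assumes the model has at least one predictor besides the intercept and no NA coefficients.

```r
# Hypothetical helper: build "y = b0 + b1 x1 - b2 x2 ..." from a fitted lm
lm_equation <- function(fit, digits = 2) {
  cf <- round(coef(fit), digits)
  response <- all.vars(formula(fit))[1]
  # One " + b x" or " - b x" piece per non-intercept coefficient
  pieces <- paste0(ifelse(cf[-1] < 0, " - ", " + "), abs(cf[-1]), " ", names(cf)[-1])
  paste0(response, " = ", cf[1], paste(pieces, collapse = ""))
}

fit <- lm(mpg ~ cyl + hp, data = mtcars)
eq <- lm_equation(fit)
eq
```

The resulting string can be passed to mtext() or text() exactly like eq in the answer above.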
You use ?text. In addition, you should not use abline(lm(Signups ~ cost)), as this is a different model (see my answer on CV here: Is there a difference between 'controlling for' and 'ignoring' other variables in multiple regression). At any rate, consider:
set.seed(1)
Signups <- rnorm(20)
cost <- rnorm(20)
targeting <- rnorm(20)
fit <- lm(Signups ~ cost + targeting)
summary(fit)
# ...
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.1494 0.2072 0.721 0.481
# cost -0.1516 0.2504 -0.605 0.553
# targeting 0.2894 0.2695 1.074 0.298
# ...
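dev.new()  # open a new graphics device; windows() only exists on Windows
plot(cost, Signups, xlab="cost", ylab="Signups", main="Signups")
abline(coef(fit)[1:2])  # intercept and cost slope from the two-predictor model
text(-2, -2, adj=c(0,0), labels="Signups = .15 -.15cost + .29targeting")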
Here's a solution using tidyverse packages.
The key is the broom package, which simplifies the process of extracting model data. For example:
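library(tidyverse)  # dplyr, ggplot2 and the %>% pipe
library(broom)      # tidy(); not attached by library(tidyverse)
fit1 <- lm(mpg ~ cyl, data = mtcars)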
summary(fit1)
fit1 %>%
tidy() %>%
select(estimate, term)
Result
# A tibble: 2 x 2
estimate term
<dbl> <chr>
1 37.9 (Intercept)
2 -2.88 cyl
I wrote a function to extract and format the information using dplyr:
get_formula <- function(object) {
object %>%
tidy() %>%
mutate(
term = if_else(term == "(Intercept)", "", term),
sign = case_when(
term == "" ~ "",
estimate < 0 ~ "-",
estimate >= 0 ~ "+"
),
estimate = as.character(round(abs(estimate), digits = 2)),
term = if_else(term == "", paste(sign, estimate), paste(sign, estimate, term))
) %>%
summarize(terms = paste(term, collapse = " ")) %>%
pull(terms)
}
get_formula(fit1)
Result
[1] " 37.88 - 2.88 cyl"
Then use ggplot2 to plot the line and add a caption
mtcars %>%
ggplot(mapping = aes(x = cyl, y = mpg)) +
geom_point() +
geom_smooth(formula = y ~ x, method = "lm", se = FALSE) +
labs(
x = "Cylinders", y = "Miles per Gallon",
caption = paste("mpg =", get_formula(fit1))
)
Plot using geom_smooth()
Plotting a line this way really only makes sense for visualizing the relationship between two variables. As @Glen_b pointed out in the comments, the slope we get from modelling mpg as a function of cyl (-2.88) doesn't match the slope we get from modelling mpg as a function of cyl and other variables (-1.29). For example:
fit2 <- lm(mpg ~ cyl + disp + wt + hp, data = mtcars)
summary(fit2)
fit2 %>%
tidy() %>%
select(estimate, term)
Result
# A tibble: 5 x 2
estimate term
<dbl> <chr>
1 40.8 (Intercept)
2 -1.29 cyl
3 0.0116 disp
4 -3.85 wt
5 -0.0205 hp
That said, if you want to accurately plot the regression line for a model that includes variables that don't appear in the plot, use geom_abline() instead and get the slope and intercept using broom package functions. As far as I know, geom_smooth() formulas can't reference variables that aren't already mapped as aesthetics.
mtcars %>%
ggplot(mapping = aes(x = cyl, y = mpg)) +
geom_point() +
geom_abline(
slope = fit2 %>% tidy() %>% filter(term == "cyl") %>% pull(estimate),
intercept = fit2 %>% tidy() %>% filter(term == "(Intercept)") %>% pull(estimate),
color = "blue"
) +
labs(
x = "Cylinders", y = "Miles per Gallon",
caption = paste("mpg =", get_formula(fit2))
)
Plot using geom_abline()
I often want to run a list of models, like so:
data(mtcars)
ms <- lapply(list(
mpg ~ disp
, mpg ~ hp
), lm, data=mtcars)
I'd then like to be able to extract the call from these models, like so:
lapply(ms, getCall)
But I get this:
[[1]]
FUN(formula = X[[1L]], data = ..1)
[[2]]
FUN(formula = X[[2L]], data = ..1)
How can I get this:
[[1]]
lm(formula = mpg ~ disp, data=mtcars)
[[2]]
lm(formula = mpg ~ hp, data=mtcars)
(I figured out I could get it by just making a list of the models, like this:
ms <- list(
lm(mpg ~ disp, data=mtcars)
, lm(mpg ~ hp, data=mtcars)
)
But I'd prefer to avoid the repetition.)
I have a sense that this has to do with evaluating the formula in the right environment, but I don't know how to.
I think this is an issue with the scoping rules of lapply and non-standard evaluation in lm. However, you can get around it by creating a base model, and using update in an lapply call:
tlm <- lm(data=mtcars)
lapply(list(mpg~disp,mpg~hp), update, object=tlm)
[[1]]
Call:
lm(formula = mpg ~ disp, data = mtcars)
Coefficients:
(Intercept) disp
29.59985 -0.04122
[[2]]
Call:
lm(formula = mpg ~ hp, data = mtcars)
Coefficients:
(Intercept) hp
30.09886 -0.06823
I think lapply is invoking lm under a different name, maybe try this:
data(mtcars)
ms <- lapply(list(
mpg ~ disp
, mpg ~ hp
), function(x) lm(x, data=mtcars) )
EDIT:
We'll have to mess with substitute too then:
data(mtcars)
f <- function(x) {z <- x; eval(substitute(lm(z, data=mtcars))) }
ms <- lapply(list(
mpg ~ disp
, mpg ~ hp
), f)
R> ms
[[1]]
Call:
lm(formula = mpg ~ disp, data = mtcars)
Coefficients:
(Intercept) disp
29.59985476 -0.04121512
[[2]]
Call:
lm(formula = mpg ~ hp, data = mtcars)
Coefficients:
(Intercept) hp
30.09886054 -0.06822828
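The substitute() trick above can also be written with bquote(), which splices each formula into the call before it is evaluated. This is a sketch of the same idea rather than a different method; the names formulas and ms are just for illustration.

```r
data(mtcars)
formulas <- list(mpg ~ disp, mpg ~ hp)

# bquote() replaces .(f) with the formula itself, so the stored call
# contains mpg ~ disp rather than a placeholder such as X[[1L]]
ms <- lapply(formulas, function(f) eval(bquote(lm(.(f), data = mtcars))))

lapply(ms, getCall)
```

Because data = mtcars is written literally inside bquote(), the data argument is also stored as the symbol mtcars instead of ..1.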
If I do
data(mtcars)
m1 <- lm(mpg ~ cyl, data= mtcars, x= TRUE, y= TRUE)
then I can extract the p-value for the slope using summary(m1)$coefficients[2, 4].
But if I do
library(rms)
data(mtcars)
m2 <- ols(mpg ~ cyl, data= mtcars, x= TRUE, y= TRUE)
what do I need to do to extract the p-value for the slope?
You can use the corresponding extractor function, but you need to call summary.lm:
> coef(summary.lm(m2))
Estimate Std. Error t value Pr(>|t|)
Intercept 37.88458 2.0738436 18.267808 8.369155e-18
cyl -2.87579 0.3224089 -8.919699 6.112687e-10
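Once you have a coefficient matrix like the one above, the single p-value can be picked out by row and column name instead of by position. The sketch below uses the plain lm() fit m1 from the question so that it runs without rms; the assumption is that the same indexing applies to the matrix returned by coef(summary.lm(m2)), since it has the same row and column names for cyl.

```r
data(mtcars)
m1 <- lm(mpg ~ cyl, data = mtcars, x = TRUE, y = TRUE)

# Index by name: more robust than [2, 4] if terms are added or reordered
p_slope <- coef(summary(m1))["cyl", "Pr(>|t|)"]
p_slope
```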