Suppress fixed effects coefficients in R

Is there a way to suppress the coefficients for fixed effects in a linear model when using the summary() function (i.e., the equivalent of Stata's absorb() option)? For example, I would like summary() to report just the intercept and x, rather than the coefficients and standard errors for the factor levels as well:
frame <- data.frame(x = rnorm(100), y = rnorm(100), z = rep(c("A", "B", "C", "D"),25))
summary(lm(y~x + as.factor(z), data = frame))
Call:
lm(formula = y ~ x + as.factor(z), data = frame)
Residuals:
Min 1Q Median 3Q Max
-2.2417 -0.6407 0.1783 0.5789 2.4347
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.25829 0.19196 -1.346 0.1816
x 0.09983 0.09788 1.020 0.3104
...
Thanks.

You could do any of these:
mod <- lm(y~x + as.factor(z), data = frame)
coef(mod)[c("(Intercept)", "x")]
# (Intercept) x
# 0.12357491 -0.06430765
coef(mod)[grepl("Intercept|x", names(coef(mod)))]
# (Intercept) x
# 0.12357491 -0.06430765
coef(mod)[1:2]
# (Intercept) x
# 0.12357491 -0.06430765
mod$coefficients[1:2]
# (Intercept) x
# 0.12357491 -0.06430765
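If you also want to keep the standard errors and p-values for the retained terms (which the plain coef() subsetting above drops), you can subset the coefficient matrix returned by summary() instead. A small sketch, using printCoefmat() from the stats package (the same formatter summary.lm uses):

```r
set.seed(1)
frame <- data.frame(x = rnorm(100), y = rnorm(100),
                    z = rep(c("A", "B", "C", "D"), 25))
mod <- lm(y ~ x + as.factor(z), data = frame)

# summary()$coefficients is a matrix with columns
# Estimate, Std. Error, t value, Pr(>|t|)
coefs <- summary(mod)$coefficients
printCoefmat(coefs[c("(Intercept)", "x"), , drop = FALSE])
```

The `drop = FALSE` keeps the subset as a matrix so printCoefmat() can format it.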

I can highly recommend the stargazer package:
frame <- data.frame(x = rnorm(100), y = rnorm(100), z = rep(c("A", "B", "C", "D"), 25))
model <- lm(y ~ x + as.factor(z), data = frame)
library(stargazer)
stargazer(model, type="text")
stargazer(model, omit = "z", type="text")

Related

Code for a linear regression residuals scatterplot

I ran a linear regression
lm.fit <- lm(intp.trust~age+v225+age*v225+v240+v241+v242,data=intp.trust)
summary(lm.fit)
and get the following results
Call:
lm(formula = intp.trust ~ age + v225 + age * v225 + v240 + v241 +
v242, data = intp.trust)
Residuals:
Min 1Q Median 3Q Max
-1.32050 -0.33299 -0.04437 0.30899 2.35520
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.461e+00 2.881e-02 85.418 < 2e-16 ***
age -2.416e-03 5.144e-04 -4.697 2.66e-06 ***
v225 5.794e-04 1.574e-02 0.037 0.971
v240 2.111e-02 2.729e-03 7.734 1.07e-14 ***
v241 -1.177e-03 1.958e-04 -6.014 1.83e-09 ***
v242 -1.473e-02 4.166e-04 -35.354 < 2e-16 ***
age:v225 4.214e-06 3.101e-04 0.014 0.989
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4833 on 34845 degrees of freedom
(21516 observations deleted due to missingness)
Multiple R-squared: 0.05789, Adjusted R-squared: 0.05773
F-statistic: 356.8 on 6 and 34845 DF, p-value: < 2.2e-16
"consider the residuals from the regression above. compare the residual distributions for females and males using an appropriate graph?"
Males and females are coded in the variable v225. How do I go about creating this graph?
At first I created:
lm.res <- resid(lm.fit)
but I'm not sure what the next step is.
The graph is supposed to be a scatterplot of residuals with different colour for females and males.
I tried this, but it was not working:
ggplot(intp.trust, aes(x = intp.trust, y = lm.res, color = v225)) + geom_point()
In this line:
ggplot(intp.trust, aes(x = intp.trust, y = lm.res, color = v225)) + geom_point()
You are saying: "go look in the data.frame intp.trust for a variable called lm.res, and plot that as y"
But you created lm.res as standalone object, not as a column of intp.trust. Assign the residuals from your model to a new column in the data.frame like this:
intp.trust$lm.res <- resid(lm.fit)
And it should work. Example with dummy data:
library(ggplot2)
# generate data
true_function <- function(x, is_female) {
  ifelse(is_female, 5, 2) +
    ifelse(is_female, -1.5, 1.5) * x +
    rnorm(length(x))
}
set.seed(123)
dat <- data.frame(x = runif(200, 1, 5), is_female = rbinom(200, 1, .5))
dat$y <- with(dat, true_function(x, is_female))
# regression
lm_fit <- lm(y ~ x + as.factor(is_female), data=dat)
# add residuals to data.frame
dat$resid <- resid(lm_fit)
# plot
ggplot(dat, aes(x=x, y=resid, color=as.factor(is_female))) +
geom_point()
Here is a sample you could follow to get what you want:
# Sample Data
x_1 <- rnorm(100)
x_2 <- runif(100, 10, 30)
x_3 <- rnorm(100) * runif(100)
y <- rnorm(100, mean = 10)
gender <- sample(c("F", "M"), 100, replace = TRUE)  # one draw per row
df <- data.frame(x_1, x_2, x_3, y, gender)
# Fit model
lm.fit <- lm(y ~ x_1 + x_2 + x_1 * x_2 + x_3, data = df)
# Update data.frame
df$residuals <- lm.fit$residuals
# Scatter residuals
library(ggplot2)
ggplot(df) +
  geom_point(aes(x = as.numeric(row.names(df)), y = residuals, color = gender)) +
  labs(x = 'Index', y = 'Residual value', title = 'Residual scatter plot')

Multiplying regression results in R

I just ran a regression in R. I would like to multiply each coefficient by a variable. How can I do it?
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.210e+00 7.715e-01 1.568 0.13108
SDCHRO_I -1.846e-01 2.112e-01 -0.874 0.39157
functional_cognitive_level3 4.941e-02 7.599e-02 0.650 0.52224
rev_per_members -4.955e-06 5.827e-06 -0.850 0.40432
And I want something like this:
1.210e+00 + -1.846e-01 * var1 + 4.941e-02 * var2 + -4.955e-06 * var3
Is there a way to do it?
You can use model.matrix.
data(mtcars)
lm1 <- lm(mpg~cyl+disp+hp, data=mtcars)
res <- coef(lm1) %*% t(model.matrix(lm1))
all(res==predict(lm1))
#[1] TRUE
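The same idea works for a single hand-built observation: line up a vector of predictor values (with a leading 1 for the intercept) against coef() and sum the products. A sketch with made-up values for cyl, disp, and hp:

```r
data(mtcars)
lm1 <- lm(mpg ~ cyl + disp + hp, data = mtcars)

# 1 for the intercept, then cyl = 6, disp = 200, hp = 120 (made-up values)
newx <- c(1, 6, 200, 120)
sum(coef(lm1) * newx)  # same value predict() gives for that row
```

This is exactly the row-by-row computation that `coef(lm1) %*% t(model.matrix(lm1))` performs for the whole fitted dataset.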
You can access the coefficients with model$coefficients.
For example, if you want to multiply all coefficients with 10, you can do
df = data.frame(x = runif(100), y = runif(100), z = runif(100))
mod = lm(formula = y ~ x*z, data = df)
mod$coefficients
#> (Intercept) x z x:z
#> 0.6449097 -0.1989884 -0.3962655 0.4621273
mod$coefficients*10
#> (Intercept) x z x:z
#> 6.449097 -1.989884 -3.962655 4.621273
Created on 2020-07-10 by the reprex package (v0.3.0)
However, if you want to do it as in your example, you need to access the individual coefficients with model$coefficients[i], e.g.
df = data.frame(x = runif(100), y = runif(100), z = runif(100))
mod = lm(formula = y ~ x*z, data = df)
mod$coefficients[1]*10
#> (Intercept)
#> 5.994662
mod$coefficients[2]*10
#> x
#> -1.687928
Created on 2020-07-10 by the reprex package (v0.3.0)
You can even do this dynamically by looping over the length of the coefficients object.
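For instance, a loop-based sketch of that dynamic approach (the values in `vars` are hypothetical predictor values you would supply, with a leading 1 for the intercept):

```r
set.seed(42)
df <- data.frame(x = runif(100), y = runif(100), z = runif(100))
mod <- lm(formula = y ~ x * z, data = df)

# hypothetical values: 1 for the intercept, then x, z, and x:z
vars <- c(1, 0.5, 0.2, 0.5 * 0.2)

total <- 0
for (i in seq_along(mod$coefficients)) {
  total <- total + mod$coefficients[[i]] * vars[i]
}
total
```

The loop computes the same linear combination as the vectorized `sum(coef(mod) * vars)`.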

ggplot exponential smooth with tuning parameter inside exp

ggplot provides various "smoothing methods" or "formulas" that determine the form of the trend line. However, it is unclear to me how the parameters of the formula are specified, and how I can get the exponential formula to fit my data. In other words, how do I tell ggplot that it should fit the parameter inside the exp?
df <- data.frame(x = c(65,53,41,32,28,26,23,19))
df$y <- c(4,3,2,8,12,8,20,15)
x y
1 65 4
2 53 3
3 41 2
4 32 8
5 28 12
6 26 8
7 23 20
8 19 15
p <- ggplot(data = df, aes(x = x, y = y)) +
geom_smooth(method = "glm", se=FALSE, color="black", formula = y ~ exp(x)) +
geom_point()
p
Problematic fit:
However, if the parameter inside the exponential is fitted, the form of the trend line becomes reasonable:
p <- ggplot(data = df, aes(x = x, y = y)) +
geom_smooth(method = "glm", se=FALSE, color="black", formula = y ~ exp(-0.09 * x)) +
geom_point()
p
Here is an approach with method nls instead of glm.
You can pass additional parameters to nls via a list supplied to method.args =. Here we define starting values from which the a and r coefficients are fit.
library(ggplot2)
ggplot(data = df, aes(x = x, y = y)) +
geom_smooth(method = "nls", se = FALSE,
formula = y ~ a * exp(r * x),
method.args = list(start = c(a = 10, r = -0.01)),
color = "black") +
geom_point()
As discussed in the comments, the best way to get the coefficients on the graph is by fitting the model outside the ggplot call.
model.coeff <- coef(nls( y ~ a * exp(r * x), data = df, start = c(a = 50, r = -0.04)))
ggplot(data = df, aes(x = x, y = y)) +
geom_smooth(method = "nls", se = FALSE,
formula = y ~ a * exp(r * x),
method.args = list(start = c(a = 50, r = -0.04)),
color = "black") +
geom_point() +
geom_text(x = 40, y = 15,
label = as.expression(substitute(italic(y) == a %.% italic(e)^(r %.% x),
list(a = format(unname(model.coeff["a"]),digits = 3),
r = format(unname(model.coeff["r"]),digits = 3)))),
parse = TRUE)
Firstly, to pass additional parameters to the function passed to the method param of geom_smooth, you can pass a list of named parameters to method.args.
Secondly, the problem you're seeing is that glm is placing the coefficient in front of the whole term: y ~ coef * exp(x), instead of inside it: y ~ exp(coef * x), like you want.

You could use optimization to solve the latter outside of glm, but you can also fit it within the GLM paradigm via a transformation: a log link. This works because it is like taking the equation you want to fit, y = exp(coef * x), and taking the log of both sides, so you are now fitting log(y) = coef * x, which is equivalent to what you want and fits the GLM paradigm. (This ignores the intercept. It also ends up in transformed link units, but it is easy enough to convert back if you like.)
You can run this outside of ggplot to see what the models look like:
df <- data.frame(
  x = c(65, 53, 41, 32, 28, 26, 23, 19),
  y = c(4, 3, 2, 8, 12, 8, 20, 15)
)
bad_model <- glm(y ~ exp(x), family = gaussian(link = 'identity'), data = df)
good_model <- glm(y ~ x, family = gaussian(link = 'log'), data = df)
# this is bad
summary(bad_model)
#>
#> Call:
#> glm(formula = y ~ exp(x), family = gaussian(link = "identity"),
#> data = df)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -7.7143 -2.9643 -0.8571 3.0357 10.2857
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 9.714e+00 2.437e+00 3.986 0.00723 **
#> exp(x) -3.372e-28 4.067e-28 -0.829 0.43881
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 41.57135)
#>
#> Null deviance: 278.00 on 7 degrees of freedom
#> Residual deviance: 249.43 on 6 degrees of freedom
#> AIC: 56.221
#>
#> Number of Fisher Scoring iterations: 2
# this is better
summary(good_model)
#>
#> Call:
#> glm(formula = y ~ x, family = gaussian(link = "log"), data = df)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -3.745 -2.600 0.046 1.812 6.080
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 3.93579 0.51361 7.663 0.000258 ***
#> x -0.05663 0.02054 -2.757 0.032997 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 12.6906)
#>
#> Null deviance: 278.000 on 7 degrees of freedom
#> Residual deviance: 76.143 on 6 degrees of freedom
#> AIC: 46.728
#>
#> Number of Fisher Scoring iterations: 6
From here, you can reproduce what geom_smooth is going to do: make a sequence of x values across the domain and use the predictions as the y values for the line:
# new data is a sequence across the domain of the model
new_df <- data.frame(x = seq(min(df$x), max(df$x), length = 501))
# `type = 'response'` because we want values for y back in y units
new_df$bad_pred <- predict(bad_model, newdata = new_df, type = 'response')
new_df$good_pred <- predict(good_model, newdata = new_df, type = 'response')
library(tidyr)
library(ggplot2)
new_df %>%
# reshape to long form for ggplot
gather(model, y, contains('pred')) %>%
ggplot(aes(x, y)) +
geom_line(aes(color = model)) +
# plot original points on top
geom_point(data = df)
Of course, it's a lot easier to let ggplot handle all that for you:
ggplot(df, aes(x, y)) +
geom_smooth(
method = 'glm',
formula = y ~ x,
method.args = list(family = gaussian(link = 'log'))
) +
geom_point()
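If you also want the fitted curve's equation on the original scale, note that the log-link fit corresponds to y = exp(b0) * exp(b1 * x), i.e. the a and r of the nls parameterization above fall straight out of the coefficients. A sketch, refitting the same model as above:

```r
df <- data.frame(x = c(65, 53, 41, 32, 28, 26, 23, 19),
                 y = c(4, 3, 2, 8, 12, 8, 20, 15))
good_model <- glm(y ~ x, family = gaussian(link = 'log'), data = df)

b <- coef(good_model)
a_hat <- exp(b[["(Intercept)"]])  # multiplier a in y = a * exp(r * x)
r_hat <- b[["x"]]                 # rate r inside the exponential
c(a = a_hat, r = r_hat)
```

With the coefficients reported in the summary above (intercept 3.93579, slope -0.05663), this gives a multiplier of roughly exp(3.94) and a rate close to the -0.09 hand-tuned value from the question.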

How to use a variable in lm() function in R?

Let us say I have a dataframe (df) with two columns called "height" and "weight".
Let's say I define:
x = "height"
How do I use x within my lm() function? Neither df[x] nor just using x works.
Two ways:
Create a formula with paste
x = "height"
lm(paste0(x, '~', 'weight'), df)
Or use reformulate
lm(reformulate("weight", x), df)
Using a reproducible example with the mtcars dataset:
x = "cyl"
lm(paste0(x, '~', 'mpg'), data = mtcars)
#Call:
#lm(formula = paste0(x, "~", "mpg"), data = mtcars)
#Coefficients:
#(Intercept) mpg
# 11.2607 -0.2525
and same with
lm(reformulate("mpg", x), mtcars)
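reformulate() also scales to several predictors without any string pasting; for example (a sketch, again with mtcars):

```r
x <- "cyl"
f <- reformulate(c("mpg", "wt"), x)  # builds cyl ~ mpg + wt
lm(f, data = mtcars)
```

The first argument is the character vector of term labels for the right-hand side, and the second is the response.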
We can use glue to create the formula
x <- "height"
lm(glue::glue('{x} ~ weight'), data = df)
Using a reproducible example with mtcars
x <- 'cyl'
lm(glue::glue('{x} ~ mpg'), data = mtcars)
#Call:
#lm(formula = glue::glue("{x} ~ mpg"), data = mtcars)
#Coefficients:
#(Intercept) mpg
# 11.2607 -0.2525
When you run x = "height" you are assigning a string of characters to the variable x.
Consider this data frame:
df <- data.frame(
height = c(176, 188, 165),
weight = c(75, 80, 66)
)
If you want a regression using height and weight you can either do this:
lm(height ~ weight, data = df)
# Call:
# lm(formula = height ~ weight, data = df)
#
# Coefficients:
# (Intercept) weight
# 59.003 1.593
or this:
lm(df$height ~ df$weight)
# Call:
# lm(formula = df$height ~ df$weight)
#
# Coefficients:
# (Intercept) df$weight
# 59.003 1.593
If you really want to use x instead of height, you must have a variable called x (in your df or in your environment). You can do that by creating a new variable:
x <- df$height
y <- df$weight
lm(x ~ y)
# Call:
# lm(formula = x ~ y)
#
# Coefficients:
# (Intercept) y
# 59.003 1.593
Or by changing the names of existing variables:
names(df) <- c("x", "y")
lm(x ~ y, data = df)
# Call:
# lm(formula = x ~ y, data = df)
#
# Coefficients:
# (Intercept) y
# 59.003 1.593

Updating a linear model with lagged new variables

I have a base model y ~ x1 + x2.
I want to update the model to contain y ~ x1 + x2 + lag(x3, 2) + lag(x4, 2).
x3 and x4 are also dynamically selected.
fmla <- as.formula(paste('.~.', paste(c(x3, x4), collapse = '+')))
My update formula: update(fit, fmla)
I get an error from as.formula saying x3/x4 is not found. I understand the error, just not how to get around it to do what I want.
A possible solution for your problem can be:
# Data generating process
yX <- as.data.frame(matrix(rnorm(1000),ncol=5))
names(yX) <- c("y", paste("x",1:4,sep=""))
# Start with a linear model with x1 and x2 as explanatory variables
f1 <- as.formula(y ~ x1 + x2)
fit <- lm(f1, data=yX)
# Add lagged x3 and x4 variables
addvars <- c("x3", "x4")
fmla <- as.formula(paste('~.+', paste("lag(", addvars, ",2)", collapse = '+')))
update(fit, fmla)
# Call:
# lm(formula = y ~ x1 + x2 + lag(x3, 2) + lag(x4, 2), data = yX)
#
# Coefficients:
# (Intercept) x1 x2 lag(x3, 2) lag(x4, 2)
# -0.083180 0.015753 0.041998 0.000612 -0.093265
Below an example with the dynlm package.
data("USDistLag", package = "lmtest")
# Start with a dynamic linear model with gnp as the explanatory variable
library(dynlm)
f1 <- as.formula(consumption ~ gnp)
( fit <- dynlm(f1, data=USDistLag) )
# Time series regression with "ts" data:
# Start = 1963, End = 1982
#
# Call:
# dynlm(formula = f1, data = USDistLag)
#
# Coefficients:
# (Intercept) gnp
# -24.0889 0.6448
# Add lagged gnp
addvars <- c("gnp")
fmla <- as.formula(paste('~.+',paste("lag(",addvars,",2)", collapse = '+')))
update(fit, fmla)
# Time series regression with "ts" data:
# Start = 1963, End = 1980
#
# Call:
# dynlm(formula = consumption ~ gnp + lag(gnp, 2), data = USDistLag)
#
# Coefficients:
# (Intercept) gnp lag(gnp, 2)
# -31.1437 0.5366 0.1067
