Multiplying regression results in R

I just ran a regression in R. I would like to multiply each coefficient by a variable. How can I do it?
Coefficients:
                              Estimate Std. Error t value Pr(>|t|)
(Intercept)                  1.210e+00  7.715e-01   1.568  0.13108
SDCHRO_I                    -1.846e-01  2.112e-01  -0.874  0.39157
functional_cognitive_level3  4.941e-02  7.599e-02   0.650  0.52224
rev_per_members             -4.955e-06  5.827e-06  -0.850  0.40432
And I want something like this:
1.210e+00 + -1.846e-01 * var1 + 4.941e-02 * var2 + -4.955e-06 * var3
Is there a way to do it?

You can use model.matrix.
data(mtcars)
lm1 <- lm(mpg ~ cyl + disp + hp, data = mtcars)
# coefficients times the transposed design matrix reproduce the fitted values
res <- coef(lm1) %*% t(model.matrix(lm1))
all(res == predict(lm1))
#[1] TRUE
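If you just want that linear combination for one set of predictor values (the expression in the question), here is a minimal sketch against the same mtcars model; the values for cyl, disp, and hp are made up for illustration:
new_vals <- c(cyl = 6, disp = 200, hp = 120)        # hypothetical predictor values
sum(coef(lm1) * c(1, new_vals))                     # intercept + coefficient * value for each term
predict(lm1, newdata = as.data.frame(t(new_vals)))  # same number via predict()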

You can access the coefficients with model$coefficients.
For example, if you want to multiply all coefficients by 10, you can do
df = data.frame(x = runif(100), y = runif(100), z = runif(100))
mod = lm(formula = y ~ x*z, data = df)
mod$coefficients
#> (Intercept) x z x:z
#> 0.6449097 -0.1989884 -0.3962655 0.4621273
mod$coefficients*10
#> (Intercept) x z x:z
#> 6.449097 -1.989884 -3.962655 4.621273
Created on 2020-07-10 by the reprex package (v0.3.0)
However, if you want to do something like in your example, you need to access the individual coefficients with model$coefficients[i], e.g.
df = data.frame(x = runif(100), y = runif(100), z = runif(100))
mod = lm(formula = y ~ x*z, data = df)
mod$coefficients[1]*10
#> (Intercept)
#> 5.994662
mod$coefficients[2]*10
#> x
#> -1.687928
Created on 2020-07-10 by the reprex package (v0.3.0)
You can even do this dynamically by looping over the length of the coefficients object.
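For example, a minimal sketch of such a loop, using mod from above and hypothetical values for x and z (the interaction value is just their product):
vals <- c(x = 0.5, z = 0.2, `x:z` = 0.5 * 0.2)    # hypothetical predictor values
out <- mod$coefficients[1]                        # start with the intercept
for (i in 2:length(mod$coefficients)) {
  out <- out + mod$coefficients[i] * vals[i - 1]  # add coefficient * value
}
unname(out)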

Related

R Different Prediction Result for Formula Containing "%>%"

In R, I want to use predict to get a confidence interval for a certain x (x = 42) under the model y = (centered x) + (centered x)^2. I found two possible ways:
model1 = lm(y ~ scale(x, center=T, scale=F) + I( (scale(x, center=T, scale=F))^2 ), data=data)
model2 = lm(y ~ (x %>% scale(center=T, scale=F)) + I( (x %>% scale(center=T, scale=F))^2 ), data=data)
The summary results for the two models are the same. But when I ran:
predict(model1, data.frame(x=42), interval="confidence", level=0.95)
predict(model2, data.frame(x=42), interval="confidence", level=0.95)
The results are different. I am wondering why. Does R treat the above two formulas differently because of the usage of "%>%"?
The dataset is a practice dataset from Kutner's textbook SENIC.txt, y is the 11th column, x is the 12th column.
The issue here is with scale and not %>%. scale returns a matrix which seems to affect the outcome.
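A quick check, assuming any numeric vector x:
x <- rnorm(5)
class(scale(x))  # "matrix" "array" on recent R versions, not a plain numeric vector
dim(scale(x))    # 5 1: a one-column matrix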
One solution is to write a vector-to-vector equivalent of scale and use it:
library("magrittr") # for `%>%`
set.seed(1)
dataset = data.frame(x = rnorm(30))
dataset[["y"]] = 1 + (3 * dataset[["x"]]) + rnorm(30, mean = 0, sd = 0.1)
scale_vector = function(x, ...){
  stopifnot(inherits(x, "numeric"))
  scale(x, ...)[, 1]
}
lm(y ~ scale_vector(x, center=T, scale=F) + I( (scale_vector(x, center=T, scale=F))^2 ), data=dataset)
#>
#> Call:
#> lm(formula = y ~ scale_vector(x, center = T, scale = F) + I((scale_vector(x,
#> center = T, scale = F))^2), data = dataset)
#>
#> Coefficients:
#> (Intercept)
#> 1.27423
#> scale_vector(x, center = T, scale = F)
#> 2.99296
#> I((scale_vector(x, center = T, scale = F))^2)
#> -0.01645
lm(y ~ (x %>% scale_vector(center=T, scale=F)) + I( (x %>% scale_vector(center=T, scale=F))^2 ), data=dataset)
#>
#> Call:
#> lm(formula = y ~ (x %>% scale_vector(center = T, scale = F)) +
#> I((x %>% scale_vector(center = T, scale = F))^2), data = dataset)
#>
#> Coefficients:
#> (Intercept)
#> 1.27423
#> x %>% scale_vector(center = T, scale = F)
#> 2.99296
#> I((x %>% scale_vector(center = T, scale = F))^2)
#> -0.01645
Besides, if you do not mind using the tidyverse, this might be cleaner:
library("magrittr") # for `%>%`
set.seed(1)
dataset = tibble::tibble(x = rnorm(30),
                         y = 1 + (3 * x) + rnorm(30, mean = 0, sd = 0.1)
)
dataset %>%
  dplyr::mutate(x_scaled = scale(x, center = TRUE, scale = FALSE)[, 1]) %>%
  lm(y ~ x_scaled + I(x_scaled^2), data = .)
#>
#> Call:
#> lm(formula = y ~ x_scaled + I(x_scaled^2), data = .)
#>
#> Coefficients:
#> (Intercept) x_scaled I(x_scaled^2)
#> 1.27423 2.99296 -0.01645

How to conduct joint significance test in seemingly unrelated regression

I'm trying to conduct a joint test of significance in a seemingly unrelated regression setup with robust standard errors. I have three outcomes Y1, Y2, and Y3 and I want to conduct a joint hypothesis test against the null that the average effect of the treatment Z is zero on all three outcomes.
I think that I have the model set up correctly, but I don't think that I have the hypothesis.matrix set correctly in car::linearHypothesis.
Here's some data:
library(tibble)
library(car)
library(systemfit)
set.seed(343)
N = 800
dat <-
  tibble(
    U = rnorm(N),
    Z = rbinom(N, 1, 0.5),
    Y = 0.2 * Z + U,
    Y1 = Y + rnorm(N, sd = 0.3),
    Y2 = Y + rnorm(N, sd = 0.5),
    Y3 = Y + rnorm(N, sd = 0.5)
  )
Here's the seemingly unrelated regression fit:
sur <- systemfit(list(Y1 ~ Z, Y2 ~ Z, Y3 ~ Z), method = "SUR", data = dat)
summary(sur)
Which is identical to the OLS fit in this case:
ols <- lm(cbind(Y1, Y2, Y3) ~ Z, data = dat)
summary(ols)
Which is useful, because I need to estimate robust standard errors for this test:
linearHypothesis(ols, hypothesis.matrix = "Z = 0", white.adjust = "hc2")
This last line is the one I think is incorrect, because the individual coefficients all have lower p-values than the joint test. But I could be wrong?
Looks right to me. You'd get the same result by estimating the null model (ols0 below) and using anova() to test the difference between the estimated and null models.
library(tibble)
library(car)
#> Loading required package: carData
set.seed(343)
N = 800
dat <-
  tibble(
    U = rnorm(N),
    Z = rbinom(N, 1, 0.5),
    Y = 0.2 * Z + U,
    Y1 = Y + rnorm(N, sd = 0.3),
    Y2 = Y + rnorm(N, sd = 0.5),
    Y3 = Y + rnorm(N, sd = 0.5)
  )
ols <- lm(cbind(Y1, Y2, Y3) ~ Z, data = dat)
linearHypothesis(ols, hypothesis.matrix = "Z = 0")
#>
#> Sum of squares and products for the hypothesis:
#> Y1 Y2 Y3
#> Y1 3.201796 4.693391 3.359617
#> Y2 4.693391 6.879863 4.924734
#> Y3 3.359617 4.924734 3.525216
#>
#> Sum of squares and products for error:
#> Y1 Y2 Y3
#> Y1 829.5535 756.1586 770.0808
#> Y2 756.1586 965.5959 770.4636
#> Y3 770.0808 770.4636 980.0664
#>
#> Multivariate Tests:
#> Df test stat approx F num Df den Df Pr(>F)
#> Pillai 1 0.0073689 1.96972 3 796 0.11703
#> Wilks 1 0.9926311 1.96972 3 796 0.11703
#> Hotelling-Lawley 1 0.0074236 1.96972 3 796 0.11703
#> Roy 1 0.0074236 1.96972 3 796 0.11703
ols0 <- lm(cbind(Y1, Y2, Y3) ~ 1, data = dat)
anova(ols, ols0, test="Pillai")
#> Analysis of Variance Table
#>
#> Model 1: cbind(Y1, Y2, Y3) ~ Z
#> Model 2: cbind(Y1, Y2, Y3) ~ 1
#> Res.Df Df Gen.var. Pillai approx F num Df den Df Pr(>F)
#> 1 798 0.48198
#> 2 799 1 0.48257 0.0073689 1.9697 3 796 0.117
Created on 2022-07-08 by the reprex package (v2.0.1)

ggplot exponential smooth with tuning parameter inside exp

ggplot provides various "smoothing methods" or "formulas" that determine the form of the trend line. However, it is unclear to me how the parameters of the formula are specified and how I can get the exponential formula to fit my data. In other words, how do I tell ggplot that it should fit the parameter inside the exp()?
df <- data.frame(x = c(65,53,41,32,28,26,23,19))
df$y <- c(4,3,2,8,12,8,20,15)
x y
1 65 4
2 53 3
3 41 2
4 32 8
5 28 12
6 26 8
7 23 20
8 19 15
p <- ggplot(data = df, aes(x = x, y = y)) +
  geom_smooth(method = "glm", se = FALSE, color = "black", formula = y ~ exp(x)) +
  geom_point()
p
Problematic fit:
However, if the parameter inside the exponential is fitted, the form of the trend line becomes reasonable:
p <- ggplot(data = df, aes(x = x, y = y)) +
  geom_smooth(method = "glm", se = FALSE, color = "black", formula = y ~ exp(-0.09 * x)) +
  geom_point()
p
Here is an approach with method nls instead of glm.
You can pass additional parameters to nls with a list supplied in method.args =. Here we supply starting values for the a and r coefficients.
library(ggplot2)
ggplot(data = df, aes(x = x, y = y)) +
  geom_smooth(method = "nls", se = FALSE,
              formula = y ~ a * exp(r * x),
              method.args = list(start = c(a = 10, r = -0.01)),
              color = "black") +
  geom_point()
As discussed in the comments, the best way to get the coefficients on the graph is by fitting the model outside the ggplot call.
model.coeff <- coef(nls(y ~ a * exp(r * x), data = df, start = c(a = 50, r = -0.04)))
ggplot(data = df, aes(x = x, y = y)) +
  geom_smooth(method = "nls", se = FALSE,
              formula = y ~ a * exp(r * x),
              method.args = list(start = c(a = 50, r = -0.04)),
              color = "black") +
  geom_point() +
  geom_text(x = 40, y = 15,
            label = as.expression(substitute(italic(y) == a %.% italic(e)^(r %.% x),
                                             list(a = format(unname(model.coeff["a"]), digits = 3),
                                                  r = format(unname(model.coeff["r"]), digits = 3)))),
            parse = TRUE)
Firstly, to pass additional parameters to the function passed to the method param of geom_smooth, you can pass a list of named parameters to method.args.
Secondly, the problem you're seeing is that glm is placing the coefficient in front of the whole term: y ~ coef * exp(x) instead of inside: y ~ exp(coef * x) like you want. You could use optimization to solve the latter outside of glm, but you can fit it into the GLM paradigm by a transformation: a log link. This works because it's like taking the equation you want to fit, y = exp(coef * x), and taking the log of both sides, so you're now fitting log(y) = coef * x, which is equivalent to what you want to fit and works with the GLM paradigm. (This ignores the intercept. It also ends up in transformed link units, but it's easy enough to convert back if you like.)
You can run this outside of ggplot to see what the models look like:
df <- data.frame(
  x = c(65, 53, 41, 32, 28, 26, 23, 19),
  y = c(4, 3, 2, 8, 12, 8, 20, 15)
)
bad_model <- glm(y ~ exp(x), family = gaussian(link = 'identity'), data = df)
good_model <- glm(y ~ x, family = gaussian(link = 'log'), data = df)
# this is bad
summary(bad_model)
#>
#> Call:
#> glm(formula = y ~ exp(x), family = gaussian(link = "identity"),
#> data = df)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -7.7143 -2.9643 -0.8571 3.0357 10.2857
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 9.714e+00 2.437e+00 3.986 0.00723 **
#> exp(x) -3.372e-28 4.067e-28 -0.829 0.43881
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 41.57135)
#>
#> Null deviance: 278.00 on 7 degrees of freedom
#> Residual deviance: 249.43 on 6 degrees of freedom
#> AIC: 56.221
#>
#> Number of Fisher Scoring iterations: 2
# this is better
summary(good_model)
#>
#> Call:
#> glm(formula = y ~ x, family = gaussian(link = "log"), data = df)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -3.745 -2.600 0.046 1.812 6.080
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 3.93579 0.51361 7.663 0.000258 ***
#> x -0.05663 0.02054 -2.757 0.032997 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 12.6906)
#>
#> Null deviance: 278.000 on 7 degrees of freedom
#> Residual deviance: 76.143 on 6 degrees of freedom
#> AIC: 46.728
#>
#> Number of Fisher Scoring iterations: 6
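Since good_model is fit on the log link scale, you can back-transform its coefficients into the y ~ a * exp(r * x) form used in the nls answer above; a minimal sketch:
# mu = exp(b0 + b1 * x) = exp(b0) * exp(b1 * x), so:
a_hat <- exp(coef(good_model)[["(Intercept)"]])  # multiplier in front of exp()
r_hat <- coef(good_model)[["x"]]                 # rate inside exp()
c(a = a_hat, r = r_hat)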
From here, you can reproduce what geom_smooth is going to do: make a sequence of x values across the domain and use the predictions as the y values for the line:
# new data is a sequence across the domain of the model
new_df <- data.frame(x = seq(min(df$x), max(df$x), length = 501))
# `type = 'response'` because we want values for y back in y units
new_df$bad_pred <- predict(bad_model, newdata = new_df, type = 'response')
new_df$good_pred <- predict(good_model, newdata = new_df, type = 'response')
library(tidyr)
library(ggplot2)
new_df %>%
  # reshape to long form for ggplot
  gather(model, y, contains('pred')) %>%
  ggplot(aes(x, y)) +
  geom_line(aes(color = model)) +
  # plot original points on top
  geom_point(data = df)
Of course, it's a lot easier to let ggplot handle all that for you:
ggplot(df, aes(x, y)) +
  geom_smooth(
    method = 'glm',
    formula = y ~ x,
    method.args = list(family = gaussian(link = 'log'))
  ) +
  geom_point()

Writing a function to enclose another function with regression models [duplicate]

Goal: Run three regression models with three different outcome variables, as seen below, but ideally in a more efficient way than seen in the model1, model2, model3 version seen in the last three lines.
Specific question: How can I write a function that iterates over the set of dv's, creates a model + number indicator as an object (e.g. model1, model2, etc.), AND switches the dv (e.g. dv1, dv2, etc.)? I assume there is a for-loop and function solution to this but I am not getting it...
mydf <- data.frame(dv1 = rnorm(100),
                   dv2 = rnorm(100),
                   dv3 = rnorm(100),
                   iv1 = rnorm(100),
                   iv2 = rnorm(100),
                   iv3 = rnorm(100))
mymodel <- function(dv, df) {
  lm(dv ~ iv1 + iv2 + iv3, data = df)
}
model1 <- mymodel(dv = mydf$dv1, df = mydf)
model2 <- mymodel(dv = mydf$dv2, df = mydf)
model3 <- mymodel(dv = mydf$dv3, df = mydf)
Here's another approach using the tidyverse packages, since dplyr has more or less supplanted plyr.
library(tidyverse)
mydf <- data.frame(dv1 = rnorm(100),
                   dv2 = rnorm(100),
                   dv3 = rnorm(100),
                   iv1 = rnorm(100),
                   iv2 = rnorm(100),
                   iv3 = rnorm(100))
mymodel <- function(df) {
  lm(value ~ iv1 + iv2 + iv3, data = df)
}
mydf %>%
  gather("variable", "value", contains("dv")) %>%
  split(.$variable) %>%
  map(mymodel)
#> $dv1
#>
#> Call:
#> lm(formula = value ~ iv1 + iv2 + iv3, data = df)
#>
#> Coefficients:
#> (Intercept) iv1 iv2 iv3
#> -0.04516 -0.04657 0.08045 0.02518
#>
#>
#> $dv2
#>
#> Call:
#> lm(formula = value ~ iv1 + iv2 + iv3, data = df)
#>
#> Coefficients:
#> (Intercept) iv1 iv2 iv3
#> -0.03906 0.16730 0.10324 0.02500
#>
#>
#> $dv3
#>
#> Call:
#> lm(formula = value ~ iv1 + iv2 + iv3, data = df)
#>
#> Coefficients:
#> (Intercept) iv1 iv2 iv3
#> 0.018492 -0.162563 0.002738 0.179366
Created on 2018-11-26 by the reprex package (v0.2.1)
You could convert your data.frame to long form, with all the dv values in one column, and then use plyr's dlply to create the lms. This splits the data.frame on the specified column ("dvN"), applies the function to each piece, and returns a list of lms. I have changed the function slightly so that it just takes a data.frame, not the column separately.
Hope this gives what you need.
library(plyr)
library(tidyr)
mydf_l <- gather(mydf, dvN, Value, 1:3)
mymodel2 <- function(df) {
  lm(Value ~ iv1 + iv2 + iv3, data = df)
}
allmodels <- dlply(mydf_l, .(dvN), mymodel2)
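If you specifically want separate objects named model1, model2, and model3, as the question asks, here is a minimal base-R sketch using reformulate() and assign() (the list-based approaches above are usually easier to work with downstream):
for (i in 1:3) {
  dv <- paste0("dv", i)
  f  <- reformulate(c("iv1", "iv2", "iv3"), response = dv)  # builds dvN ~ iv1 + iv2 + iv3
  assign(paste0("model", i), lm(f, data = mydf))            # creates model1, model2, model3
}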

Suppress fixed effects coefficients in R

Is there a way to suppress the coefficients for fixed effects in a linear model when using the summary() function (e.g. the equivalent of the absorb() option in Stata)? For example, I would like the summary function to output just the intercept and x rather than the coefficients and standard errors for the factors as well:
frame <- data.frame(x = rnorm(100), y = rnorm(100), z = rep(c("A", "B", "C", "D"),25))
summary(lm(y~x + as.factor(z), data = frame))
Call:
lm(formula = y ~ x + as.factor(z), data = frame)
Residuals:
Min 1Q Median 3Q Max
-2.2417 -0.6407 0.1783 0.5789 2.4347
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.25829 0.19196 -1.346 0.1816
x 0.09983 0.09788 1.020 0.3104
...
Thanks.
You could do any of these:
mod <- lm(y~x + as.factor(z), data = frame)
coef(mod)[c("(Intercept)", "x")]
# (Intercept) x
# 0.12357491 -0.06430765
coef(mod)[grepl("Intercept|x", names(coef(mod)))]
# (Intercept) x
# 0.12357491 -0.06430765
coef(mod)[1:2]
# (Intercept) x
# 0.12357491 -0.06430765
mod$coefficients[1:2]
# (Intercept) x
# 0.12357491 -0.06430765
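If you also want the standard errors, t values, and p-values for just those two terms (not only the point estimates), a minimal sketch:
coef(summary(mod))[c("(Intercept)", "x"), ]  # full summary rows for the intercept and x only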
I can highly recommend the stargazer package:
frame <- data.frame(x = rnorm(100), y = rnorm(100), z = rep(c("A", "B", "C", "D"),25))
summary(lm(y~x + as.factor(z), data = frame))
model <- lm(y~x + as.factor(z), data = frame)
library(stargazer)
stargazer(model, type = "text")
# omit = "z" drops every coefficient whose name matches "z" (here the as.factor(z) terms)
stargazer(model, omit = "z", type = "text")
