R: modeling on residuals

I have heard people talk about "modeling on the residuals" when they want to estimate some effect after an a priori model has been made. For example, if they know that two variables, var_1 and var_2, are correlated, they first make a model with var_1 and then model the effect of var_2 afterwards. My problem is that I've never seen this done in practice.
I'm interested in the following:
1. If I'm using a glm, how do I account for the link function used?
2. What distribution do I choose when running a second glm with var_2 as the explanatory variable? (I assume this is related to 1.)
3. Is this at all related to using the first model's prediction as an offset in the second model?
My attempt:
library(data.table)  # the example below uses data.table syntax
dt <- data.table(mtcars) # I have a hypothesis that `mpg` is a function of both `cyl` and `wt`
dt[, cyl := as.factor(cyl)]
model <- stats::glm(mpg ~ cyl, family=Gamma(link="log"), data=dt) # I want to model `cyl` first
dt[, pred := stats::predict(model, type="response", newdata=dt)]
dt[, res := mpg - pred]
# will this approach work?
model2_1 <- stats::glm(mpg ~ wt + offset(pred), family=Gamma(link="log"), data=dt)
dt[, pred21 := stats::predict(model2_1, type="response", newdata=dt)]
# or will this approach work?
model2_2 <- stats::glm(res ~ wt, family=gaussian(), data=dt)
dt[, pred22 := stats::predict(model2_2, type="response", newdata=dt)]
My first suggested approach has convergence issues, but this is how my silly brain would approach this problem. Thanks for any help!

In a sense, an ANCOVA is 'modeling on the residuals'. The model for ANCOVA is y_i = grand_mean + treatment_i + b * (covariate - covariate_mean_i) + error for each treatment i. The term (covariate - covariate_mean_i) can be seen as the residuals of a model with covariate as DV and treatment as IV.
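To make the residual interpretation concrete with the variables from the question (my addition, using mtcars): group-mean-centering the covariate is exactly the same as taking the residuals from a regression of the covariate on the treatment factor.
# Residuals of wt regressed on the treatment factor cyl ...
wt_centered <- resid(lm(wt ~ factor(cyl), data = mtcars))
# ... equal wt minus its cyl-group mean
all.equal(unname(wt_centered), mtcars$wt - ave(mtcars$wt, mtcars$cyl))  # TRUE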
The following regression is equivalent to this ANCOVA:
lm(y ~ treatment * scale(covariate, scale = FALSE))
Which applied to the data would look like this:
lm(mpg ~ factor(cyl) * scale(wt, scale = FALSE), data = mtcars)
And can be turned into a glm similar to the one you use in your example:
glm(mpg ~ factor(cyl) * scale(wt, scale = FALSE),
    family = Gamma(link = "log"),
    data = mtcars)
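A quick sanity check (my addition): centering wt with scale(wt, scale = FALSE) only reparameterizes the model; the factor(cyl) coefficients change because they are now evaluated at the mean weight, but the fit itself is unchanged.
fit_centered <- lm(mpg ~ factor(cyl) * scale(wt, scale = FALSE), data = mtcars)
fit_raw      <- lm(mpg ~ factor(cyl) * wt, data = mtcars)
all.equal(fitted(fit_centered), fitted(fit_raw))  # TRUE, identical fitted values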

Related

Converting logistic regression output from log odds to probability

I initially made this model for a class. Looking back at it, I found that, when I tried to convert my logistic regression output to probability, I got values greater than 1. I am using the following dataset: https://stats.idre.ucla.edu/stat/data/binary.csv
My code to set up the model:
mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
mydata$rank<- factor(mydata$rank)
mylogit <- glm(admit ~ gre + gpa + rank, data=mydata, family="binomial")
summary(mylogit)
Now I exponentiate these coefficients to get the odds (stored in odds):
odds <- exp(coef(mylogit))
and convert the odds to probability:
odds/(1 + odds)
# (Intercept) gre gpa rank2 rank3 rank4
# 0.01816406 0.50056611 0.69083749 0.33727915 0.20747653 0.17487497
This output does not make sense: a probability must be less than 1, and if GRE is 300, GPA is 3, and rank2 is true (all reasonable values), combining these terms would give a "probability" much greater than 1.
What is my mistake here? What would be the correct way to convert this to probability?
Your formula odds/(1 + odds) is being applied to each exponentiated coefficient on its own (an odds ratio), which does not give you a probability; you need the sigmoid (inverse logit) of the full linear predictor.
You need to sum all the variable terms before applying the sigmoid function.
You need to multiply the model coefficients by actual x values; otherwise you are implicitly assuming all the x's are equal to 1.
Here is an example using the mtcars data set:
mod <- glm(vs ~ mpg + cyl + disp, mtcars, family="binomial")
z <- coef(mod)[1] + sum(coef(mod)[-1]*mtcars[1, c("mpg", "cyl", "disp")])
1/(1 + exp(-z))
# 0.3810432
which we can verify using
predict(mod, mtcars[1, c("mpg", "cyl", "disp")], type="response")
# 0.3810432
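As a small aside (my addition): base R's plogis() is exactly this sigmoid, so the manual formula can be replaced with it.
plogis(z)  # computes 1/(1 + exp(-z)), the inverse logit
# 0.3810432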

fixest vs lm - different results? (difference in difference)

I'm trying to do a 'classic' difference in difference with multiple time periods. The model I want to do is:
y = a + b1*x1 + b2*treat + b3*period + b4*(treat * period) + u   (eq. 1)
So basically I'm testing different setups just to make sure I specify the model correctly, using different packages. I want to use the fixest package, so I compared its estimates with those from base R's lm(). The results, however, differ -- both the coefficients and the standard errors.
My question is:
Are any of the lm_mod, lm_mod2, or feols_mod regressions specified correctly (as in eq. 1)?
If not, I would appreciate it if anyone can show me how to get the same results in lm() as in feols()!
# libraries
library(fixest)
library(modelsummary)
library(tidyverse)
# load data
data(base_did)
# make df for lm_mod with 5 as the reference-period
base_ref_5 <- base_did %>%
mutate(period = as.factor(period)) %>%
mutate(period = relevel(period, ref = 5))
# Notice that i use base_ref_5 for the lm model and base_did for the feol_mod.
lm_mod <- lm(y ~ x1 + treat*period, base_ref_5)
lm_mod2 <- lm(y ~ x1 + treat + period + treat*period, base_ref_5)
feols_mod <- feols(y ~ x1 + i(period, treat, ref = 5), base_did)
# compare models
models <- list("lm" = lm_mod,
"lm2" = lm_mod2,
"feols" = feols_mod)
msummary(models, stars = T)
EDIT:
The reason I created base_ref_5 was so that both regressions would use period 5 as the reference period, in case that was unclear.
EDIT 2:
I added a third model (lm_mod2), which is much closer, but there is still a difference.
There are two issues here.
In the lm() model, the period variable is interacted with treat but treated as a continuous numeric variable. In contrast, i(period, treat) treats period as a factor (this is explained clearly in the documentation).
The i() function only includes the interactions, not the constitutive terms.
Here are two models to illustrate the parallels:
library(fixest)
data(base_did)
lm_mod <- lm(y ~ x1 + factor(period) * factor(treat), base_did)
feols_mod <- feols(y ~ x1 + factor(period) + i(period, treat), base_did)
coef(lm_mod)["x1"]
#> x1
#> 0.9799697
coef(feols_mod)["x1"]
#> x1
#> 0.9799697
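If it helps (my addition), the msummary() comparison from the question can be reused here to inspect the full coefficient tables side by side; just note that the two models parameterize the period-by-treatment terms differently, so only comparable terms such as x1 will line up exactly.
library(modelsummary)
# side-by-side coefficient tables for the two fits
msummary(list("lm" = lm_mod, "feols" = feols_mod), stars = TRUE)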
Please note that I only answered the part of your question about parallels between lm and feols. StackOverflow is a programming Q&A site. If you have questions about the proper specification of a statistical model, you might want to ask on CrossValidated.

Getting the same results for two different models in glm() in RStudio

I'm new to R and am having some trouble with the glm() function.
My code is shown below. When the linear predictor is just x, glm() works fine, but as soon as I change the linear predictor to x + x^2, it gives me the same results I got for the first model.
The code is as follows:
model1 <- glm(y ~ x, data=data1, family=poisson (link="log"))
coef(model1)
(Intercept) x
0.3396339 0.2565236
model2 <- glm(y ~ x + x^2, data=data1, family=poisson (link="log"))
coef(model2)
(Intercept) x
0.3396339 0.2565236
As you can see there's no coefficient for x^2 as if it's not even in the model.
The lm and glm functions have a special interpretation of the formula (see ?formula), which can be confusing if you are not expecting it. In formula syntax, ^ means factor crossing rather than arithmetic: (w + x)^2 expands to w + x + w:x (both main effects plus their interaction), and x^2 on its own collapses to just x, which is why your second model is identical to the first. If you want the literal arithmetic square, you need to wrap the term in the identity function, I().
model2 <- glm(gear ~ disp + I(disp^2),
data = mtcars, family = poisson (link = "log"))
coef(model2)
# (Intercept) disp I(disp^2)
# 1.542059e+00 -1.248689e-03 6.578518e-07
Put another way, I() allows you to perform transformations inside the call to glm(). The following is equivalent:
mtcars1 <- mtcars
mtcars1$disp_sq <- mtcars1$disp^2
model2a <- glm(gear ~ disp + disp_sq,
data = mtcars1, family = poisson (link = "log"))
coef(model2a)
# (Intercept) disp disp_sq
# 1.542059e+00 -1.248689e-03 6.578518e-07
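A possible alternative (my addition, not part of the original answer): poly() with raw = TRUE builds the disp and disp^2 columns in a single term and should reproduce the same fit, just with different coefficient names.
model2b <- glm(gear ~ poly(disp, 2, raw = TRUE),
               data = mtcars, family = poisson(link = "log"))
coef(model2b)  # same estimates as model2/model2a, named poly(disp, 2, raw = TRUE)1 and 2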

Naming conventions of predictor variables

Specific Examples:
log1 <- glm(Outcome ~ Predictor1 + Predictor2, family = binomial(link="logit"),
data=data)
log2 <- glm(data$Outcome ~ data$Predictor1 + data$Predictor2,
family = binomial(link="logit"))
These produce the same models, and their summaries are identical.
Then why, when using these models to predict an outcome from test data, do the values differ?
Example:
predict(log1, type = "response", newdata = test_dat) ==
  predict(log2, type = "response", newdata = test_dat)
# FALSE
I am not as familiar with R as I would like, but I can't seem to explain the differences. Help?
To compare two objects use identical(log1, log2); however, the problem is that the variable names are part of the fitted objects, so if the names differ the objects cannot be identical even if all the underlying numbers are the same.
For example, note how Time and BOD$Time are part of fm1 and fm2:
fm1 <- lm(demand ~ Time, BOD)
fm2 <- lm(BOD$demand ~ BOD$Time)
fm1[[1]]
## (Intercept) Time
## 8.521429 1.721429
fm2[[1]]
## (Intercept) BOD$Time
## 8.521429 1.721429
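As for why the predictions then differ (my addition, a likely explanation rather than something from the original answer): because fm2's terms are literally BOD$demand and BOD$Time, predict() cannot find them in newdata, so it falls back to the original vectors and effectively ignores newdata.
newd <- data.frame(Time = c(3, 6))
predict(fm1, newdata = newd)   # 2 predictions computed from newd$Time
predict(fm2, newdata = newd)   # 6 values (with a warning): BOD$Time is looked up
                               # in the workspace, so newdata is effectively ignored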

Intercept of random effect - library(sommer)

When I run this mixed model, I get all of the statistics I need.
library(sommer)
data(example)
#Model without intercept - OK
ans1 <- mmer2(Yield~Env,
random= ~ Name + Env:Name,
rcov= ~ units,
data=example, silent = TRUE)
summary(ans1)
ans1$u.hat #Random effects
However, if I try to add an intercept to the random effects, as I would in the lme4 package, I get an error like:
Error in dimnames(x) <- dn :
length of 'dimnames' [2] not equal to array extent
#Model with intercept
ans2 <- mmer2(Yield~Env,
random= ~ 1+Name + Env:Name,
rcov= ~ units,
data=example, silent = TRUE)
summary(ans2)
ans2$u.hat #Random effects
How can I overcome that?
Your model:
ans1 <- mmer2(Yield~Env,
random= ~ Name + Env:Name,
rcov= ~ units,
data=example, silent = TRUE)
is equivalent to:
ans1.lmer <- lmer(Yield~Env + (1|Name) + (1|Env:Name),
data=example)
using lme4. Please note that lme4 uses the notation (x|y) to specify, for example, different intercepts (the x term) for each level of the grouping term (the y term), which is a random regression model. If you specify:
ans2.lmer <- lmer(Yield~Env + (Env|Name),
data=example)
you get three variance components, one for each of the 3 levels in the Env term. The equivalent in sommer is not a random regression but a heterogeneous variance model using the diag() functionality:
ans2 <- mmer2(Yield~Env,
random= ~ diag(Env):Name,
rcov= ~ units,
data=example, silent = TRUE)
## or in sommer >=3.7
ans2 <- mmer(Yield~Env,
random= ~ vs(ds(Env),Name),
rcov= ~ units,
data=example, silent = TRUE)
The first two models above are equivalent because both assume no separate intercepts, whereas the last two models tackle the same problem with two different approaches that are not exactly the same: random regression versus a heterogeneous variance model.
In short, sommer doesn't have random regression implemented yet, so you cannot use random intercepts in sommer the way you do in lme4; use a heterogeneous variance model instead.
Cheers,
I know it is not an elegant solution, but how about adding an intercept column to the data so you can use it directly in the model?
What I mean is:
example <- cbind(example, inter=1)
ans2 <- mmer2(Yield~Env,
random= ~ Name + Env:Name + inter, #here inter are 1's
rcov= ~ units,
data=example, silent = TRUE)
summary(ans2)
ans2$u.hat
