Offsetting a GLM Factor in R - Standard Errors

Using the built-in mtcars data set I've just run the following bit of code:
my_mtcars <- mtcars
my_mtcars$cyl <- as.factor(my_mtcars$cyl)
my_mtcars$gear <- as.factor(my_mtcars$gear)
# Fit a simple GLM: mpg ~ cyl + gear
my_glm <- glm(mpg ~ cyl + gear, data = my_mtcars)
summary(my_glm)
I want to offset cyl. If I set the offset to the GLM parameter estimates, I should get the same model with respect to gear:
library(dplyr)  # case_when() comes from dplyr
my_mtcars$cyl_offset <- case_when(
  my_mtcars$cyl == 4 ~ 0,
  my_mtcars$cyl == 6 ~ -6.656,
  my_mtcars$cyl == 8 ~ -10.542
)
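Those hard-coded values are the cyl coefficients from the first fit, so (up to rounding) they can also be pulled programmatically rather than typed in:
coef(my_glm)[c("cyl6", "cyl8")]
# cyl6 ≈ -6.656, cyl8 ≈ -10.542 (the values used above)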
# Fit the same model, but use an offset instead of the normal term
my_glm <- glm(mpg ~ offset(cyl_offset) + gear, data = my_mtcars)
summary(my_glm)
I do indeed get the same parameter estimates (intercept + gear), and the same residual deviance, but smaller standard errors. I wasn't expecting that - should I have been?

Related

Multiple Linear Regression with character as dependent variable

I'm currently trying to perform a multiple linear regression on the voter turnout per state in the 2020 Presidential Election.
To create this regression model I would like to use the following variables: State, Total_Voters and Population.
When I try to run my linear regression I get the following error:
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : NA/NaN/Inf in 'y'
The dataset I've gathered is quite large. I have created a new dataframe with the variables which I need as follows:
Turnout_Rate_2020 <- sqldf("SELECT State_Full, F1a AS Total_Voters, population.Pop AS Population FROM e_2020 INNER JOIN population ON population.State = e_2020.State_Full")
After that I remove all NA values:
Turnout_Rate_2020[is.na(Turnout_Rate_2020)] <- 0
After that I go through the dataframe once more and filter out all the states which did not report:
Turnout_Rate_2020 <- sqldf("SELECT State_Full, Total_Voters, Population FROM Turnout_Rate_2020 WHERE Total_Voters <> 0 AND Total_Voters >= 0 GROUP BY State_Full")
In the end the dataframe looks like this, with the following summary (screenshots of the dataframe and its summary() output omitted):
However when I now try to run my multiple linear regression I get the error I have showcased above. The command looks like this:
lmTurnoutRate_2020 <- lm(State_Full ~ Population + Total_Voters, data = Turnout_Rate_2020)
I'm quite new to linear regressions but I'm eager to learn. I have looked through StackOverflow for quite a bit now, and couldn't figure it out.
It would be greatly appreciated if someone here would be able to assist me.
The full script at once:
Turnout_Rate_2020 <- sqldf("SELECT State_Full, F1a AS Total_Voters, population.Pop AS Population FROM e_2020 INNER JOIN population ON population.State = e_2020.State_Full")
# Change all NA to 0
Turnout_Rate_2020[is.na(Turnout_Rate_2020)] <- 0
summary(Turnout_Rate_2020)
# Select all again and filter out states which did not report. (values that were NA)
Turnout_Rate_2020 <- sqldf("SELECT State_Full, Total_Voters, Population FROM Turnout_Rate_2020 WHERE Total_Voters <> 0 AND Total_Voters >= 0 GROUP BY State_Full")
# Does not work and if I turn variables around I get NaN values.
lmTurnoutRate_2020 <- lm(State_Full ~ Population + Total_Voters, data = Turnout_Rate_2020)
summary(lmTurnoutRate_2020)
# Does not work
ggplot(lmTurnoutRate_2020, aes(x=State_Full,y=Population)) + geom_point() + geom_smooth(method=lm, level=0.95) + labs(x = "State", y = "Voters")
1) The input is missing from the question, so we will use mtcars and make cyl a character column. lm cannot handle a character dependent variable, but we can create a 0/1 model matrix from cyl and regress that, which performs a separate lm for each level of cyl. This is only applicable when the dependent variable has a small number of levels, as it does here, for example when it is naturally categorical or has been cut into a few levels.
(In this case we would probably rather use logistic regression, as with glm and family=binomial(), ordinal logistic regression, as with polr in MASS or the ordinal package, or multinom in the nnet package; but we will show it with lm just to show it can be done, even though it probably shouldn't be, since a 0/1 dependent variable is not sufficiently gaussian.)
mtcars2 <- transform(mtcars, cyl = as.character(cyl))
lm(model.matrix(~ cyl + 0) ~ hp, mtcars2)
giving:
Call:
lm(formula = model.matrix(~cyl + 0) ~ hp, data = mtcars2)
Coefficients:
                 cyl4       cyl6       cyl8
(Intercept)  1.052957   0.390688  -0.443645
hp          -0.004835  -0.001172   0.006007
With polr (which assumes the levels are ordered as they are with cyl):
library(MASS)
polr(cyl ~ hp, transform(mtcars2, cyl = factor(cyl)))
giving:
Call:
polr(formula = cyl ~ hp, data = transform(mtcars2, cyl = factor(cyl)))
Coefficients:
hp
0.1156849
Intercepts:
4|6 6|8
12.32592 17.25331
Residual Deviance: 20.35585
AIC: 26.35585
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
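If you go the polr route, you can also inspect the fitted class probabilities with predict() (a usage sketch; it assumes the polr fit above is saved to an object first):
library(MASS)
fit_polr <- polr(cyl ~ hp, transform(mtcars2, cyl = factor(cyl)))
head(predict(fit_polr, type = "probs"))  # one row per car, columns for levels 4, 6, 8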
The other possibility is that your dependent variable just happens to be represented as character because of how it was created, but could be made numeric with as.numeric(...). We can't tell without the input, but using our example we can do the following. Again, this is likely inappropriate here because cyl has only 3 values and so does not approximate a gaussian closely enough; your data may be different, though.
lm(cyl ~ hp, transform(mtcars2, cyl = as.numeric(cyl)))
giving:
Call:
lm(formula = cyl ~ hp, data = transform(mtcars2, cyl = as.numeric(cyl)))
Coefficients:
(Intercept) hp
3.00680 0.02168
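For completeness, here is a minimal sketch of the multinomial option mentioned earlier (no ordering among the levels is assumed; uses the nnet package):
library(nnet)
# Multinomial logistic regression of cyl (3 unordered classes) on hp
fit_mn <- multinom(factor(cyl) ~ hp, data = mtcars2)
summary(fit_mn)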

using "at" argument of margins function in R for logit model

I want to analyze the marginal effects of continuous and binary variables in a logit model. I am hoping for R to report the independent marginal effect of hp at a specified value (200 in this example), while also finding the marginal effect of the vs variable equaling 1. I would like the output table to include the SE, p value, and z score as well. I am having trouble with the table, and when I have gotten it to run it doesn't evaluate the two variables independently. Here is an MRE below. Thank you!
mod2 <- glm(am ~ hp + factor(vs), data=mtcars, family=binomial)
margins(mod2)
#> Average marginal effects
#> glm(formula = am ~ hp + factor(vs), family = binomial, data = mtcars)
#> hp vs1
#> -0.00203 -0.03154
#code where I am trying to evaluate at the desired values.
margins(mod2, at=list(hp=200, vs=1))
This is because you've changed vs to a factor.
Consider the following
library(margins)
mod3 <- glm(am ~ hp + vs, data=mtcars, family=binomial)
margins(mod3, at=list(hp=200, vs=1))
# Average marginal effects at specified values
# glm(formula = am ~ hp + vs, family = binomial, data = mtcars)
#
#  at(hp)  at(vs)         hp        vs
#     200       1  -0.001783  -0.02803
There is no real reason to turn vs into a factor here; it's dichotomous.
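If you also want the SE, z score, and p value the question asks for, wrapping the margins object in summary() should produce that table (a sketch; the exact columns can vary across margins versions):
library(margins)
m <- margins(mod3, at = list(hp = 200, vs = 1))
summary(m)  # one row per term with AME, SE, z, p, and confidence bounds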

How to run a linear regression using lm() on a subset in R after multiple imputation using MICE

I want to run a linear regression analysis on my multiple imputed data. I imputed my dataset using mice. The formula I used to run a linear regression on my whole imputed set is as follows:
mod1 <-with(imp, lm(outc ~ age + sex))
pool_mod1 <- pool(mod1)
summary(pool_mod1)
This works fine. Now I want to split the analysis by BMI: apply the regression to the group of people with a BMI below 30 and, separately, to the group with a BMI of 30 or above. I tried the following:
mod2 <-with(imp, lm(outc ~ age + sex), subset=(bmi<30))
pool_mod2 <- pool(mod2)
summary(pool_mod2)
mod3 <-with(imp, lm(outc ~ age + sex), subset=(bmi>=30))
pool_mod3 <- pool(mod3)
summary(pool_mod3)
I do not get an error, but the problem is that all three analyses give me exactly the same results. I thought this might just be what the data look like; however, if I use variables other than bmi (like blood pressure < 150), the same thing happens.
So my question is: how can I do subset analysis in R when the data is imputed using mice?
(BMI is imputed as well, I do not know if that is a problem?)
You should place subset within lm(), not outside of it.
with(imp, lm(outc ~ age + sex, subset=(bmi<30)))
A reproducible example.
with(mtcars, lm(mpg ~ disp + hp)) # Both produce the same
with(mtcars, lm(mpg ~ disp + hp), subset=(cyl < 6))
Coefficients:
(Intercept)         disp           hp
   30.73590     -0.03035     -0.02484
with(mtcars, lm(mpg ~ disp + hp, subset=(cyl < 6))) # Calculates on the subset
Coefficients:
(Intercept)         disp           hp
   43.04006     -0.11954     -0.04609
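Applied back to the mice workflow, the fix uses the question's own object names:
mod2 <- with(imp, lm(outc ~ age + sex, subset = (bmi < 30)))
pool_mod2 <- pool(mod2)
summary(pool_mod2)
The same pattern with subset = (bmi >= 30) gives the second group.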

Using column name of dataframe as predictor variable in linear regression

I'm trying to loop through all the column names of my data.frame and use them as predictor variables in a linear regression.
What I currently have is:
for (i in 1:11) {
  for (j in 1:11) {
    if (i != j) {
      var1 = names(newData)[i]
      var2 = names(newData)[j]
      glm.fit = glm(re78 ~ as.name(var1):as.name(var2), data=newData)
      summary(glm.fit)
      cv.glm(newData, glm.fit, K = 10)$delta[1]
    }
  }
}
Where newData is my data.frame and there are 11 columns in total. This code gives me the following error:
Error in model.frame.default(formula = re78 ~ as.name(var1), data = newData, :
invalid type (symbol) for variable 'as.name(var1)'
How can I fix this, and make it work?
It looks like you want models that use all combinations of two variables. Here's another way to do that using the built-in mtcars data frame for illustration and using mpg as the outcome variable.
We get all combinations of two variables (excluding the outcome variable, mpg in this case) using combn. combn returns a list where each list element is a vector containing the names of a pair of variables. Then we use map (from the purrr package) to create models for each pair of variables and store the results in a list.
We use reformulate to construct the model formula. .x refers back to the vectors of variables names (each element of vars). If you run, for example, reformulate(paste(c("cyl", "disp"),collapse="*"), "mpg"), you can see what reformulate is doing.
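For reference, that call returns the model formula directly:
reformulate(paste(c("cyl", "disp"), collapse="*"), "mpg")
# mpg ~ cyl * disp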
library(purrr)
# Get all combinations of two variables
vars = combn(names(mtcars)[-grep("mpg", names(mtcars))], 2, simplify=FALSE)
Now we want to run regression models on all pairs of variables and store results in a list:
# No interaction
models = map(vars, ~ glm(reformulate(.x, "mpg"), data=mtcars))
# Interaction only (no main effects)
models = map(vars, ~ glm(reformulate(paste(.x, collapse=":"), "mpg"), data=mtcars))
# Interaction and main effects
models = map(vars, ~ glm(reformulate(paste(.x, collapse="*"), "mpg"), data=mtcars))
Name each list element with the formula for that model:
names(models) = map(models, ~ .x[["terms"]])
To create the model formulas using paste instead of reformulate you could do (change + to : or *, depending on what combination of interactions and main effects you want to include):
models = map(vars, ~ glm(paste("mpg ~", paste(.x, collapse=" + ")), data=mtcars))
To see how paste is being used here, you can run:
paste("mpg ~", paste(c("cyl", "disp"), collapse=" * "))
Here's what the first two models look like when the models include both main effects and the interaction:
models[1:2]
$`mpg ~ cyl * disp`
Call: glm(formula = reformulate(paste(.x, collapse = "*"), "mpg"),
data = mtcars)
Coefficients:
(Intercept)          cyl         disp     cyl:disp
   49.03721     -3.40524     -0.14553      0.01585
Degrees of Freedom: 31 Total (i.e. Null); 28 Residual
Null Deviance: 1126
Residual Deviance: 198.1 AIC: 159.1
$`mpg ~ cyl * hp`
Call: glm(formula = reformulate(paste(.x, collapse = "*"), "mpg"),
data = mtcars)
Coefficients:
(Intercept)          cyl           hp       cyl:hp
   50.75121     -4.11914     -0.17068      0.01974
Degrees of Freedom: 31 Total (i.e. Null); 28 Residual
Null Deviance: 1126
Residual Deviance: 247.6 AIC: 166.3
To assess model output, you can use functions from the broom package. The code below returns data frames with, respectively, the coefficients and performance statistics for each model.
library(broom)
model_coefs = map_df(models, tidy, .id="Model")
model_performance = map_df(models, glance, .id="Model")
Here are what the results look like for models with both main effects and the interaction:
head(model_coefs, 8)
Model term estimate std.error statistic p.value
1 mpg ~ cyl * disp (Intercept) 49.03721186 5.004636297 9.798357 1.506091e-10
2 mpg ~ cyl * disp cyl -3.40524372 0.840189015 -4.052950 3.645320e-04
3 mpg ~ cyl * disp disp -0.14552575 0.040002465 -3.637919 1.099280e-03
4 mpg ~ cyl * disp cyl:disp 0.01585388 0.004947824 3.204212 3.369023e-03
5 mpg ~ cyl * hp (Intercept) 50.75120716 6.511685614 7.793866 1.724224e-08
6 mpg ~ cyl * hp cyl -4.11913952 0.988229081 -4.168203 2.672495e-04
7 mpg ~ cyl * hp hp -0.17068010 0.069101555 -2.469989 1.987035e-02
8 mpg ~ cyl * hp cyl:hp 0.01973741 0.008810871 2.240120 3.320219e-02
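Since the original question also calls cv.glm, cross-validation can be folded into the same pattern (a sketch, assuming the boot package is installed and the vars list from above):
library(boot)
library(purrr)
# 10-fold CV prediction error for each pair of predictors (main effects only)
cv_errors <- map_dbl(vars, function(v) {
  fit <- glm(reformulate(v, "mpg"), data = mtcars)
  cv.glm(mtcars, fit, K = 10)$delta[1]
})
names(cv_errors) <- map_chr(vars, paste, collapse = " + ")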
You can use fit <- glm(as.formula(paste0("re78 ~ ", var1)), data=newData) as @akrun suggests. Further, you likely do not want to call your object glm.fit, as there is a function with that name (stats::glm.fit).
Caveat: I do not know why you have the double loop and the :. Do you not want a regression with a single covariate? I have no idea what you are trying to achieve otherwise.

Linear regression with interaction fails in the rms-package

I'm playing around with interaction in the formula. I wondered if it's possible to do a regression with interaction for one of the two dummy variables. This seems to work in regular linear regression using the lm() function but with the ols() function in the rms package the same formula fails. Anyone know why?
Here's my example
data(mtcars)
mtcars$gear <- factor(mtcars$gear)
regular_lm <- lm(mpg ~ wt + cyl + gear + cyl:gear, data=mtcars)
summary(regular_lm)
regular_lm <- lm(mpg ~ wt + cyl + gear + cyl:I(gear == "4"), data=mtcars)
summary(regular_lm)
And now the rms example
library(rms)
dd <- datadist(mtcars)
options(datadist = "dd")
regular_ols <- ols(mpg ~ wt + cyl + gear + cyl:gear, data=mtcars)
regular_ols
# Fails with:
# Error in if (!length(fname) || !any(fname == zname)) { :
# missing value where TRUE/FALSE needed
regular_ols <- ols(mpg ~ wt + cyl + gear + cyl:I(gear == "4"), data=mtcars)
This experiment might not be the wisest statistical choice, as the estimates change substantially, but I'm a little curious why ols() fails, since it is supposed to use the "same fitting routines used by lm".
I don't know exactly, but it has to do with the way the formula is evaluated rather than with the way the fit is done once the model has been translated. Using traceback() shows that the problem occurs within Design(eval.parent(m)); using options(error=recover) gets you to the point where you can see that
Browse[1]> fname
[1] "wt" "cyl" "gear"
Browse[1]> zname
[1] NA
in other words, zname is some internal variable that hasn't been set right because the Design function can't quite handle defining the interaction between cylinders and the (gear==4) dummy on the fly.
This works though:
mtcars$cylgr <- with(mtcars, interaction(cyl, gear == "4"))
regular_ols <- ols(mpg ~ wt + cyl + gear + cylgr, data=mtcars)
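One housekeeping caveat with this workaround (an rms detail, not part of the original answer): the datadist object was built before cylgr existed, so post-fit tools that rely on it may not know the new column. Recomputing it is cheap:
# cylgr was created after datadist() ran; recompute so that rms post-fit
# functions such as summary() and Predict() know about the new column
dd <- datadist(mtcars)
options(datadist = "dd")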
