Multiple Linear Regression with a character dependent variable in R

I'm currently trying to perform a multiple linear regression on the voter turnout per state in the 2020 Presidential Election.
To create this regression model I would like to use the following variables: State, Total_Voters and Population.
When I try to run my linear regression I get the following error:
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : NA/NaN/Inf in 'y'
The dataset I've gathered is quite large. I have created a new dataframe with the variables which I need as follows:
Turnout_Rate_2020 <- sqldf("SELECT State_Full, F1a AS Total_Voters, population.Pop AS Population FROM e_2020 INNER JOIN population ON population.State = e_2020.State_Full")
After that I replace all NA values with 0:
Turnout_Rate_2020[is.na(Turnout_Rate_2020)] <- 0
After that I go through the dataframe once more and filter out all the states that did not report:
Turnout_Rate_2020 <- sqldf("SELECT State_Full, Total_Voters, Population FROM Turnout_Rate_2020 WHERE Total_Voters <> 0 AND Total_Voters >= 0 GROUP BY State_Full")
The resulting dataframe and its summary() output look as expected (screenshots omitted).
However, when I now try to run my multiple linear regression I get the error shown above. The command looks like this:
lmTurnoutRate_2020 <- lm(State_Full ~ Population + Total_Voters, data = Turnout_Rate_2020)
I'm quite new to linear regression but eager to learn. I have looked through StackOverflow for quite a while now and couldn't figure it out.
It would be greatly appreciated if someone here would be able to assist me.
The full script at once:
Turnout_Rate_2020 <- sqldf("SELECT State_Full, F1a AS Total_Voters, population.Pop AS Population FROM e_2020 INNER JOIN population ON population.State = e_2020.State_Full")
# Change all NA to 0
Turnout_Rate_2020[is.na(Turnout_Rate_2020)] <- 0
summary(Turnout_Rate_2020)
# Select all again and filter out states which did not report. (values that were NA)
Turnout_Rate_2020 <- sqldf("SELECT State_Full, Total_Voters, Population FROM Turnout_Rate_2020 WHERE Total_Voters <> 0 AND Total_Voters >= 0 GROUP BY State_Full")
# Does not work and if I turn variables around I get NaN values.
lmTurnoutRate_2020 <- lm(State_Full ~ Population + Total_Voters, data = Turnout_Rate_2020)
summary(lmTurnoutRate_2020)
# Does not work
ggplot(lmTurnoutRate_2020, aes(x=State_Full,y=Population)) + geom_point() + geom_smooth(method=lm, level=0.95) + labs(x = "State", y = "Voters")

1) The input is missing from the question, so we will use mtcars and make cyl a character column. lm cannot handle a character dependent variable, but we can create a 0/1 model matrix from cyl and regress that instead; this performs a separate lm for each level of cyl. It is only applicable when the dependent variable has a small number of levels, as we have here, i.e. when your dependent variable is naturally discrete or has been cut into a few levels.
(Probably in this case we would want logistic regression, as with glm and family=binomial(); ordinal logistic regression, as with polr in MASS or the ordinal package; or multinomial regression, as with multinom in the nnet package. We show it with lm just to demonstrate that it can be done, although it probably shouldn't be, because a dependent variable with only two or three values is not sufficiently gaussian.)
mtcars2 <- transform(mtcars, cyl = as.character(cyl))
lm(model.matrix(~ cyl + 0) ~ hp, mtcars2)
giving:
Call:
lm(formula = model.matrix(~cyl + 0) ~ hp, data = mtcars2)
Coefficients:
                 cyl4       cyl6       cyl8
(Intercept)  1.052957   0.390688  -0.443645
hp          -0.004835  -0.001172   0.006007
With polr (which assumes the levels are ordered as they are with cyl):
library(MASS)
polr(cyl ~ hp, transform(mtcars2, cyl = factor(cyl)))
giving:
Call:
polr(formula = cyl ~ hp, data = transform(mtcars2, cyl = factor(cyl)))
Coefficients:
hp
0.1156849
Intercepts:
4|6 6|8
12.32592 17.25331
Residual Deviance: 20.35585
AIC: 26.35585
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
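Since the parenthetical above also mentions multinom from nnet, here is a minimal sketch of the multinomial fit on the same example (assuming the nnet package is installed):
library(nnet)
# cyl is character in mtcars2; multinom wants a factor response
multinom(factor(cyl) ~ hp, data = mtcars2)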
The other possibility is that your dependent variable merely happens to be represented as character because of how it was created, but could be made numeric with as.numeric(...). We can't tell without the input, but using our example we can do the following, although again it is likely inappropriate because cyl has only 3 distinct values and so does not approximate a gaussian closely enough. Your data may be different, though.
lm(cyl ~ hp, transform(mtcars2, cyl = as.numeric(cyl)))
giving:
Call:
lm(formula = cyl ~ hp, data = transform(mtcars2, cyl = as.numeric(cyl)))
Coefficients:
(Intercept) hp
3.00680 0.02168
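Mapped back to the question, the natural numeric dependent variable would presumably be the turnout rate rather than the state name. A hedged sketch, assuming the column names from the question:
# hypothetical: model the turnout rate instead of using State_Full as y
Turnout_Rate_2020$Rate <- Turnout_Rate_2020$Total_Voters / Turnout_Rate_2020$Population
lm(Rate ~ Population, data = Turnout_Rate_2020)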

Related

Forest plot facet grid comparing regression model coefficients from multiple models

I am currently working with 30 datasets with the same column names but different numeric data. I need to apply a linear mixed model and a generalised linear model to each dataset and plot the resulting fixed-effect coefficients on a forest plot.
The data is currently structured as follows (using the same dataset for every list element for simplicity):
library(lme4)
data_list <- list()
# There's definitely a better way of doing this through lapply(), I just can't
# figure it out (see the sketch after this block)
for (i in 1:30) {
  data_list[[i]] <- tibble::as_tibble(mtcars)  # this would originally load different data at every instance
}
compute_model_lmm <- function(data) {
  lmer(mpg ~ hp + disp + drat + (1 | cyl), data = data)
}
result_list_lmm <- lapply(data_list, compute_model_lmm)
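The lapply() version the comment alludes to could look like this (a minimal sketch with the same behaviour as the loop):
data_list <- lapply(1:30, function(i) tibble::as_tibble(mtcars))  # swap the body for per-dataset loading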
What I am currently doing is
library(modelsummary)
modelplot(result_list_lmm) +
  facet_wrap(~model)  # modelplot() takes arguments/functions from ggplot2
which takes an awful lot of time, but it works.
Now, I would like to compare another model on the same plot, as in
compute_model_glm <- function(data) {
  glm(mpg ~ hp + disp + drat + cyl, data = data)
}
result_list_glm <- lapply(data_list, compute_model_glm)
modelplot(list(result_list_lmm[[1]], result_list_glm[[1]]))
but for all 30 pairs of models at once, not just the first.
How do I specify it to modelplot()?
Thanks in advance!
The modelplot function gives you a few basic ways of plotting coefficients and intervals (check the facet argument, for example).
However, the real power of the function comes from the draw=FALSE argument. In that case, modelplot does the hard work of assembling the estimates into a convenient data frame, with all the renaming, robust standard errors, and other conveniences of modelplot. You can then use that data frame to do the plotting yourself with ggplot2 for unlimited customization.
library(modelsummary)
library(ggplot2)
results_lm <- lapply(1:10, function(x) lm(hp ~ mpg, data = mtcars)) |>
  modelplot(draw = FALSE) |>
  transform("Function" = "lm()")
results_glm <- lapply(1:10, function(x) glm(hp ~ mpg, data = mtcars)) |>
  modelplot(draw = FALSE) |>
  transform("Function" = "glm()")
results <- rbind(results_lm, results_glm)
head(results)
          term   model estimate std.error conf.low conf.high Function
1  (Intercept) Model 1 324.0823   27.4333  268.056  380.1086     lm()
3  (Intercept) Model 2 324.0823   27.4333  268.056  380.1086     lm()
5  (Intercept) Model 3 324.0823   27.4333  268.056  380.1086     lm()
7  (Intercept) Model 4 324.0823   27.4333  268.056  380.1086     lm()
9  (Intercept) Model 5 324.0823   27.4333  268.056  380.1086     lm()
11 (Intercept) Model 6 324.0823   27.4333  268.056  380.1086     lm()
ggplot(results, aes(y = term, x = estimate, xmin = conf.low, xmax = conf.high)) +
  geom_pointrange(aes(color = Function), position = position_dodge(width = .5)) +
  facet_wrap(~model)

using "at" argument of margins function in R for logit model

I want to analyze the marginal effects of continuous and binary variables in a logit model. I am hoping R can report the independent marginal effect of hp at a chosen value (200 in this example), while also reporting the marginal effect of the vs variable equaling 1. I would also like the output table to include the SE, p value, and z score. I am having trouble producing the table, and when I have gotten it to run it doesn't evaluate the two variables independently. Here is an MRE below. Thank you!
mod2 <- glm(am ~ hp + factor(vs), data=mtcars, family=binomial)
margins(mod2)
#> Average marginal effects
#> glm(formula = am ~ hp + factor(vs), family = binomial, data = mtcars)
#> hp vs1
#> -0.00203 -0.03154
#code where I am trying to evaluate at the desired values.
margins(mod2, at=list(hp=200, vs=1))
This is because you've changed vs to a factor.
Consider the following
library(margins)
mod3 <- glm(am ~ hp + vs, data=mtcars, family=binomial)
margins(mod3, at=list(hp=200, vs=1))
# Average marginal effects at specified values
# glm(formula = am ~ hp + vs, family = binomial, data = mtcars)
#
# at(hp) at(vs) hp vs
# 200 1 -0.001783 -0.02803
There is no real reason to turn vs into a factor here; it's dichotomous.
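If you also want the SE, z score, and p value from the question, wrapping the margins object in summary() should produce that table (a sketch; here hp is evaluated at its sample mean rather than at 200):
library(margins)
mod3 <- glm(am ~ hp + vs, data = mtcars, family = binomial)
summary(margins(mod3, at = list(hp = mean(mtcars$hp), vs = 1)))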

How to run a linear regression using lm() on a subset in R after multiple imputation using MICE

I want to run a linear regression analysis on my multiple imputed data. I imputed my dataset using mice. The formula I used to run a linear regression on my whole imputed set is as follows:
mod1 <-with(imp, lm(outc ~ age + sex))
pool_mod1 <- pool(mod1)
summary(pool_mod1)
This works fine. Now I want to subset on BMI: I want to apply this regression analysis to the group of people with a BMI below 30 and to the group with a BMI of 30 or above. I tried the following:
mod2 <-with(imp, lm(outc ~ age + sex), subset=(bmi<30))
pool_mod2 <- pool(mod2)
summary(pool_mod2)
mod3 <-with(imp, lm(outc ~ age + sex), subset=(bmi>=30))
pool_mod3 <- pool(mod3)
summary(pool_mod3)
I do not get an error, but the problem is that all three analyses give me exactly the same results. I thought this could just be the real-life situation; however, if I use variables other than bmi (like blood pressure < 150), the same thing happens.
So my question is: how can I do subset analysis in R when the data is imputed using mice?
(BMI is imputed as well, I do not know if that is a problem?)
You should place subset within lm(), not outside of it.
with(imp, lm(outc ~ age + sex, subset=(bmi<30)))
A reproducible example.
with(mtcars, lm(mpg ~ disp + hp)) # Both produce the same
with(mtcars, lm(mpg ~ disp + hp), subset=(cyl < 6))
Coefficients:
(Intercept) disp hp
30.73590 -0.03035 -0.02484
with(mtcars, lm(mpg ~ disp + hp, subset=(cyl < 6))) # Calculates on the subset
Coefficients:
(Intercept) disp hp
43.04006 -0.11954 -0.04609
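Applied back to the imputed data, the corrected calls would look like this (a sketch, assuming mice is loaded and imp is the mids object from the question):
mod2 <- with(imp, lm(outc ~ age + sex, subset = (bmi < 30)))
pool_mod2 <- pool(mod2)
summary(pool_mod2)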

Setting Different Levels of constants for categorical variables in R

Will anyone be able to explain how to set constants for different levels of categorical variables in R?
I have read the following: How to set the Coefficient Value in Regression; R, and it does a good job of explaining how to set a constant for a categorical variable as a whole. I would like to know how to set one for each level.
As an example, let us look at the MTCARS dataset:
df <- as.data.frame(mtcars)
df$cyl <- as.factor(df$cyl)
set.seed(1)
glm(mpg ~ cyl + hp + gear, data = df)
This gives me the following output:
Call: glm(formula = mpg ~ cyl + hp + gear, data = df)
Coefficients:
(Intercept) cyl6 cyl8 hp gear
19.80268 -4.07000 -2.29798 -0.05541 2.79645
Degrees of Freedom: 31 Total (i.e. Null); 27 Residual
Null Deviance: 1126
Residual Deviance: 219.5 AIC: 164.4
If I wanted to set cyl6 to -0.34 and cyl8 to -1.4, and then rerun the model to see how that affects the other variables, how would I do that?
I think this is what you can do:
df$mpgCyl <- df$mpg
# adjust the response by the constant for each level
# (watch the sign: subtracting 0.34 fixes the cyl6 effect at +0.34;
#  to fix it at -0.34 as asked, add 0.34 instead, i.e. subtract the desired coefficient)
df$mpgCyl[df$cyl == 6] <- df$mpgCyl[df$cyl == 6] - 0.34
df$mpgCyl[df$cyl == 8] <- df$mpgCyl[df$cyl == 8] - 1.4
model2 <- glm(mpgCyl ~ hp + gear, data = df)
> model2
Call: glm(formula = mpgCyl ~ hp + gear, data = df)
Coefficients:
(Intercept) hp gear
16.86483 -0.07146 3.53128
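An equivalent way to fix the level-specific constants without modifying the response is an offset term in the formula, where the fixed coefficients enter with exactly the signs requested (a sketch, not run against the output above):
# hypothetical alternative: fix cyl6 at -0.34 and cyl8 at -1.4 via an offset
model3 <- glm(mpg ~ hp + gear + offset(-0.34 * (cyl == "6") - 1.4 * (cyl == "8")), data = df)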
UPDATE with the comments:
cyl is a factor, so by default it contributes to the glm as a shift in the intercept (a contrast), not as a slope. The reference level cyl==4 is 'hidden' but present in the glm as well. So what your first glm says is:
1) for cyl==4: mpg = 19.80 - 0.055*hp + 2.79*gear
2) for cyl==6: mpg = (19.80 - 4.07) - 0.055*hp + 2.79*gear
3) for cyl==8: mpg = (19.80 - 2.29) - 0.055*hp + 2.79*gear
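To see the dummy coding explicitly, you can inspect the design matrix (a quick sketch):
head(model.matrix(~ cyl + hp + gear, data = df))  # columns: (Intercept), cyl6, cyl8, hp, gear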
Maybe you can also check https://stats.stackexchange.com/questions/213710/level-of-factor-taken-as-intercept and the question "Is there any way to fit a `glm()` so that all levels are included (i.e. no reference level)?"
Hope this helps

how to interpret coefficients in a regression with two categorical variables (unordered or ordered factors)

I'm a little confused about how to interpret the coefficients in a multiple regression with two categorical variables; I'll use the mtcars dataset as an example. According to some online sources and books, the coefficient of a level of one categorical variable is the difference in means between that level and the reference level, given that the other variable is at its reference level. In this example, according to the aggregated result below, the coefficient of factor(vs)1 should then be 81.8 - 91.0 = -9.2, but it is -13.92. Those claims seem to be wrong.
Can someone clarify this? How should the coefficients be interpreted in terms of 'mean difference'?
fit <- lm(hp ~ factor(vs) + factor(cyl), data = df)
Call:
lm(formula = hp ~ factor(vs) + factor(cyl), data = df)
Coefficients:
(Intercept) factor(vs)1 factor(cyl)6 factor(cyl)8
95.29 -13.92 34.95 113.93
# then the mean of hp at different levels of vs and cyl
aggregate(hp ~ vs + cyl, df, mean)
  vs cyl        hp
1  0   4  91.0000
2  1   4  81.8000
3  0   6 131.6667
4  1   6 115.2500
5  0   8 209.2143
My second question is:
what if we treat those categorical variables as ordered factors? Then there are linear and quadratic terms for those factors. How should I interpret those coefficients?
lm(hp ~ factor(vs, ordered = TRUE) + factor(cyl, ordered = TRUE), data = df)
Call:
lm(formula = hp ~ factor(vs, ordered = TRUE) + factor(cyl, ordered = TRUE),
data = df)
Coefficients:
(Intercept) factor(vs, ordered = TRUE).L
137.96 -9.84
factor(cyl, ordered = TRUE).L factor(cyl, ordered = TRUE).Q
80.56 17.97
Thank you very much in advance.
Regarding the first question: if cyl is at its reference level and vs is at level 1, then the mean they are referring to is 95.29 - 13.92 + 0, and when vs and cyl are both at their reference levels the mean is 95.29 + 0 + 0, so -13.92 is the difference between those two means.
By mean they are referring to the expected value of y, which is estimated by the predicted value. If we write the regression equation as y = terms + residuals, then the expected value of y equals the terms, i.e.
E(y) = E(terms + residuals)
     = E(terms) + E(residuals)
     = terms + 0    <- because terms is not random and residuals have mean 0
     = terms
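A quick numeric check (a minimal sketch, assuming df is mtcars): predict the model at the two settings and take the difference.
fit <- lm(hp ~ factor(vs) + factor(cyl), data = mtcars)
nd  <- data.frame(vs = c(0, 1), cyl = c(4, 4))
predict(fit, nd)        # 95.29 and 95.29 - 13.92 = 81.37
diff(predict(fit, nd))  # -13.92, the factor(vs)1 coefficient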
Regarding the second question, which asks about ordered factors: they are rarely used, and I would ignore their existence for linear models. In the book Introductory Statistics with R, Peter Dalgaard mentions that the implementation in R assumes the levels are equidistant, an assumption that is questionable in general.
