Forest plot facet grid comparing regression model coefficients from multiple models

I am currently working with 30 datasets that share the same column names but contain different numeric data. I need to fit a linear mixed model and a generalised linear model to each dataset and plot the resulting fixed-effect coefficients on a forest plot.
The data is currently structured as follows (using the same dataset for every list element for simplicity):
library(lme4)
data_list <- list()
# There's definitely a better way of doing this through lapply(), I just can't figure out how
for (i in 1:30) {
  data_list[[i]] <- tibble::as_tibble(mtcars) # this would originally load different data at every instance
}
compute_model_lmm <- function(data) {
  lmer("mpg ~ hp + disp + drat + (1|cyl)", data = data)
}
result_list_lmm <- lapply(data_list, compute_model_lmm)
What I am currently doing is
library(modelsummary)
modelplot(result_list_lmm) +
  facet_wrap(~model) # modelplot() takes arguments/functions from ggplot2
which takes an awfully long time, but it works.
Now, I would like to compare another model on the same plot, as in
compute_model_glm <- function(data) {
  glm("mpg ~ hp + disp + drat + cyl", data = data)
}
result_list_glm <- lapply(data_list, compute_model_glm)
modelplot(list(result_list_lmm[[1]], result_list_glm[[1]]))
but repeated for every dataset in the list, each in its own facet.
How do I specify this to modelplot()?
Thanks in advance!

The modelplot function gives you a few basic ways of plotting coefficients and intervals (check the facet argument, for example).
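For instance, faceting can be requested directly (a minimal sketch; check ?modelplot for exactly how the panels are arranged in your version):
modelplot(result_list_lmm, facet = TRUE)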
However, the real power of the function comes from the draw=FALSE argument. In that case, modelplot does the hard work of assembling the estimates into a convenient data frame, with all the renaming, robust standard errors, and other conveniences of modelplot. You can then use that data frame to do the plotting yourself with ggplot2 for unlimited customization.
library(modelsummary)
library(ggplot2)
results_lm <- lapply(1:10, function(x) lm(hp ~ mpg, data = mtcars)) |>
  modelplot(draw = FALSE) |>
  transform("Function" = "lm()")
results_glm <- lapply(1:10, function(x) glm(hp ~ mpg, data = mtcars)) |>
  modelplot(draw = FALSE) |>
  transform("Function" = "glm()")
results <- rbind(results_lm, results_glm)
head(results)
          term   model estimate std.error conf.low conf.high Function
1  (Intercept) Model 1 324.0823   27.4333  268.056  380.1086     lm()
3  (Intercept) Model 2 324.0823   27.4333  268.056  380.1086     lm()
5  (Intercept) Model 3 324.0823   27.4333  268.056  380.1086     lm()
7  (Intercept) Model 4 324.0823   27.4333  268.056  380.1086     lm()
9  (Intercept) Model 5 324.0823   27.4333  268.056  380.1086     lm()
11 (Intercept) Model 6 324.0823   27.4333  268.056  380.1086     lm()
ggplot(results, aes(y = term, x = estimate, xmin = conf.low, xmax = conf.high)) +
geom_pointrange(aes(color = Function), position = position_dodge(width = .5)) +
facet_wrap(~model)
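Applied to the objects from the original question, the same pattern would look roughly like this (a sketch; dplyr::bind_rows() is used instead of rbind() in case the two tidy data frames do not share exactly the same columns):
library(dplyr)
# tidy coefficient data for each list of models, tagged by model family
draws_lmm <- transform(modelplot(result_list_lmm, draw = FALSE), Function = "lmer()")
draws_glm <- transform(modelplot(result_list_glm, draw = FALSE), Function = "glm()")
draws <- bind_rows(draws_lmm, draws_glm)
# one facet per dataset, both model families dodged within each facet
ggplot(draws, aes(y = term, x = estimate, xmin = conf.low, xmax = conf.high)) +
  geom_pointrange(aes(color = Function), position = position_dodge(width = .5)) +
  facet_wrap(~model)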

Related

How do I combine fitted models on imputed data into a usable model for new predictions?

I'm performing predictive analysis where I train a model to a portion of my data and test the model with the remaining portion. I'm familiar with the MICE package and the imputation procedure using predictive mean matching.
My understanding is that the proper way to utilize imputation is to create numerous imputed data sets, fit a model to each of those imputed data sets, then combine the coefficients across all of those fitted models into one single model. I know how to do this and view the summary of the coefficients with which I can perform inference on the variables. However, that is not my objective; I need to end up with a single model that I can use to predict new values.
Simply put, when I try to use the predict function with the model I got from MICE, it doesn't work.
Any suggestions? I am coding this in R.
Edit: using the airquality data set as an example, my code looks like this:
imputed_data <- mice(airquality, method = c(rep("pmm", 6)), m = 5, maxit = 5)
model <- with(imputed_data, lm(Ozone ~ Solar.R + Wind + Temp + Month + Day))
pooled_model <- pool(model)
This gives me a pooled model across my 5 imputed data sets. However, I am unable to use the predict function with this model. When I then execute:
predict(pooled_model, newdata = airquality)
I get this error:
Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "c('mira', 'matrix')"
Not sure exactly what you're looking for, but something like this might work:
library(mice)
library(mitools)
data(mtcars)
mtcars$qsec[c(4,6,8,21)] <- NA
imps <- mice(mtcars, m=10)
comps <- lapply(1:imps$m, function(i) complete(imps, i))
mods <- lapply(comps, function(x) lm(qsec ~ hp + drat + wt, data = x))
pmod <- MIcombine(mods)
pmod$coefficients
#> (Intercept) hp drat wt
#> 18.15389098 -0.02570887 0.11434023 0.92348390
newvals <- data.frame(hp=300, drat=4, wt=2.58)
X <- model.matrix(~hp + drat + wt, data=newvals)
preds <- X %*% pmod$coefficients
preds
#> [,1]
#> 1 13.28118
Created on 2023-02-01 by the reprex package (v2.0.1)
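If you also need uncertainty for those predictions, a rough sketch is to propagate the pooled variance-covariance matrix through the same design matrix (assuming vcov() returns the pooled matrix for the MIcombine result; this reflects coefficient uncertainty only and ignores residual prediction error):
# delta-method style standard errors for the linear predictor
se <- sqrt(diag(X %*% vcov(pmod) %*% t(X)))
cbind(fit = drop(preds),
      lwr = drop(preds) - 1.96 * se,
      upr = drop(preds) + 1.96 * se)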

Should I provide x or log(x) to predict() if my model is y ~ log(x)?

I have a fitted lm model
log_log_model = lm(log(price) ~ log(carat), data = diamonds)
I want to predict price using this model, but I'm not sure whether I should enter log(carat) or the raw carat value as the predictor in the predict() function.
Choice 1
exp(predict(log_log_model, data.frame(carat = log(3)),
interval = 'predict', level = 0.99))
Choice 2
exp(predict(log_log_model, data.frame(carat = 3),
interval = 'predict', level = 0.99))
Which one is correct?
Choice 2 is correct.
To give you some extra confidence, let's inspect what the design matrix looks like when we make the prediction.
## for diamonds dataset
library(ggplot2)
## log-log linear model
fit <- lm(log(price) ~ log(carat), data = diamonds)
## for prediction
newdat <- data.frame(carat = 3)
## evaluate the design matrix for prediction
Xp <- model.matrix(delete.response(terms(fit)), data = newdat)
Xp
#   (Intercept) log(carat)
# 1           1   1.098612
See it? carat = 3 is automatically evaluated to log(carat) = log(3).
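By contrast, Choice 1 would apply the log twice: with carat = log(3), the model term log(carat) evaluates to log(log(3)), roughly 0.094, so the prediction would be made at the wrong point. A quick check using the same sketch as above:
## what Choice 1 actually feeds into the model
model.matrix(delete.response(terms(fit)), data = data.frame(carat = log(3)))
# the log(carat) column is now about 0.094 instead of 1.098612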

How to calculate confidence intervals for predictive margins/means of predicted values with a logistic regression model

I am not sure if this is a question for Stack Overflow or Cross Validated, as it contains both an R-specific coding part and a general statistics part.
In a project we want to use predictive margins, or more general, means of predicted values. The model used for prediction is a logistic regression model. The data comes from multiple surveys conducted at different times without clusters and comes with weights for known characteristics in the population. The data is weighted for each timespan.
As the data will grow over time, we don't want to do the prediction from the data used for modelling but with a dataframe containing the information of the population.
Thanks to this answer I know how to calculate the confidence intervals for every observation.
However, I want to get the mean probability for different groups. For the probabilities I can just calculate the mean of the individual probabilities, but how do I get the right confidence intervals?
It seems that marginpred and svypredmeans from the survey package by Thomas Lumley do some of the things I want, but they don't allow the prediction to be made on new data.
Here is some code and data to show the approach. Please consider that in my real use case the data I use for predictions is not the same I use for modelling.
#libraries
library(dplyr)
# Data for modelling
data(mtcars)
# Get a weight column because in my real use case the data has population weights
mtcars["weight"] <- runif(nrow(mtcars), 500, 1000)
numtofac <- c("cyl", "vs", "am", "gear", "carb")
mtcars[numtofac] <- lapply(numtofac, function(x) factor(mtcars[[x]]))
mtcars["mpg20"] <- ifelse(mtcars$mpg >=20, 1, 0)
# Get prediction data (standardizes for "disp")
# In my real use case the data for prediction is not created from my data for modelling
mtcars_vs0 <- mtcars
mtcars_vs0["vs"] <- factor(0, levels=c(0,1))
mtcars_vs1 <- mtcars
mtcars_vs1["vs"] <- factor(1, levels=c(0,1))
pred_mtcars <- rbind(mtcars_vs0, mtcars_vs1)[c("vs", "disp", "weight")]
# Logistic Regression Model
glm_mpg20 <- glm(mpg20 ~ vs*disp, family = binomial(link = "logit"), data=mtcars, weights = mtcars$weight)
# Prediction on logit scale
preds <- predict(glm_mpg20, newdata=pred_mtcars, type = "link", se.fit = TRUE)
#Calculate CIs for fitted values
critval <- 1.96 ## approx 95% CI
upr <- preds$fit + (critval * preds$se.fit)
lwr <- preds$fit - (critval * preds$se.fit)
fit <- preds$fit
# Transform from logit to probability scale
fit2 <- glm_mpg20$family$linkinv(fit)
upr2 <- glm_mpg20$family$linkinv(upr)
lwr2 <- glm_mpg20$family$linkinv(lwr)
# combine fitted values and CIs in one dataframe
fitted_mpg20 <- data.frame(probability = fit2,
                           lwr = lwr2,
                           upr = upr2)
# bind CIs and data for prediction
predicted_data <- cbind(pred_mtcars, fitted_mpg20)
# calculate mean probability for vs=0 and vs=1
mean_prop <- predicted_data %>%
  group_by(vs) %>%
  summarise(mean_probability = sum(probability * weight) / sum(weight))
Thank you very much for your help, and let me know if you need anything else from me.
You can use the predictions() function from the marginaleffects package (disclaimer: I am the author). Here is a minimal example:
library(marginaleffects)
mod <- glm(am ~ hp + mpg, data = mtcars, family = binomial)
predictions(mod, newdata = datagrid())
## rowid type predicted std.error conf.low conf.high hp mpg
## 1 1 response 0.4441406 0.1413224 0.206472 0.7104511 146.6875 20.09062
By default, the datagrid() function will create a data frame with each
variable set to its mean, but you can pick arbitrary values:
predictions(mod, newdata = datagrid(mpg = 30, hp = c(100, 120)))
## rowid type predicted std.error conf.low conf.high mpg hp
## 1 1 response 0.9999380 0.0002845696 0.6674229 1 30 100
## 2 2 response 0.9999794 0.0001046887 0.6992366 1 30 120
Or feed a data frame:
nd <- data.frame(mpg = 30, hp = c(100, 140))
predictions(mod, newdata = nd)
## rowid type predicted std.error conf.low conf.high mpg hp
## 1 1 response 0.9999380 0.0002845696 0.6674229 1 30 100
## 2 2 response 0.9999931 0.0000382223 0.7255411 1 30 140
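Since the question is ultimately about the mean predicted probability per group with a confidence interval, recent versions of marginaleffects can also average predictions over a grid by group. A sketch using the objects from the question (assuming avg_predictions() and its by argument are available in your version; the population weights would still need to be handled separately, e.g. via the package's weighting options):
library(marginaleffects)
# mean predicted probability of mpg20 for vs = 0 and vs = 1,
# averaged over the prediction data, with delta-method confidence intervals
avg_predictions(glm_mpg20, newdata = pred_mtcars, by = "vs")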

Multiple Linear Regression with character as dependent variable

I'm currently trying to perform a multiple linear regression on the voter turnout per state in the 2020 Presidential Election.
To create this regression model I would like to use the following variables: State, Total_Voters and Population.
When I try to run my linear regression I get the following error:
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : NA/NaN/Inf in 'y'
The dataset I've gathered is quite large. I have created a new dataframe with the variables which I need as follows:
Turnout_Rate_2020 <- sqldf("SELECT State_Full, F1a AS Total_Voters, population.Pop AS Population FROM e_2020 INNER JOIN population ON population.State = e_2020.State_Full")
After that I remove all NA values:
Turnout_Rate_2020[is.na(Turnout_Rate_2020)] <- 0
After that I filter through the dataframe once more and filter out all the states which did not report:
Turnout_Rate_2020 <- sqldf("SELECT State_Full, Total_Voters, Population FROM Turnout_Rate_2020 WHERE Total_Voters <> 0 AND Total_Voters >= 0 GROUP BY State_Full")
(The resulting dataframe and its summary were shown as screenshots in the question, omitted here.)
However when I now try to run my multiple linear regression I get the error I have showcased above. The command looks like this:
lmTurnoutRate_2020 <- lm(State_Full ~ Population + Total_Voters, data = Turnout_Rate_2020)
I'm quite new to linear regression, but I'm eager to learn. I have looked through Stack Overflow for quite a while now and couldn't figure it out.
It would be greatly appreciated if someone here would be able to assist me.
The full script at once:
Turnout_Rate_2020 <- sqldf("SELECT State_Full, F1a AS Total_Voters, population.Pop AS Population FROM e_2020 INNER JOIN population ON population.State = e_2020.State_Full")
# Change all NA to 0
Turnout_Rate_2020[is.na(Turnout_Rate_2020)] <- 0
summary(Turnout_Rate_2020)
# Select all again and filter out states which did not report. (values that were NA)
Turnout_Rate_2020 <- sqldf("SELECT State_Full, Total_Voters, Population FROM Turnout_Rate_2020 WHERE Total_Voters <> 0 AND Total_Voters >= 0 GROUP BY State_Full")
# Does not work and if I turn variables around I get NaN values.
lmTurnoutRate_2020 <- lm(State_Full ~ Population + Total_Voters, data = Turnout_Rate_2020)
summary(lmTurnoutRate_2020)
# Does not work
ggplot(lmTurnoutRate_2020, aes(x=State_Full,y=Population)) + geom_point() + geom_smooth(method=lm, level=0.95) + labs(x = "State", y = "Voters")
1) The input is missing from the question, so we will use mtcars and make cyl a character column. lm cannot handle that directly, but we can create a 0/1 model matrix from cyl and regress that, which performs a separate lm for each level of cyl. This is only applicable if the dependent variable has a small number of levels, as it does here, either naturally or because it has been cut into a small number of levels.
(In this case we probably want logistic regression, via glm with family=binomial(), ordinal logistic regression, via polr in MASS or the ordinal package, or a multinomial model, via multinom in the nnet package; a multinom sketch is shown after the polr output below. We show it with lm here just to demonstrate that it can be done, although it probably shouldn't be, because a 0/1 indicator response is not sufficiently gaussian.)
mtcars2 <- transform(mtcars, cyl = as.character(cyl))
lm(model.matrix(~ cyl + 0) ~ hp, mtcars2)
giving:
Call:
lm(formula = model.matrix(~cyl + 0) ~ hp, data = mtcars2)
Coefficients:
cyl4 cyl6 cyl8
(Intercept) 1.052957 0.390688 -0.443645
hp -0.004835 -0.001172 0.006007
With polr (which assumes the levels are ordered as they are with cyl):
library(MASS)
polr(cyl ~ hp, transform(mtcars2, cyl = factor(cyl)))
giving:
Call:
polr(formula = cyl ~ hp, data = transform(mtcars2, cyl = factor(cyl)))
Coefficients:
hp
0.1156849
Intercepts:
4|6 6|8
12.32592 17.25331
Residual Deviance: 20.35585
AIC: 26.35585
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
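The multinom alternative mentioned above, which does not assume the levels are ordered, would look roughly like this (a sketch):
library(nnet)
# multinomial logistic regression with cyl treated as an unordered factor
multinom(cyl ~ hp, data = transform(mtcars2, cyl = factor(cyl)))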
2) The other possibility is that your dependent variable just happens to be represented as character because of how it was created, but could be made numeric with as.numeric(...). We can't tell without the input, but using our example we can do this, although again it is likely inappropriate because cyl takes only 3 distinct values and so does not approximate a gaussian closely enough. Your data may be different, though.
lm(cyl ~ hp, transform(mtcars2, cyl = as.numeric(cyl)))
giving:
Call:
lm(formula = cyl ~ hp, data = transform(mtcars2, cyl = as.numeric(cyl)))
Coefficients:
(Intercept) hp
3.00680 0.02168

Getting the same results for two different models in glm() in RStudio

I'm new to R and am having some trouble with the glm() function.
I have some data, which I have shown below. When the linear predictor is just x, the glm() function works fine, but as soon as I change the linear predictor to x + x^2, it starts giving me the same results that I got for the first model.
The code is as follows:
model1 <- glm(y ~ x, data=data1, family=poisson (link="log"))
coef(model1)
(Intercept) x
0.3396339 0.2565236
model2 <- glm(y ~ x + x^2, data=data1, family=poisson (link="log"))
coef(model2)
(Intercept) x
0.3396339 0.2565236
As you can see there's no coefficient for x^2 as if it's not even in the model.
The lm and glm functions have a special interpretation of the model formula (see ?formula), which can be confusing if you are not expecting it. In a formula, ^ denotes interaction crossing rather than arithmetic power: (w + x)^2 expands to w + x + w:x (both main effects plus their interaction), and x^2 on its own collapses to just x, which is why your second model is identical to the first. To get the literal arithmetic meaning, wrap the term in the identity function I().
model2 <- glm(gear ~ disp + I(disp^2),
data = mtcars, family = poisson (link = "log"))
coef(model2)
# (Intercept) disp I(disp^2)
# 1.542059e+00 -1.248689e-03 6.578518e-07
Put another way, I() lets you perform transformations within the call to glm. The following is equivalent:
mtcars1 <- mtcars
mtcars1$disp_sq <- mtcars1$disp^2
model2a <- glm(gear ~ disp + disp_sq,
data = mtcars1, family = poisson (link = "log"))
coef(model2a)
# (Intercept) disp disp_sq
# 1.542059e+00 -1.248689e-03 6.578518e-07
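A closely related alternative, assuming a raw (non-orthogonal) polynomial is what you want, is poly() with raw = TRUE, which fits the same model:
model2b <- glm(gear ~ poly(disp, 2, raw = TRUE),
               data = mtcars, family = poisson(link = "log"))
coef(model2b)
# same estimates as model2; only the coefficient names differ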
