Multinomial regression table using gtsummary; how to get rid of "NA" row? - r

I have run into the same issue that a user described in a previous post: with a multinom model (nnet), tbl_regression (gtsummary) shows an extra "NA" row above the glance table.
The replies to that question asked for reproducible code, so here it is:
library(nnet)
library(gtsummary)
library(tidyverse)
# Create a sample data frame
set.seed(123)
df <- data.frame(
y = sample(c(0, 1, 2), 100, replace = TRUE),
x1 = rnorm(100),
x2 = rnorm(100),
x3 = rnorm(100)
)
# Fit a multinomial logistic regression model with nnet
model <- nnet::multinom(y ~ x1 + x2 + x3, data = df)
# Create a summary table with tbl_regression
model_tab <- tbl_regression(model,
exponentiate = TRUE) %>%
add_glance_table(c(nobs, AIC))
model_tab
I suspect the NA row has to do with tbl_regression producing an "empty model" for the NAs in the dependent variable. When I used Daniel Sjoberg's function to display a multinom model in wide format here, I noted that tbl_regression produced an additional empty model column for the value "NA" of my dependent variable, to the right of my table. I tried to use na.action = na.omit in multinom, to no avail.
So, perhaps tbl_regression is just too buggy for multinom models and I have to shift to another table-producing package. Nonetheless, if anyone has a clue how to avoid the NA issue, I would be happy to continue using the otherwise very useful gtsummary package.

Related

How do I utilize imputed data, with categorical levels, in a prediction in R?

I'll illustrate my problem with the iris data set in R. My objective here is to create 5 imputed data sets, fit a regression to each imputed data set, then pool together the results of these regressions into one final model. This is the preferred order of operations for a proper execution of multiple imputation.
library(mice)
df <- iris
# Inject some missingness into the data:
df$Sepal.Width[c(20,40,70,121)] <- NA
df$Species[c(15,80,99,136)] <- NA
# Perform the standard steps of multiple imputation with MICE:
imputed_data <- mice(df, method = c(rep("pmm", 5)), m = 5, maxit = 5)
model <- with(imputed_data, lm(Sepal.Length ~ Sepal.Width + Species))
pooled_model <- pool(model)
This leaves me with this pooled_model object which I am hoping to use as a fitted model in the predict command. However, that does not work. When I run:
predict(pooled_model, newdata = iris)
I get this error:
Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "c('mipo', 'data.frame')"
Disregard the reasoning why I am using the original iris data set in my newly fitted model; I simply want to be able to fit this data, or a subset of it, onto the model I created with my imputation.
I specifically chose a data set with multiple levels of a categorical variable to highlight my problem. I thought about using some matrix multiplication with which I could do this manually, but the presence of a categorical variable makes that tough. In my actual data set, I have over a hundred variables, many of which have multiple categorical levels. I say this because I realize one possible solution would be to re-code my categorical variables into dummy variables, and then I can apply some matrix multiplication to get my answer. But that would be an EXTREME amount of work for me. If there's a way I can somehow get a model object I can use in the predict function, that would make my life 100x easier.
Any suggestions?
You have two issues: 1) how to use stats::predict with pooled data and 2) what to do about your categorical variables.
Your first issue has already been documented on the mice GitHub page, and it seems there has been a desire for a predict.mira function for a while. The author of the mice package posted some code on how to simulate a predict.mira-like function. Unfortunately, it only works with lm models, but that seems okay considering your reprex. If you have a GitHub account, you can comment on that GitHub issue to demonstrate your interest in the predict.mira function.
Your question also has been posted on StackOverflow before; although the answer was never accepted, the SO user suggested this reading by Miles (2015).
For your second question, have you considered leaving out your current method argument when using mice()? As long as your variables have been classed as factors, then mice will default to the polyreg method for categorical variables and pmm for continuous variables. You can read more about the method argument here.
library(mice)
set.seed(123)
# make missing data
df <- iris
df$Sepal.Width[c(20,40,70,121)] <- NA
df$Species[c(15,80,99,136)] <- NA
# specify method
meth <- mice(df, maxit = 0, printFlag = FALSE)$meth
print(meth)
# this is how you would change your methods, if you wanted
# but pmm and polyreg are defaults
meth["Species"] <- "polr"
meth["Sepal.Width"] <- "midastouch"
print(meth)
# impute
imputed_data <- mice(df,
m = 5,
maxit = 5,
method = meth, # new method
printFlag = FALSE)
# make model
model <- with(imputed_data, lm(Sepal.Length ~ Sepal.Width + Species))
summary(pool(model))
# obtain predictions Q and prediction variance U
predm <- lapply(getfit(model), predict, se.fit = TRUE)
Q <- sapply(predm, `[[`, "fit")
U <- sapply(predm, `[[`, "se.fit")^2
dfcom <- predm[[1]]$df
# pool predictions
pred <- matrix(NA, nrow = nrow(Q), ncol = 3,
dimnames = list(NULL, c("fit", "se.fit", "df")))
for(i in 1:nrow(Q)) {
pi <- pool.scalar(Q[i, ], U[i, ], n = dfcom + 1)
pred[i, 1] <- pi[["qbar"]]
pred[i, 2] <- sqrt(pi[["t"]])
pred[i, 3] <- pi[["df"]]
}
head(pred)
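The code above pools predictions for the rows used in fitting. To predict for other rows, as the question asks, the same steps appear to carry over if newdata is passed through lapply() to predict(). A minimal sketch, assuming the mice package is installed; new_rows and pooled are illustrative names, not part of mice:

```r
library(mice)
set.seed(123)

# Same setup as above: iris with some injected missingness
df <- iris
df$Sepal.Width[c(20, 40, 70, 121)] <- NA
df$Species[c(15, 80, 99, 136)] <- NA

imputed_data <- mice(df, m = 5, maxit = 5, printFlag = FALSE)
model <- with(imputed_data, lm(Sepal.Length ~ Sepal.Width + Species))

# Rows to predict for (hypothetical choice: one flower per species)
new_rows <- iris[c(1, 51, 101), ]

# Predict from each of the m fitted models, keeping standard errors
predm <- lapply(getfit(model), predict, newdata = new_rows, se.fit = TRUE)
Q <- sapply(predm, `[[`, "fit")        # one column per imputation
U <- sapply(predm, `[[`, "se.fit")^2   # squared standard errors
dfcom <- predm[[1]]$df                 # residual df of the complete-data model

# Pool each row's predictions across imputations
pooled <- t(sapply(seq_len(nrow(Q)), function(i) {
  p <- pool.scalar(Q[i, ], U[i, ], n = dfcom + 1)
  c(fit = p[["qbar"]], se.fit = sqrt(p[["t"]]), df = p[["df"]])
}))
pooled
```

With a single new row, sapply() returns a vector rather than a matrix, so the row indexing would need adjusting in that case.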

sjt.lmer displaying incorrect p-values

I've just noticed that sjt.lmer tables are displaying incorrect p-values, e.g., p-values that do not reflect the model summary. This appears to be a new-ish issue, as this worked fine last month?
Using the provided data and code in the package vignette
library(sjPlot)
library(sjmisc)
library(sjlabelled)
library(lme4)
library(sjstats)
# load sample data
data(efc)
# prepare grouping variables
efc$grp = as.factor(efc$e15relat)
levels(x = efc$grp) <- get_labels(efc$e15relat)
efc$care.level <- rec(efc$n4pstu, rec = "0=0;1=1;2=2;3:4=4",
val.labels = c("none", "I", "II", "III"))
# data frame for fitted model
mydf <- data.frame(
neg_c_7 = efc$neg_c_7,
sex = to_factor(efc$c161sex),
c12hour = efc$c12hour,
barthel = efc$barthtot,
education = to_factor(efc$c172code),
grp = efc$grp,
carelevel = to_factor(efc$care.level)
)
# fit sample models
fit1 <- lmer(neg_c_7 ~ sex + c12hour + barthel + (1 | grp), data = mydf)
summary(fit1)
p_value(fit1, p.kr = TRUE)
model summary
p_value summary
sjt.lmer output does not show these p-values??
Note that the first summary comes from a model fitted with lmerTest, which computes p-values with df based on Satterthwaite approximation (see first line in output).
p_value(), however, with p.kr = TRUE, uses the Kenward-Roger approximation from package pbkrtest, which is a bit more conservative.
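One way to see the two approximations side by side is the summary() method in lmerTest, which accepts a ddf argument. This is only a sketch: it assumes lmerTest and pbkrtest are installed, and it uses lme4's sleepstudy data rather than the efc data from the question.

```r
library(lmerTest)  # its lmer() wraps lme4::lmer() and adds p-values

data("sleepstudy", package = "lme4")
fit <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)

# Default: Satterthwaite degrees of freedom
sat <- coef(summary(fit))

# Kenward-Roger degrees of freedom (computed via pbkrtest)
kr <- coef(summary(fit, ddf = "Kenward-Roger"))

# Compare denominator df and p-values for each fixed effect
cbind(df_sat = sat[, "df"], p_sat = sat[, "Pr(>|t|)"],
      df_kr  = kr[, "df"],  p_kr  = kr[, "Pr(>|t|)"])
```

If the p-values in a table differ from those in summary(), checking which of the two methods each function uses is a reasonable first step.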
Your output from sjt.lmer() seems to be messed up somehow, and I can't reproduce it with your example. My output looks ok:

How to get a model matrix from clmm objects?

I want to estimate a multilevel ordered logistic model and afterwards access the model matrix. When running a simplified example from ?clmm:
library("ordinal")
mod1 <- clmm(SURENESS ~ PROD + (1|RESP), data = soup)
model.matrix(mod1)
I get the error message Error in eval(predvars, data, env) : object 'SURENESS' not found. Based on other packages, I expected that an argument such as model = TRUE would store the input data in the fitted model, but here all the relevant arguments already seem to be set that way by default. Did I miss an argument or an element of mod1? (I went through attributes(mod1) but did not find a model matrix.)
Strangely if I set a random data.frame, it works:
set.seed(123)
df <- data.frame(y = factor(sample(c("A", "B", "C"), size = 1000, replace = TRUE), ordered = TRUE),
x = rnorm(1000),
id = factor(rep(1:10, each = 100)))
mod2 <- clmm(y ~ 1 + x + (1|id), data = df)
model.matrix(mod2)
So what's the difference between mod1 and mod2 and how do I get a model.matrix from mod1?
I do not think model.matrix(mod2) works for clmm objects. However, you can try to build a parallel model for the fixed effects part using functions like 'polr' and apply model.matrix() to the output object. The random-effects part can be fixed separately by using the clmm output.
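As a sketch of the fixed-effects part of that suggestion: base R's model.matrix() can build the design matrix directly from a one-sided formula and the data, without going through the fitted clmm object. The data frame and the names dat, X and X_no_int below are made up for illustration. The result includes an intercept column, whereas cumulative link models use thresholds in place of a single intercept, so that column may need to be dropped.

```r
set.seed(123)

# Invented data with an ordered response, two predictors and a grouping factor
dat <- data.frame(
  y  = factor(sample(c("A", "B", "C"), 60, replace = TRUE), ordered = TRUE),
  x  = rnorm(60),
  g  = factor(sample(c("ctrl", "trt"), 60, replace = TRUE),
              levels = c("ctrl", "trt")),
  id = factor(rep(1:6, each = 10))
)

# Fixed-effects design matrix built from the formula's right-hand side only
X <- model.matrix(~ x + g, data = dat)
head(X)

# Same matrix without the intercept column
X_no_int <- X[, colnames(X) != "(Intercept)", drop = FALSE]
head(X_no_int)
```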

R logistic regression model.matrix

I am new to R and I am trying to understand a solution to a logistic regression problem. So far, all it does is remove unused variables and split the data into train and test datasets. I am trying to understand the part that uses model.matrix. I am just getting into R and statistics, and I am not sure what model.matrix is or what contrasts are. Here is the code:
## create design matrix; indicators for categorical variables (factors)
Xdel <- model.matrix(delay~.,data=DataFD_new)[,-1]
xtrain <- Xdel[train,]
xnew <- Xdel[-train,]
ytrain <- del$delay[train]
ynew <- del$delay[-train]
m1=glm(delay~.,family=binomial,data=data.frame(delay=ytrain,xtrain))
summary(m1)
Can someone please explain what model.matrix is used for? Why can't we directly create dummy variables for the categorical variables and put them in glm? I am confused.
Marius' comment explains how to do this - the below code just gives an example (which I felt was helpful since the poster was still confused).
# Create example dataset. 'catvar' represents a categorical variable despite being coded with numbers.
X = data.frame("catvar" = sample(c(1, 2, 3), 100, replace = TRUE),
"numvar" = rnorm(100),
"y" = sample(c(0, 1), 100, replace = TRUE))
# Check whether your categorical variables are coded correctly. (They'll say 'factor' if so)
sapply(X, class) # catvar is coded as 'numeric', which is wrong.
# Tell 'R' that catvar is categorical. If your categorical variables are already classed as factors, you can skip this step
X$catvar = factor(X$catvar)
sapply(X, class) # check all variables are coded correctly
# Fit model to dataframe (i.e. without needing to convert X to a model matrix)
fit = glm(y ~ numvar + catvar, data = X, family = "binomial")
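To see what model.matrix() itself does, it may help to look at its output on a tiny data frame. In this sketch (the data and the names toy and mm are invented), the factor is expanded into indicator columns using R's default treatment contrasts, with the first level as the reference. glm() performs the same expansion internally when it is given a factor, which is why building the matrix by hand is usually unnecessary.

```r
# Tiny invented data set: one factor, one numeric variable
toy <- data.frame(
  catvar = factor(c("a", "b", "c", "a", "b", "c")),
  numvar = c(0.5, 1.2, -0.3, 2.0, 0.1, -1.4)
)

# Design matrix: intercept, two indicator columns for catvar, and numvar
mm <- model.matrix(~ catvar + numvar, data = toy)
colnames(mm)
mm

# The coding used for the factor (level "a" is the reference)
contrasts(toy$catvar)

# Dropping the intercept column, as the [, -1] in the question does
mm[, -1]
```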

how to use loop to do linear regression in R

I wonder if I can use something like a for loop or an apply function to do linear regression in R. I have a data frame containing variables such as crim, rm, ad, wd. I want to do a simple linear regression of crim on each of the other variables.
Thank you!
If you really want to do this, it's pretty trivial with lapply(), where we use it to "loop" over the other columns of df. A custom function takes each variable in turn as x and fits a model for that covariate.
df <- data.frame(crim = rnorm(20), rm = rnorm(20), ad = rnorm(20), wd = rnorm(20))
mods <- lapply(df[, -1], function(x, dat) lm(crim ~ x, data = dat), dat = df)
mods is now a list of lm objects. The names of mods contain the names of the covariates used to fit the models. The main drawback is that every model is fitted using a variable called x, so the coefficient is labelled x rather than with the covariate's name. More effort could probably solve this, but I doubt that effort is worth the time.
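If the covariate names do matter, one possible approach is to build each formula from the column name with base R's reformulate(). A minimal sketch on dummy data; covariates and slopes are illustrative names:

```r
set.seed(1)
df <- data.frame(crim = rnorm(20), rm = rnorm(20), ad = rnorm(20), wd = rnorm(20))

# Every column except the response
covariates <- setdiff(names(df), "crim")

# Build crim ~ <covariate> for each name and fit the model
mods <- lapply(covariates, function(v) {
  lm(reformulate(v, response = "crim"), data = df)
})
names(mods) <- covariates

# Each model now reports its own covariate name
coef(mods$rm)

# Slopes for all covariates at once
slopes <- sapply(mods, function(m) unname(coef(m)[2]))
slopes
```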
If you are just selecting models, which may be dubious, there are other ways to achieve this. For example via the leaps package and its regsubsets function:
library("leaps")
a <- regsubsets(crim ~ ., data = df, nvmax = 1, nbest = ncol(df) - 1)
summa <- summary(a)
Then plot(a) will show which of the models is "best", for example.
Original
If I understand what you want (crim is a covariate and the other variables are the responses you want to predict/model using crim), then you don't need a loop. You can do this using a matrix response in a standard lm().
Using some dummy data:
df <- data.frame(crim = rnorm(20), rm = rnorm(20), ad = rnorm(20), wd = rnorm(20))
we create a matrix or multivariate response via cbind(), passing it the three response variables we're interested in. The remaining parts of the call to lm are entirely the same as for a univariate response:
mods <- lm(cbind(rm, ad, wd) ~ crim, data = df)
mods
> mods
Call:
lm(formula = cbind(rm, ad, wd) ~ crim, data = df)
Coefficients:
             rm        ad        wd
(Intercept)  -0.12026  -0.47653  -0.26419
crim         -0.26548   0.07145   0.68426
The summary() method produces a standard summary.lm output for each of the responses.
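As a sketch of how the pieces of such a fit can be pulled out (dummy data again; smry and slope_p are illustrative names): coef() returns one column of coefficients per response, and the summary is a list with one summary.lm element per response.

```r
set.seed(42)
df <- data.frame(crim = rnorm(20), rm = rnorm(20), ad = rnorm(20), wd = rnorm(20))

# One fit, three responses
mods <- lm(cbind(rm, ad, wd) ~ crim, data = df)

# Coefficient matrix: rows are terms, columns are responses
coef(mods)

# One summary per response
smry <- summary(mods)
names(smry)

# p-value of the crim slope for each response
slope_p <- sapply(smry, function(s) coef(s)["crim", "Pr(>|t|)"])
slope_p
```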
Suppose the response variable is the first column of your data frame and you want to run a separate simple linear regression of it on each of the other variables.
h=iris[,-5]
for (j in 2:ncol(h)){
assign(paste("a", j, sep = ""),lm(h[,1]~h[,j]))
}
The code above fits one regression per predictor and stores the results in the objects a2, a3, and so on.
