How to use anova() on regression models with missing data?

I am trying to run ANOVAs on regression models (LMERs) with 200-400 observations, so I don't want to drop observations that have any missing data.
Here is the problem I am facing, simplified and reproducible:
dats <- data.frame(y = c(5, 3, 7, 4, 8, 4, 7, 3, 6, 3),
                   x = c(1, 2, 1, 2, 1, 1, 2, 1, 2, 1),
                   z = c(NA, 5, 6, 7, 8, 5, 4, 3, 2, 2))
fit1 <- lm(y ~ x, data = dats, na.action = "na.omit")
fit2 <- lm(y ~ x + z, data = dats, na.action = "na.omit")
anova(fit1, fit2)
And the error I encounter:
Error in anova.lmlist(object, ...) : models were not all fitted to the same size of dataset
Mainly, I need to run these ANOVAs to know whether the changes in the marginal R^2 in the LMERs are statistically significant. Is there any way of running these regressions and ANOVAs without dropping observations with missing data?

Disclaimer: I'm more of a Bayes person and don't really like the NHST world view.
Like @denis, I'm pretty sure the answer is "no". ANOVA is designed to be used on the same data, as it's notionally based on comparing sums of squared errors.
The obvious options would be to exclude the same rows:
d2 <- na.omit(dats[,c('x','y','z')])
f1 <- lm(y ~ x, data=d2)
f2 <- lm(y ~ x + z, data=d2)
anova(f1, f2)
which obviously throws away data, or you could impute the missing data:
i <- na.action(d2)                              # indices of the rows na.omit dropped
d3 <- dats
d3$z[i] <- predict(lm(z ~ x*y, dats), dats[i,]) # single imputation of the missing z
f1 <- lm(y ~ x, data = d3)
f2 <- lm(y ~ x + z, data = d3)
anova(f1, f2)
but this will underestimate the variance.
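If you need the missing-data uncertainty handled properly, multiple imputation is the standard route. A minimal sketch using the mice package (not from the original thread; m = 20 and the default imputation method are illustrative): D1() pools a Wald-type comparison of the nested models across the imputed datasets.
library(mice)
imp <- mice(dats, m = 20, printFlag = FALSE)  # 20 imputed copies of the data
f1 <- with(imp, lm(y ~ x))                    # refit each model on every copy
f2 <- with(imp, lm(y ~ x + z))
D1(f2, f1)                                    # pooled test of the nested models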

Related

How to show the coefficient values and variable importance for logistic regression in R using caret package train() and varImp()

We're performing an exploratory logistic regression and trying to determine the importance of the variables in predicting the outcome. We are using the train() and varImp() functions from the caret package. Ultimately, we would like to create a table/dataframe output that has 3 columns: Variable Name, Importance, and Coefficient. An output like this:
(Image in the original post: the desired output format, a table with those three columns.)
Here's some sample code to illustrate:
library(caret)
# Create a sample dataframe
my_DV <- c(0, 1, 0, 1, 1)
IV1 <- c(10, 40, 15, 35, 38)
IV2 <- c(1, 0, 1, 0, 1)
IV3 <- c(5, 4, 3, 2, 1)
IV4 <- c(5, 7, 3, 8, 9)
IV5 <- c(1, 2, 1, 2, 1)
df <- data.frame(my_DV, IV1, IV2, IV3, IV4, IV5)
df$my_DV <- as.factor(df$my_DV)
df$IV1 <- as.numeric(df$IV1)
df$IV2 <- as.factor(df$IV2)
df$IV3 <- as.numeric(df$IV3)
df$IV4 <- as.numeric(df$IV4)
df$IV5 <- as.factor(df$IV5)
# train model/perform logistic regression
model_one <- train(form = my_DV ~ ., data = df,
                   trControl = trainControl(method = "cv", number = 5),
                   method = "glm", family = "binomial", na.action = na.omit)
summary(model_one)
# get the variable importance
imp <- varImp(model_one)
imp
I would like to take the importance values in imp and merge them with the coefficients from model_one, but I'm fairly new to R and can't figure out how to do it.
Any suggestions are greatly appreciated!
Here is one of many ways to get the desired output:
Assign the summary of the model to an object, extract the coefficients with coef(), bind them together with the variable names and the corresponding importances into a data frame, and then sort the rows by importance using order().
sum_mod <- summary(model_one)
dat <- data.frame(VariableName = rownames(imp$importance),
                  Importance = unname(imp$importance),
                  Coefficient = coef(sum_mod)[rownames(imp$importance),][,1],
                  row.names = NULL)
dat <- dat[order(dat$Importance, decreasing = TRUE),]
The result:
VariableName Importance Coefficient
1 IV1 100.00000 1.0999732
4 IV4 74.48458 3.6665775
2 IV21 34.43803 -7.8831404
3 IV3 0.00000 -0.9166444
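Note that IV2 appears as IV21 because factors are expanded into level-named dummy variables (IV21 is the indicator for level 1 of IV2).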

How to plot interaction effects with individual-level variable and individual fixed-effects?

I have a dataset with an individual id (factor), time t (factor), a dependent variable y (continuous), and an independent variable x (continuous), which is either measured at each time point (xt) or fixed at the individual level (xi).
set.seed(100)
df <- data.frame(id = as.factor(rep(1:20, each = 5)),
                 t = as.factor(rep(1:5, 20)),
                 y = rnorm(100, 5, 2))
df$xt <- rnorm(100, 0, 1)
df$xi <- rep(rnorm(20, 0, 1), each = 5)
I want to estimate the marginal effects (and plot) of the interaction of time and the individual level IV (t:xi) while controlling for individual fixed effects (id). I know that the FEs of id absorb the effects of xi, but I want to see the effect of the interaction t:xi. Below I show how this works with t:xt but does not work with t:xi.
library(effects)
m1 <- lm(y ~ t + xt + t:xt + id, df)
m2 <- lm(y ~ t + xi + t:xi + id, df)
Effect(focal.predictors = c("t", "xt"), mod = m1)
Effect(focal.predictors = c("t", "xi"), mod = m2)
I have tried different ways of writing the interaction term (t + t:xi, t*xi, etc.) and different packages (effects, ggeffects, interplot, margins, etc.). Since there are coefficients for t and t:xi, I think there should be a way to estimate and plot these effects relative to the baseline time period. How could this be done?
A less cumbersome way to estimate the same models that you posted in the answer to your own question would be to use the colon notation:
set.seed(100)
df <- data.frame(id = as.factor(rep(1:20, each = 5)),
                 t = as.factor(rep(1:5, 20)),
                 y = rnorm(100, 5, 2))
df$xt <- rnorm(100, 0, 1)
df$xi <- rep(rnorm(20, 0, 1), each = 5)
m1 <- lm(y ~ t + t:xt + id, df)
m2 <- lm(y ~ t + t:xi + id, df)
You may also want to consider the marginaleffects package as an alternative for computing and plotting adjusted predictions and marginal effects. (Disclaimer: I am the author.)
library(marginaleffects)
plot_cap(m1, condition = c("xt", "t"))
plot_cap(m2, condition = c("xi", "t"))
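(In more recent releases of marginaleffects, plot_cap() was renamed; plot_predictions() with the same condition argument is the analogous call.)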
Found a way to do it. It's a little clunky, but it works. Basically, you have to create a new variable for each interaction except the first. It's not only that the individual FEs absorb the direct effect of xi; they also absorb the interaction between xi and time 1, so the baseline effect of xi at time 1 is already captured by the individual FEs.
df$t2_xi <- ifelse(df$t == 2, df$xi, 0)
df$t3_xi <- ifelse(df$t == 3, df$xi, 0)
df$t4_xi <- ifelse(df$t == 4, df$xi, 0)
df$t5_xi <- ifelse(df$t == 5, df$xi, 0)
m1 <- lm(y ~ t + t2_xi + t3_xi + t4_xi + t5_xi + id, df)
Effect(focal.predictors = c("t2_xi", "t3_xi", "t4_xi", "t5_xi"), mod = m1)
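The effect objects can also be plotted directly. A minimal sketch, not from the original answer, shown for one of the constructed interaction terms:
library(effects)
# plot the estimated effect of xi at time 2 (likewise for t3_xi, t4_xi, t5_xi)
plot(predictorEffect("t2_xi", m1))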

How to bootstrap the result of an arithmetic operation on the regression coefficients of two models

I want to bootstrap the division of the outputs of two regression models, to get confidence intervals for the mean of the result of that operation.
#creating sample data
ldose <- rep(0:5, 2)
numdead <- c(1, 4, 9, 13, 18, 20, 0, 2, 6, 10, 12, 16)
sex <- factor(rep(c("M", "F"), c(6, 6)))
SF <- cbind(numdead, numalive = 20 - numdead)
dat <- data.frame(ldose, numdead, sex, SF)
tibble::rowid_to_column(dat, "indices")
#creating the function to be bootstrapped
out <- function(dat) {
  d <- data[indices, ] # allows boot to select sample
  fit1 <- glm(SF ~ sex*ldose, family = binomial(link = log), start = c(-1, 0, 0, 0))
  fit2 <- glm(SF ~ sex*ldose, family = binomial(link = log), start = c(-1, 0, 0, 0))
  coef1 <- coef(fit1)
  numer <- exp(coef1[2])
  coef2 <- coef(fit2)
  denom <- exp(coef2[2])
  resultX <- numer/denom
  return(mean(resultX))
}
#doing bootstrap
results <- boot(dat, out, 1000)
#error message
Error in statistic(data, original, ...) : unused argument (original)
Thanks in advance for any help.
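No answer appears in this thread, but the error message points at the cause: boot() calls statistic(data, indices), so the function passed as statistic must accept both arguments and do the subsetting itself. A minimal corrected sketch under that assumption (the two identical glm fits are kept from the question, presumably placeholders for two different models; convergence failures of the log-link fits on some resamples are not handled here):
library(boot)
out <- function(data, indices) {
  d <- data[indices, ]                       # the bootstrap sample
  resp <- cbind(d$numdead, 20 - d$numdead)   # rebuild the two-column response
  fit1 <- glm(resp ~ sex * ldose, data = d,
              family = binomial(link = log), start = c(-1, 0, 0, 0))
  fit2 <- glm(resp ~ sex * ldose, data = d,
              family = binomial(link = log), start = c(-1, 0, 0, 0))
  exp(coef(fit1)[2]) / exp(coef(fit2)[2])    # ratio of exponentiated coefficients
}
results <- boot(dat, out, R = 1000)
boot.ci(results, type = "perc")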

Xgboost multiclass prediction with linear booster

Does it make sense to use a linear booster to predict a categorical outcome?
I thought it could work like multinomial logistic regression.
An example in R is as follows:
library(xgboost)
y <- c(0, 1, 2, 0, 1, 2) # target variable with numeric encoding
x1 <- c(1, 3, 5, 3, 5, 7)
x2 <- rnorm(n = 6, sd = 1) + x1
df <- data.matrix(data.frame(x1, x2, y))
xgb <- xgboost(data = df[, c("x1", "x2")], label = df[, "y"],
               params = list(booster = "gblinear", objective = "multi:softmax",
                             num_class = 3),
               save_period = NULL, nrounds = 1)
xgb.importance(model = xgb)
I don't get an error, but the importance table has 6 features instead of the expected 2. Is there any interpretation of the 6 importances in terms of the 2 input variables? Or does this not make any sense, so that only gbtree is sensible?
Thanks
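No answer was posted, but a plausible reading (an assumption, not confirmed in the thread): with booster = "gblinear" and num_class = 3, xgboost fits one set of linear weights per class, so each of the 2 features carries 3 class-specific weights, which would account for the 6 rows. Dumping the model shows the raw per-class weights:
# inspect the raw linear weights (one weight per feature per class, plus biases)
xgb.dump(model = xgb)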

Choosing the X variables and plotting estimated fitted values [duplicate]

I'm trying the following code to see whether predict() can find the values of the dependent variable for a polynomial of order 2; in this case it is obviously y = x^2:
x <- c(1, 2, 3, 4, 5 , 6)
y <- c(1, 4, 9, 16, 25, 36)
mypol <- lm(y ~ poly(x, 2, raw=TRUE))
> mypol
Call:
lm(formula = y ~ poly(x, 2, raw = TRUE))
Coefficients:
(Intercept) poly(x, 2, raw = TRUE)1 poly(x, 2, raw = TRUE)2
0 0 1
If I try to find the value of x=7, I get this:
> predict(mypol, 7)
Error in eval(predvars, data, env) : not that many frames on the stack
What am I doing wrong?
If you read the help for predict.lm, you will see that it takes a number of arguments, including newdata:
newdata -- An optional data frame in which to look for variables with
which to predict. If omitted, the fitted values are used.
predict(mypol, newdata = data.frame(x=7))
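This should return 49 (up to floating-point error), matching y = x^2 at x = 7.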
