test significance between models with emmeans - r

Let's say I have these two models
dat1 <- data.frame(x=factor(c(1,2,1,1,2,2)),y=c(2,5,2,1,7,9))
dat2 <- data.frame(x=factor(c(1,2,1,1,2,2)),y=c(3,3,4,3,4,2))
mod1 <- lm(y~x,data=dat1)
mod2 <- lm(y~x, data=dat2)
and calculate a t test between the levels of x in each model:
library(emmeans)
t1 <- pairs(emmeans(mod1, ~x))
t2 <- pairs(emmeans(mod2, ~x))
How can I assess whether the two models are significantly different for this contrast using emmeans?

dat1$dataset <- "dat1"
dat2$dataset <- "dat2"
alldat <- rbind(dat1, dat2)
modsame <- lm(y ~ x, data = alldat)
moddiff <- lm(y ~ x * dataset, data = alldat)
anova(modsame, moddiff)
Don't try to use emmeans() to do this; that isn't its purpose. The anova() call above compares the two models: modsame presumes that the x effects are the same in each dataset; moddiff adds two terms, dataset which accounts for the change in overall mean, and x:dataset which accounts for the change in x effects.
The comparison between the two models comprises a joint test of both the dataset and the x:dataset effects -- it is an F test with 2 numerator d.f. -- not a t test.
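If the question is specifically whether the x contrast differs between the two data sets, one option is to look at the x:dataset interaction on its own. The following is a minimal sketch, assuming R's default treatment contrasts, under which the interaction coefficient equals the difference between the two within-dataset x contrasts. The coefficient name "x2:datasetdat2" follows from the factor levels used in this example.

```r
dat1 <- data.frame(x = factor(c(1, 2, 1, 1, 2, 2)), y = c(2, 5, 2, 1, 7, 9))
dat2 <- data.frame(x = factor(c(1, 2, 1, 1, 2, 2)), y = c(3, 3, 4, 3, 4, 2))
dat1$dataset <- "dat1"
dat2$dataset <- "dat2"
alldat <- rbind(dat1, dat2)

modsame <- lm(y ~ x, data = alldat)
moddiff <- lm(y ~ x * dataset, data = alldat)

# Joint F test of dataset and x:dataset (2 numerator d.f.)
joint <- anova(modsame, moddiff)
joint

# t test of the interaction alone: does the x effect differ between data sets?
inter <- coef(summary(moddiff))["x2:datasetdat2", ]
inter
```

The joint test also picks up a shift in the overall mean, whereas the interaction row addresses only the change in the x effect.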

Related

How to loop over columns to evaluate different fixed effects in consecutive lme4 mixed models and extract the coefficients and P values?

I am new to R and am trying to loop a mixed model across 90 columns in a dataset.
My dataset looks like the following one but has 90 predictors instead of 7 that I need to evaluate as fixed effects in consecutive models.
I then need to store the model output (coefficients and P values) so that I can construct a figure summarizing the effect size of each predictor. I am aware of the debate about P value estimates from lme4 mixed models.
For example:
library(dplyr) # provides tibble(), %>% and arrange()
set.seed(101)
mydata <- tibble(id = rep(1:32, times=25),
time = sample(1:800),
experiment = rep(1:4, times=200),
Y = sample(1:800),
predictor_1 = runif(800),
predictor_2 = rnorm(800),
predictor_3 = sample(1:800),
predictor_4 = sample(1:800),
predictor_5 = seq(1:800),
predictor_6 = sample(1:800),
predictor_7 = runif(800)) %>% arrange (id, time)
The model to iterate across the N predictors is:
library(lme4)
library(lmerTest) # To obtain p values
mixed.model <- lmer(Y ~ predictor_1 + time + (1|id) + (1|experiment), data = mydata)
summary(mixed.model)
My coding skills are not yet good enough to set up a loop that repeats the model across the N predictors in my dataset and stores the coefficients and P values in a data frame.
I have been able to iterate across all the predictors fitting linear models instead of mixed models using lapply. But I have failed to apply this strategy with mixed models.
varlist <- names(mydata)[5:11]
lm_models <- lapply(varlist, function(x) {
lm(substitute(Y ~ i, list(i = as.name(x))), data = mydata)
})
One option is to update the formula of a restricted model (without the predictor) in an lapply loop over the predictors. Then summarize the resulting list and subset the coefficient matrix using a Vectorized function.
library(lmerTest)
mixed.model <- lmer(Y ~ time + (1|id) + (1|experiment), data = mydata)
preds <- grep('pred', names(mydata), value=TRUE)
fits <- lapply(preds, \(x) update(mixed.model, paste('. ~ . + ', x)))
extract_coef_p <- Vectorize(\(x) x |> summary() |> coef() |> {\(.) .[3, c(1, 5)]}())
res <- `rownames<-`(t(extract_coef_p(fits)), preds)
res
# Estimate Pr(>|t|)
# predictor_1 -7.177579138 0.8002737
# predictor_2 -5.010342111 0.5377551
# predictor_3 -0.013030513 0.7126500
# predictor_4 -0.041702039 0.2383835
# predictor_5 -0.001437124 0.9676346
# predictor_6 0.005259293 0.8818644
# predictor_7 31.304496255 0.2511275
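The update-in-a-loop pattern can be tried without lme4 installed. The sketch below applies the same idea to plain lm() fits on hypothetical toy data (the names toy, base_fit and res_df are made up for illustration), and it picks out the coefficient row by predictor name rather than by position. With lmerTest loaded, the same extraction should carry over to lmer() fits, since their coefficient table also has "Estimate" and "Pr(>|t|)" columns.

```r
set.seed(101)
# Hypothetical toy data: a response, a covariate and three candidate predictors
toy <- data.frame(Y = rnorm(200), time = runif(200),
                  predictor_1 = rnorm(200),
                  predictor_2 = rnorm(200),
                  predictor_3 = rnorm(200))

# Restricted model without any candidate predictor
base_fit <- lm(Y ~ time, data = toy)
preds <- grep("^predictor_", names(toy), value = TRUE)

# Add one predictor at a time and keep its estimate and p value
res <- t(sapply(preds, function(p) {
  fit <- update(base_fit, paste(". ~ . +", p))
  coef(summary(fit))[p, c("Estimate", "Pr(>|t|)")]
}))

# Data frame with one row per predictor, convenient for plotting
res_df <- data.frame(predictor = rownames(res), res,
                     check.names = FALSE, row.names = NULL)
res_df
```

Indexing by name avoids relying on the added predictor always being the third row of the coefficient table.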

Stargazer one line per data set

I am running regressions using various subsets of a data set and a number of dependent variables.
An example using attitude data:
library(stargazer)
#REGRESSIONS USING DATASET 1
linear1.1 <- lm(rating ~ complaints, data = attitude) #dependent 1
linear1.2 <- lm(privileges ~ complaints, data = attitude) #dependent 2
#REGRESSIONS USING DATASET 2
linear2.1 <- lm(rating ~ complaints, data = attitude[1:15,]) #dependent 1
linear2.2 <- lm(privileges ~ complaints, data = attitude[1:15,]) #dependent 2
As you can see, both dependent variables, rating and privileges, are used in regressions for both subsets of the data. Using a standard stargazer approach produces the following table:
stargazer::stargazer(linear1.1,linear1.2,linear2.1,linear2.2,
omit.stat = "all",
keep = "complaints")
Each column represents one of the regression models. However, I'd like to have each column represent one dependent variable. Each subset of the data should represent one row:
I have produced this table by hand. Does anyone know whether it's possible to achieve this using stargazer? I have a lot of regression subsets and dependent variables, so a highly automatic solution is appreciated. Thanks!
I wonder whether this small modification of the approach from "Exporting output of custom multiple regressions from R to Latex" would suit you:
library(stargazer)
library(broom)
## generate dummy data
set.seed(123)
x <- runif(1000)
z <- x^0.5
y <- x + z + rnorm(1000, sd=.05)
model1 <- lm(y ~ x)
model2 <- lm(y ~ z)
## transform model summaries into dataframes
tidy(model1) -> model1_tidy
tidy(model2) -> model2_tidy
output <- rbind(model1_tidy,model2_tidy)
stargazer(output, type='text', summary=FALSE)
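For the layout asked about in the question (one row per data subset, one column per dependent variable), the coefficients can also be collected into a matrix first. This is a minimal base R sketch using the attitude data from the question. The subset labels "full" and "first15" are made up for illustration, and only the complaints coefficient is kept.

```r
# Named list of data subsets; each one becomes a row of the table
subsets <- list(full = attitude, first15 = attitude[1:15, ])
# Dependent variables; each one becomes a column of the table
deps <- c("rating", "privileges")

tab <- sapply(deps, function(dep) {
  sapply(subsets, function(d) {
    fit <- lm(reformulate("complaints", response = dep), data = d)
    coef(fit)[["complaints"]]
  })
})
round(tab, 3)

# The matrix could then be printed as-is, for example with
# stargazer::stargazer(as.data.frame(tab), summary = FALSE, type = "text")
```

Adding more subsets or dependent variables only requires extending subsets or deps.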

Population-level prediction from bam {mgcv}

Using bam, I made a logistic mixed model with the following form:
PresAbs ~ s(Var 1) + s(Var 2) + ... + s(Var n) + s(RandomVar, bs = "re")
RandomVar is a factor, and I am not interested in the predictions for each of its levels. How can I obtain population-level predictions, comparable to those from predict.lme?
One way is simply to exclude the random effect spline from the predictions.
Using the example from ?gam.models
library("mgcv")
dat <- gamSim(1,n=400,scale=2) ## simulate 4 term additive truth
## Now add some random effects to the simulation. Response is
## grouped into one of 20 groups by `fac' and each groups has a
## random effect added....
fac <- as.factor(sample(1:20,400,replace=TRUE))
dat$X <- model.matrix(~fac-1)
b <- rnorm(20)*.5
dat$y <- dat$y + dat$X%*%b
m1 <- gam(y ~ s(fac,bs="re")+s(x0)+s(x1)+s(x2)+s(x3),data=dat,method="ML")
We want to exclude the term s(fac), named exactly as it appears in the output from
summary(m1)
For the observed data, population effects are
predict(m1, exclude = 's(fac)')
but you can supply newdata to generate predictions for other combinations of the covariates.
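A sketch of the newdata case, under some assumptions: the grouping factor is stored as a plain column fac, the covariate grid newd is made up for illustration, and fac is still supplied in newdata (set to an arbitrary existing level) so that predict() finds every variable the model uses. Because s(fac) is excluded, the level chosen should not affect the result.

```r
library(mgcv)
set.seed(1)
dat <- gamSim(1, n = 400, scale = 2, verbose = FALSE)
dat$fac <- factor(sample(1:20, 400, replace = TRUE))
b <- rnorm(20) * 0.5
dat$y <- dat$y + b[dat$fac]  # add a random effect per group

m1 <- gam(y ~ s(fac, bs = "re") + s(x0) + s(x1) + s(x2) + s(x3),
          data = dat, method = "ML")

# Hypothetical grid: vary x1, hold the other covariates at 0.5
newd <- data.frame(x0 = 0.5, x1 = seq(0, 1, length.out = 5),
                   x2 = 0.5, x3 = 0.5,
                   fac = factor(levels(dat$fac)[1], levels = levels(dat$fac)))

# Population-level predictions: the random effect term is excluded
p_pop <- predict(m1, newdata = newd, exclude = "s(fac)")
p_pop
```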

Get Index of variables from stepAIC

I am regressing a gene on another gene subset. Then I use stepAIC to reduce the number of explanatory genes. How do I get the index of the NON-omitted variables, so that I could analyse them?
gene_subset <- data.frame(y = genes[, i], genes[, other_genes])
reduced_model <- stepAIC(lm(y ~ ., data = gene_subset), trace = FALSE)
Here is one solution that I got from the r-help mailing list; any more efficient approaches would be welcome.
library(MASS) # provides stepAIC()
# create example data frame
y <- rnorm(30)
gene_subset <- data.frame(y, x1=rnorm(30), x2=rnorm(30), x3=100*y+rnorm(30))
# fit a full linear model
fit <- lm(y ~ ., data = gene_subset)
# reduce the model
reduced_model <- stepAIC(fit, trace=FALSE)
# NON-omitted variables (excluding the response)
keepx <- names(reduced_model$model)[-1]
index <- match(keepx, names(gene_subset))
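A possible variation, sketched here on simulated data like that above, reads the retained variables from the term labels of the reduced model instead of from its model frame. It assumes the predictors enter as plain columns (no interactions or transformations), so that each term label is also a column name. The dropped vector is added for illustration.

```r
library(MASS)
set.seed(42)
y <- rnorm(30)
gene_subset <- data.frame(y, x1 = rnorm(30), x2 = rnorm(30),
                          x3 = 100 * y + rnorm(30))

fit <- lm(y ~ ., data = gene_subset)
reduced_model <- stepAIC(fit, trace = FALSE)

# Variables kept by stepAIC, and their column positions in the data
kept <- attr(terms(reduced_model), "term.labels")
index <- match(kept, names(gene_subset))

# Variables that were dropped (response excluded)
dropped <- setdiff(names(gene_subset)[-1], kept)
```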

Why is leave-one-out cross-validation of GLM model (package=boot) failing when data contains NaN's?

This is a fairly simple procedure: refitting a GLM with a subset of the data (the training set) and calculating the accuracy of the prediction on the remaining data. I am trying to run a "leave-one-out" strategy on a data set (i.e. the training subset has length n-1) using the cv.glm function of the package boot.
Am I doing something wrong, or is it really the case that the function doesn't handle NAs? I'm guessing that this is fairly easy to program on my own, but I would appreciate any advice if there is some other mistake that I am making. Cheers.
Example:
require(boot)
#create data
n <- 100
x <- runif(n)
e <- rnorm(n, sd=100)
a <- 5
b <- 3
y <- exp(a + b*x) + e
plot(y ~ x)
plot(y ~ x, log="y")
#make some y's NaN
set.seed(1)
y[sample(n, 0.1*n)] <- NaN
#fit glm model
df <- data.frame(y=y, x=x)
glm.fit <- glm(y ~ x, data=df, family=gaussian(link="log"))
summary(glm.fit)
#calculate mean error of prediction (leave-one-out cross-validation)
cv.res <- cv.glm(df, glm.fit)
cv.res$delta
[1] NA NA
You're right. The function is not set up to handle NAs. The various options for the na.action argument of the glm() function don't really help, either. The easiest way to deal with it is to remove the NAs from the data frame at the outset.
sub <- df[!is.na(df$y), ]
glm.fit <- glm(y ~ x, data=sub, family=gaussian(link="log"))
summary(glm.fit)
# calculate mean error of prediction (leave-one-out cross-validation)
cv.res <- cv.glm(sub, glm.fit)
cv.res$delta
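As the question suggests, leave-one-out cross-validation is also straightforward to program by hand. The following is a minimal base R sketch, not a reimplementation of cv.glm. It assumes mean squared error on the response scale as the cost, and it uses a smaller noise standard deviation than the question so that every simulated y stays positive, which the log link needs for its starting values.

```r
set.seed(1)
n <- 100
x <- runif(n)
# Assumption: sd = 10 (not 100) keeps all simulated y values positive
y <- exp(5 + 3 * x) + rnorm(n, sd = 10)
y[sample(n, 0.1 * n)] <- NaN
df <- data.frame(y = y, x = x)

# Drop rows with a missing response once, up front
sub <- df[!is.na(df$y), ]

# Leave-one-out: refit without row i, then predict row i on the response scale
sq_err <- vapply(seq_len(nrow(sub)), function(i) {
  fit_i <- glm(y ~ x, data = sub[-i, ], family = gaussian(link = "log"))
  pred_i <- predict(fit_i, newdata = sub[i, , drop = FALSE], type = "response")
  unname((sub$y[i] - pred_i)^2)
}, numeric(1))

loocv_mse <- mean(sq_err)
loocv_mse
```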
