How to compare GAMs that include random effects and factors? - r

I am trying to compare two GAMs (mgcv package), say for hypothesis testing,
where both models include a random-effect smooth term (s(bs = "re")) and the
second model additionally includes a factor variable. So:
gm0 <- gam(y ~ s(subject, bs = "re"))
gm1 <- gam(y ~ fac + s(subject, bs = "re"))
For reference, I will use a corresponding pair of LMMs:
lmm0 <- lmer(y ~ (1 | subject))
lmm1 <- lmer(y ~ fac + (1 | subject))
As expected, the fixed- and random-effect estimates are very close between
the GAMs and their LMM counterparts. Adding fac improves the LMM fit,
as it should, but seemingly not the GAM fit. Unlike for the LMMs, there is barely any
difference in the reported likelihoods (and hence AIC) between the two GAMs.
Therefore, my question is: how should the fit of such models be compared?
Here's an example:
library(lme4)
library(mgcv)
# Data with one explanatory variable (2-level factor), and 'm' subjects:
n <- 100 # no. of observations
m <- 20 # no. of subjects
set.seed(666)
fac <- gl(2, n / 2, labels = LETTERS[1:2])
subject <- gl(m, n / m, labels = letters[1:m])
beta <- c(-1, 1)
y <- rnorm(n, beta[fac], .25) + rnorm(m, sd = 1.5)[subject]
##
# LMM for comparison:
lmm0 <- lmer(y ~ (1 | subject), REML = FALSE)
lmm1 <- lmer(y ~ fac + (1 | subject), REML = FALSE)
# The two LMMs differ by one df, and the likelihood of
# the second model is higher:
anova(lmm0, lmm1)
## Models:
## lmm0: y ~ (1 | subject)
## lmm1: y ~ fac + (1 | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## lmm0 3 126.49 134.30 -60.243 120.49
## lmm1 4 123.86 134.28 -57.928 115.86 4.6311 1 0.0314 *
# Now equivalent models with GAM and RE-smooth term:
gm0 <- gam(y ~ s(subject, bs = "re"), method = "ML")
gm1 <- update(gm0, . ~ . + fac)
# Fixed effect coefficients, their SE, as well as RE variance components are
# almost identical between GAM and LMM:
# FX:
cbind(coef(gm1)[1:2], fixef(lmm1))
# SE:
cbind(sqrt(diag(vcov(gm1)[1:2, 1:2])),
sqrt(diag(vcov(lmm1))))
# RE variance components:
VarCorr(lmm1)
gam.vcomp(gm1)
# In GAM, unlike in LMM, there is virtually no difference in likelihood of the
# two models in a pair.
AIC(lmm0, lmm1, gm0, gm1)
## df AIC
## lmm0 3.0 126.5
## lmm1 4.0 122.9
## gm0 20.9 30.7 <--
## gm1 20.9 30.7 <--
# Curiously, this happens when 'fac' spans 'subject's, i.e. no subject takes
# two different levels of 'fac'. The reference df in 'gm0' (the null model) is 19,
# i.e. number of subjects - 1. But in 'gm1', where 'fac' is included, the ref. df
# is 18.
table(fac, subject)
## subject
## fac a b c d e f g h i j k l m n o p q r s t
## A 5 5 5 5 5 5 5 5 5 5 0 0 0 0 0 0 0 0 0 0
## B 0 0 0 0 0 0 0 0 0 0 5 5 5 5 5 5 5 5 5 5
summary(gm0)
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(subject) 18.92 19 248.6 <2e-16 ***
summary(gm1)
## edf Ref.df F p-value
## s(subject) 17.91 18 208 <2e-16 ***
# However, when 'fac' intersects 'subject's, i.e. when a single subject can take
# different levels of 'fac', the ref. df is again m - 1 = 19.
gm1b <- gam(y ~ sample(fac) + s(subject, bs = "re"))
summary(gm1b)
## edf Ref.df F p-value
## s(subject) 18.92 19 246.3 <2e-16 ***
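A side check that is not part of the original post: one way to see why the nested layout behaves differently is to look at the unpenalized design matrices. When every subject has a single level of fac, the fac column is a sum of subject indicator columns, so it adds no new direction to the model space. The names Z, X and fac_shuffled below are illustrative only, and this sketch looks at ranks alone, not at the penalized fit that mgcv actually performs.

```r
# Rebuild the design from the example above (same n, m, fac and subject).
n <- 100
m <- 20
fac <- gl(2, n / 2, labels = LETTERS[1:2])
subject <- gl(m, n / m, labels = letters[1:m])

Z <- model.matrix(~ subject - 1)  # one indicator column per subject
X <- model.matrix(~ fac)          # intercept and facB

# Nested case: adding the fixed-effect columns does not raise the rank.
rank_subject <- qr(Z)$rank
rank_nested  <- qr(cbind(Z, X))$rank

# Crossed case: shuffle fac so that subjects see both levels.
set.seed(666)
fac_shuffled <- sample(fac)
rank_crossed <- qr(cbind(Z, model.matrix(~ fac_shuffled)))$rank

c(subject = rank_subject, nested = rank_nested, crossed = rank_crossed)
```

In the nested case the rank stays at m, while in the crossed case it goes up by one. That is consistent with the Ref.df dropping from 19 to 18 only when fac is nested, although it does not by itself say which likelihood should be compared.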

Related

Multiplying a categorical variable with a dummy in regression

I am trying to run a regression of scores on a female dummy (taking a value of 0 or 1), and I also have the country for each observation. I want to add a fixed effect where female is interacted with country, but every method I try fails because I am multiplying a numeric variable by a factor.
I have tried the fastDummies package, but that did not work. I also tried the country - 1 approach and multiplying it by female, with no success.
#first wrong
olss1= lm(pv1math ~ female + I(ggi*female) + factor(country) + factor(year) + I(female * factor(country)), data = f1)
# second wrong
olss1= lm(pv1math ~ female + I(ggi*female) + factor(country) + factor(year) + factor( female * country ), data = f1)
Error messages are that I cannot multiply factor with numeric
The * operator in the formula will give interactions as well as lower order terms. Here is an example:
country <- c("A", "A", "A", "B", "B", "B")
female <- c(1, 1, 0, 1, 0, 1)
y <- 1:6
fm <- lm(y ~ country * female)
fm
giving:
Call:
lm(formula = y ~ country * female)
Coefficients:
(Intercept) countryB female countryB:female
3.0 2.0 -1.5 1.5
Also we can check the model matrix
model.matrix(fm)
giving
(Intercept) countryB female countryB:female
1 1 0 1 0
2 1 0 1 0
3 1 0 0 0
4 1 1 1 1
5 1 1 0 0
6 1 1 1 1
attr(,"assign")
[1] 0 1 2 3
attr(,"contrasts")
attr(,"contrasts")$country
[1] "contr.treatment"
You won't need the I() here. * alone will perform an interaction, whereas I() will execute an arithmetic operation before the regression.
Compare:
lm(pv1math ~ ggi*female, data=dat)$coefficients
# (Intercept) ggi female ggi:female
# ... ... ... ...
lm(pv1math ~ I(ggi*female), data=dat)$coefficients
# (Intercept) I(ggi * female)
# ... ...
I() is useful e.g. for polynomials, where age is a popular candidate: pv1math ~ age + I(age^2) + I(age^3), or to binarize a dependent variable in a GLM: glm(I(pv1math > 0.75) ~ ggi*female, family=binomial).
And - as @G.Grothendieck already wrote - you don't need to repeat the variables that are already present in the interaction term (it's just redundant), so you may want to try:
lm(pv1math ~ ggi*female + factor(year) + female*factor(country), data=f1)
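To make the redundancy point concrete, here is a minimal sketch on made-up data (the toy data frame and its values are hypothetical, not the asker's f1). It checks that female * country already expands to both main effects plus the interaction, so spelling the main effects out again changes nothing.

```r
# Hypothetical toy data: every country contains both values of female.
toy <- data.frame(
  score   = c(5.1, 4.8, 6.0, 5.5, 4.9, 6.2, 5.0, 5.7, 6.1, 4.7, 5.3, 5.9),
  female  = rep(c(0, 1), each = 6),
  country = factor(rep(c("A", "B", "C"), times = 4))
)

# The * operator expands to main effects plus the interaction.
term_labels <- attr(terms(score ~ female * country), "term.labels")
term_labels

# Repeating the main effects gives the same model.
fit_short <- lm(score ~ female * country, data = toy)
fit_long  <- lm(score ~ female + country + female * country, data = toy)
all.equal(coef(fit_short), coef(fit_long))
```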

How to write a loop to simulate sampling distribution of t-statistic under null using a true model?

My current difficulty with this problem is understanding how to simulate 10,000 draws while keeping the covariates fixed.
      Y      X1    X2    X3
  <int>   <dbl> <dbl> <int>
1  4264 305.657  7.17     0
2  4496 328.476  6.20     0
3  4317 317.164  4.61     0
4  4292 366.745  7.02     0
5  4945 265.518  8.61     1
6  4325 301.995  6.88     0
That is the head of the grocery data.
What I've done so far for related problems:
#5.
#using beta_hat
#create the design matrix: a column of ones (52 rows) plus all the Xs
X <- cbind(rep(1,52), grocery$X1, grocery$X2, grocery$X3)
beta_hat <- solve((t(X) %*% X)) %*% t(X) %*% grocery$Y
round(t(beta_hat), 2)
#using lm formula and residuals
#lm formula
lm0 <- lm(formula = Y ~ X1 + X2 + X3, data = grocery)
#6.
residuals(lm0)[1:5]
Below is the summary of the lm() fit to the original data:
Call:
lm(formula = Y ~ X1 + X2 + X3, data = grocery)
Residuals:
Min 1Q Median 3Q Max
-264.05 -110.73 -22.52 79.29 295.75
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4149.8872 195.5654 21.220 < 2e-16 ***
X1 0.7871 0.3646 2.159 0.0359 *
X2 -13.1660 23.0917 -0.570 0.5712
X3 623.5545 62.6409 9.954 2.94e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 143.3 on 48 degrees of freedom
Multiple R-squared: 0.6883, Adjusted R-squared: 0.6689
F-statistic: 35.34 on 3 and 48 DF, p-value: 3.316e-12
The result should be a loop that simulates the sampling distribution of the t-statistic. What I have right now is for another problem that focuses on fitting the model to the data.
Here I'm given the true model (under the null hypothesis), but I'm not sure where to begin with the loop.
Okay, have a look at the following:
# get some sample data:
set.seed(42)
df <- data.frame(X1 = rnorm(10), X2 = rnorm(10), X3 = rbinom(10, 1, 0.5))
# note how X1 gets multiplied with 0, to highlight that the null is imposed.
df$y_star <- with(df, 4200 + 0*X1 - 15*X2 + 620 * X3)
head(df)
X1 X2 X3 y_star
1 1.37095845 1.3048697 0 4180.427
2 -0.56469817 2.2866454 0 4165.700
3 0.36312841 -1.3888607 0 4220.833
4 0.63286260 -0.2787888 1 4824.182
5 0.40426832 -0.1333213 0 4202.000
# define function to get the t statistic
get_tstat <- function(){
# declare the outcome, with random noise added:
# The added random noise here will be different in each draw
df$y <- with(df, y_star + rnorm(10, mean = 0, sd = sqrt(20500)))
# run linear model
mod <- lm(y ~ X1 + X2 + X3, data = df)
return(summary(mod)$coefficients["X1", "t value"])
}
# get 10 values from the t-statistic:
replicate(10, get_tstat())
[1] -0.8337737 -1.2567709 -1.2303073 0.3629552 -0.1203216 -0.1150734 0.3533095 1.6261360
[9] 0.8259006 -1.3979176
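Scaling this up to the 10,000 draws asked about could look like the sketch below. It reuses the toy covariates from this answer rather than the grocery data, passes the data frame in as an argument (a small change from the function above), and compares the simulated rejection rate with the nominal 5% level; the names n_draws, t_null and reject_rate are illustrative.

```r
set.seed(42)
# Covariates are generated once and then held fixed across all draws.
df <- data.frame(X1 = rnorm(10), X2 = rnorm(10), X3 = rbinom(10, 1, 0.5))
# True model with the null imposed: the coefficient on X1 is 0.
df$y_star <- with(df, 4200 + 0 * X1 - 15 * X2 + 620 * X3)

get_tstat <- function(dat, sigma = sqrt(20500)) {
  # Only the noise changes from draw to draw.
  dat$y <- dat$y_star + rnorm(nrow(dat), mean = 0, sd = sigma)
  mod <- lm(y ~ X1 + X2 + X3, data = dat)
  summary(mod)$coefficients["X1", "t value"]
}

n_draws <- 10000
t_null <- replicate(n_draws, get_tstat(df))

# Under the null the statistic follows a t distribution with n - 4 df,
# so roughly 5% of draws should exceed the two-sided critical value.
crit <- qt(0.975, df = nrow(df) - 4)
reject_rate <- mean(abs(t_null) > crit)
reject_rate
```

A histogram via hist(t_null) then gives the simulated sampling distribution. With the real data, the same loop would use the grocery covariates in place of df.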

Is there a way to change the way R labels the interaction parameter in model output?

I am having a seemingly simple but very frustrating problem. When you run a model with an interaction term in R, R names the parameter generated "var1:var2" etc. Unfortunately, this naming convention prevents me from calculating predicted values and CI's where newdata is required, because ":" is not a character that can be included in a column header, and the names in the original data frame must exactly match those in newdata. Has anyone else had this problem?
Here is a sample of my code:
wemedist2.exp = glm(survive/trials ~ sitedist + type + sitedist*type + roaddist, family = binomial(logexp(wemedata$expos)), data=wemedata)
summary(wemedist2.exp)
wemepredict3 = with(wemedata, data.frame(sitedist=mean(sitedist),roaddist=mean(roaddist), type=factor(1:2)))
wemepredict3 = cbind(wemepredict3, predict(wemedist2.exp, newdata = wemepredict3, type = "link", se = TRUE))
This produces a table with predicted values for each of the variables at the specified levels, but not interaction.
For your newdata data frame, you shouldn't include columns for the interactions. The product of the interactive variables will be calculated for you (and multiplied by the estimated coefficient) when calling predict.
For example:
Create some dummy data:
set.seed(1)
n <- 10000
X <- data.frame(x1=runif(n), x2=runif(n))
X$x1x2 <- X$x1 * X$x2
head(X)
# x1 x2 x1x2
# 1 0.2655087 0.06471249 0.017181728
# 2 0.3721239 0.67661240 0.251783646
# 3 0.5728534 0.73537169 0.421260147
# 4 0.9082078 0.11129967 0.101083225
# 5 0.2016819 0.04665462 0.009409393
# 6 0.8983897 0.13091031 0.117608474
b <- runif(4)
y <- b[1] + c(as.matrix(X) %*% b[-1]) + rnorm(n, sd=0.1)
Fit the model and compare the estimated vs. true coefficients:
M <- lm(y ~ x1 * x2, X)
summary(M)
# Call:
# lm(formula = y ~ x1 * x2, data = X)
#
# Residuals:
# Min 1Q Median 3Q Max
# -0.43208 -0.06743 -0.00170 0.06601 0.37197
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.202040 0.003906 51.72 <2e-16 ***
# x1 0.128237 0.006809 18.83 <2e-16 ***
# x2 0.156942 0.006763 23.21 <2e-16 ***
# x1:x2 0.292582 0.011773 24.85 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.09906 on 9996 degrees of freedom
# Multiple R-squared: 0.5997, Adjusted R-squared: 0.5996
# F-statistic: 4992 on 3 and 9996 DF, p-value: < 2.2e-16
b
# [1] 0.2106027 0.1147864 0.1453641 0.3099322
Create example data to predict to, and do prediction. Note that we only create x1 and x2, and do not create x1:x2:
X.predict <- data.frame(x1=runif(10), x2=runif(10))
head(X.predict)
# x1 x2
# 1 0.26037592 0.7652155
# 2 0.73988333 0.3352932
# 3 0.02650689 0.9788743
# 4 0.84083874 0.1446228
# 5 0.85052685 0.7674547
# 6 0.13568509 0.9612156
predict(M, newdata=X.predict)
# 1 2 3 4 5 6 7
# 0.4138194 0.4221251 0.3666572 0.3681432 0.6225354 0.4084543 0.4711018
# 8 9 10
# 0.7092744 0.3401867 0.2320834
Or...
An alternative approach is to include the interactions in your model-fitting data by calculating the product of the interactive terms, and then include this in your new data as well. We've done the first step in point 1 above, where we created a column called x1x2.
Then we would fit the model with: lm(y ~ x1 + x2 + x1x2, X)
And predict to the following data:
X.predict <- data.frame(x1=runif(10), x2=runif(10))
X.predict$x1x2 <- X.predict$x1 * X.predict$x2
If you have categorical variables involved in interactions...
When you have interactions involving categorical variables, the model estimates coefficients describing the effect of belonging to each level relative to belonging to a reference level. So for instance if we have one continuous predictor (x1) and one categorical predictor (x2, with levels a, b, and c), then the model y ~ x1 * x2 will estimate six coefficients, describing:
the intercept (i.e. the predicted y when x1 is zero and the observation belongs to the reference level of x2);
the effect of varying x1 when the observation belongs to the reference level of x2 (i.e. the slope, for the reference level of x2);
the effect of belonging to the second level (i.e. the change in intercept due to belonging to the second level, relative to belonging to the reference level);
the effect of belonging to the third level (i.e. the change in intercept due to belonging to the third level, relative to belonging to the reference level);
the change in the effect of x1 (i.e. change in slope) due to belonging to the second level, relative to belonging to the reference level; and
the change in the effect of x1 (i.e. change in slope) due to belonging to the third level, relative to belonging to the reference level.
If you want to fit and predict the model with/to pre-calculated data describing the interaction, you can create a dataframe that includes columns: x1; x2b (binary, indicating whether the observation belongs to level b); x2c (binary, indicating whether the observation belongs to level c); x1x2b (the product of x1 and x2b); and x1x2c (the product of x1 and x2c).
A quick way to do this is with model.matrix:
set.seed(1)
n <- 1000
d <- data.frame(x1=runif(n), x2=sample(letters[1:3], n, replace=TRUE))
head(d)
# x1 x2
# 1 0.2655087 b
# 2 0.3721239 c
# 3 0.5728534 b
# 4 0.9082078 c
# 5 0.2016819 a
# 6 0.8983897 a
X <- model.matrix(~x1*x2, d)
head(X)
# (Intercept) x1 x2b x2c x1:x2b x1:x2c
# 1 1 0.2655087 1 0 0.2655087 0.0000000
# 2 1 0.3721239 0 1 0.0000000 0.3721239
# 3 1 0.5728534 1 0 0.5728534 0.0000000
# 4 1 0.9082078 0 1 0.0000000 0.9082078
# 5 1 0.2016819 0 0 0.0000000 0.0000000
# 6 1 0.8983897 0 0 0.0000000 0.0000000
b <- rnorm(6) # coefficients
y <- X %*% b + rnorm(n, sd=0.1)
You can rename the columns of X to whatever you want, as long as you use consistent naming when predicting the model to new data later.
Now fit the model. Here I tell lm not to calculate an intercept (with -1), since the variable (Intercept) already exists in X and will have a coefficient calculated for it. We could have also done this by fitting to data as.data.frame(X[, -1]):
(M <- lm(y ~ . - 1, as.data.frame(X)))
# Call:
# lm(formula = y ~ . - 1, data = as.data.frame(X))
#
# Coefficients:
# `(Intercept)` x1 x2b x2c `x1:x2b` `x1:x2c`
# 1.14389 1.09168 -0.88879 0.20405 0.09085 -1.63769
Create some new data to predict to, and carry out the prediction:
d.predict <- expand.grid(x1=seq(0, 1, 0.1), x2=letters[1:3])
X.predict <- model.matrix(~x1*x2, d.predict)
y.predict <- predict(M, as.data.frame(X.predict))
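As a sanity check on the two routes described in this answer, the sketch below fits the same simulated data both ways and confirms that the predictions agree. It is a variation on the code above rather than a copy: the response is stored inside the data frames, and the names M_formula, M_matrix and X_df are illustrative.

```r
set.seed(1)
n <- 1000
d <- data.frame(x1 = runif(n), x2 = sample(letters[1:3], n, replace = TRUE))
X <- model.matrix(~ x1 * x2, d)
d$y <- as.numeric(X %*% rnorm(6) + rnorm(n, sd = 0.1))

# Route 1: let the formula build the interaction columns.
M_formula <- lm(y ~ x1 * x2, data = d)

# Route 2: fit to the pre-computed model matrix columns (no extra intercept).
X_df <- as.data.frame(X)
X_df$y <- d$y
M_matrix <- lm(y ~ . - 1, data = X_df)

d.predict <- expand.grid(x1 = seq(0, 1, 0.1), x2 = letters[1:3])
p_formula <- predict(M_formula, newdata = d.predict)
p_matrix  <- predict(M_matrix,
                     newdata = as.data.frame(model.matrix(~ x1 * x2, d.predict)))

# Largest absolute difference between the two sets of predictions.
max(abs(p_formula - p_matrix))
```

The difference should be at the level of floating-point noise, which suggests that the first route (newdata with only x1 and x2) is usually the simpler choice.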

Using mice in R changes dummy coding

I'm trying to use the mice package in R for a project and discovered that the pooled results seemed to change the dummy code I had for one of the variables in the output.
To elaborate, let's say I have a factor, foo, with two levels: 0 and 1. Using a regular lm would typically yield an estimate for foo1. Using mice and the pool function, however, yields an estimate for foo2. I included a reproducible example below using the nhanes dataset from the mice package. Any ideas why this might be occurring?
require(mice)
# Create age as: 0, 1, 2
nhanes$age <- as.factor(nhanes$age - 1)
head(nhanes)
# age bmi hyp chl
# 1 0 NA NA NA
# 2 1 22.7 1 187
# 3 0 NA 1 187
# 4 2 NA NA NA
# 5 0 20.4 1 113
# 6 2 NA NA 184
# Use a regular lm with missing data just to see output
# age1 and age2 come up as expected
lm(chl ~ age + bmi, data = nhanes)
# Call:
# lm(formula = chl ~ age + bmi, data = nhanes)
# Coefficients:
# (Intercept) age1 age2 bmi
# -28.948 55.810 104.724 6.921
imp <- mice(nhanes)
str(complete(imp)) # still the same coding
fit <- with(imp, lm(chl ~ age + bmi))
pool(fit)
# Now the estimates are for age2 and age3
# Call: pool(object = fit)
# Pooled coefficients:
# (Intercept) age2 age3 bmi
# 29.88431 43.76159 56.57606 5.05537
Apparently the mice function sets contrasts for factors. So you get the following (check out the column names):
contrasts(nhanes$age)
## 1 2
## 0 0 0
## 1 1 0
## 2 0 1
contrasts(imp$data$age)
## 2 3
## 0 0 0
## 1 1 0
## 2 0 1
You can just change the contrasts of the imputed data, then you get the same dummy coding:
imp <- mice(nhanes)
contrasts(imp$data$age) <- contrasts(nhanes$age)
fit <- with(imp, lm(chl ~ age + bmi))
pool(fit)
## Call: pool(object = fit)
##
## Pooled coefficients:
## (Intercept) age1 age2 bmi
## 0.9771566 47.6351257 63.1332336 6.2589887
##
## Fraction of information about the coefficients missing due to nonresponse:
## (Intercept) age1 age2 bmi
## 0.3210118 0.5554399 0.6421063 0.3036489
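The relabelling can be reproduced without mice. The sketch below is base R only and uses made-up data (age, chl and age_relabelled here are illustrative vectors, not the nhanes columns). It shows that the coefficient labels come from the column names of the contrast matrix, while the estimates themselves do not change.

```r
set.seed(1)
age <- factor(rep(0:2, each = 5))        # levels "0", "1", "2"
chl <- rnorm(15, mean = 190, sd = 20)    # made-up response

# Default treatment contrasts take their column names from the levels.
fit_default <- lm(chl ~ age)
names(coef(fit_default))

# contr.treatment(3) has column names "2" and "3", regardless of the levels.
age_relabelled <- age
contrasts(age_relabelled) <- contr.treatment(3)
fit_relabelled <- lm(chl ~ age_relabelled)
names(coef(fit_relabelled))

# Same estimates, different labels.
all.equal(unname(coef(fit_default)), unname(coef(fit_relabelled)))
```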

lm options, do regression of each category [duplicate]

Data:
Y X levels
y1 x1 2
...
lm(Y~X,I(levels==1))
Does the I(levels==1) mean under levels==1? If not, how can I do regression of Y vs X only when levels equals 1?
Have a look at lmList from the nlme package
set.seed(12345)
dataset <- data.frame(x = rnorm(100), y = rnorm(100), levels = gl(2, 50))
dataset$y <- with(dataset,
y + (0.1 + as.numeric(levels)) * x + 5 * as.numeric(levels)
)
library(nlme)
models <- lmList(y ~ x|levels, data = dataset)
The output is a list of lm models, one per level:
models
Call:
Model: y ~ x | levels
Data: dataset
Coefficients:
(Intercept) x
1 4.964104 1.227478
2 10.085231 2.158683
Degrees of freedom: 100 total; 96 residual
Residual standard error: 1.019202
Here is the summary of the first model:
summary(models[[1]])
Call:
lm(formula = form, data = dat, na.action = na.action)
Residuals:
Min 1Q Median 3Q Max
-2.16569 -1.04457 -0.00318 0.78667 2.65927
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.9641 0.1617 30.703 < 2e-16 ***
x 1.2275 0.1469 8.354 6.47e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.128 on 48 degrees of freedom
Multiple R-squared: 0.5925, Adjusted R-squared: 0.584
F-statistic: 69.78 on 1 and 48 DF, p-value: 6.469e-11
lm has a subset argument; here is an example.
x <- rnorm(100)
y <- rnorm(100, sd=0.1)
y[1:50] <- y[1:50] + 3*x[1:50] + 10 # line y = 3x+10
y[51:100] <- y[51:100] + 8*x[51:100] - 5 # line y = 8x-5
levels <- rep(1:2, each=50, len=100)
data = data.frame(x=x, y=y, levels=levels)
lm(y ~ x, data=data, subset=levels==1) # regression for the first part
Coefficients: (Intercept) x
10.015 2.996
lm(y ~ x, data=data, subset=levels==2) # second part
Coefficients: (Intercept) x
-4.986 8.000
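In a call such as lm(Y ~ X, I(levels==1), data = ...), where data is given by name, the unnamed I(levels==1) is matched positionally to the subset argument of lm.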
I was not sure. But this code seems to suggest that you are correct.
my.data <- "x y level
1 2 1
2 4 2
3 4 1
4 3 2
5 5 1
6 5 2
7 7 1
8 6 2
9 10 1
10 5 2"
my.data2 <- read.table(textConnection(my.data), header = T)
my.data2
lm(x ~ y,I(level==1), data=my.data2)
my.data <- "x y level
1 2 1
3 4 1
5 5 1
7 7 1
9 10 1"
my.data2 <- read.table(textConnection(my.data), header = T)
my.data2
lm(x ~ y, data=my.data2)
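The same comparison can be made programmatically. This is a minimal sketch on simulated data (dat and the fit names are illustrative), checking that subset = gives the same coefficients as fitting to the filtered rows, and showing a base R way to fit every level at once with split and lapply.

```r
set.seed(123)
dat <- data.frame(x = rnorm(100), levels = rep(1:2, each = 50))
# Two different lines: y = 3x + 10 for level 1, y = 8x - 5 for level 2.
dat$y <- ifelse(dat$levels == 1, 3 * dat$x + 10, 8 * dat$x - 5) +
  rnorm(100, sd = 0.1)

# Named subset argument versus filtering the rows beforehand.
fit_subset   <- lm(y ~ x, data = dat, subset = levels == 1)
fit_filtered <- lm(y ~ x, data = dat[dat$levels == 1, ])
all.equal(coef(fit_subset), coef(fit_filtered))

# One regression per level, without any extra package.
fits_by_level <- lapply(split(dat, dat$levels), function(g) lm(y ~ x, data = g))
sapply(fits_by_level, coef)
```

Naming the argument (subset = levels == 1) avoids relying on positional matching.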
