I am trying to run a regression of test scores on a female dummy (taking a value of 0 or 1), and I also have the country for each observation. I want to include a fixed effect in the regression where female is interacted with country, but every method I try fails because I am multiplying a numeric with a factor.
I have tried using fastdummies, but that did not work. I also tried the country - 1 method and multiplying that with female, with no success.
#first wrong
olss1= lm(pv1math ~ female + I(ggi*female) + factor(country) + factor(year) + I(female * factor(country)), data = f1)
# second wrong
olss1= lm(pv1math ~ female + I(ggi*female) + factor(country) + factor(year) + factor( female * country ), data = f1)
The error messages say that I cannot multiply a factor with a numeric.
The * operator in the formula will give interactions as well as lower order terms. Here is an example:
country <- c("A", "A", "A", "B", "B", "B")
female <- c(1, 1, 0, 1, 0, 1)
y <- 1:6
fm <- lm(y ~ country * female)
fm
giving:
Call:
lm(formula = y ~ country * female)
Coefficients:
(Intercept) countryB female countryB:female
3.0 2.0 -1.5 1.5
Also we can check the model matrix
model.matrix(fm)
giving
(Intercept) countryB female countryB:female
1 1 0 1 0
2 1 0 1 0
3 1 0 0 0
4 1 1 1 1
5 1 1 0 0
6 1 1 1 1
attr(,"assign")
[1] 0 1 2 3
attr(,"contrasts")
attr(,"contrasts")$country
[1] "contr.treatment"
You won't need the I() here. * alone will perform an interaction, whereas I() will execute an arithmetic operation before the regression.
Compare:
lm(pv1math ~ ggi*female, data=dat)$coefficients
# (Intercept) ggi female ggi:female
# ... ... ... ...
lm(pv1math ~ I(ggi*female), data=dat)$coefficients
# (Intercept) I(ggi * female)
# ... ...
I() is useful e.g. for polynomials, where age is a popular candidate: pv1math ~ age + I(age^2) + I(age^3), or to binarize a dependent variable in a GLM: glm(I(pv1math > 0.75) ~ ggi*female, family=binomial).
And, as @G.Grothendieck already wrote, you don't need to repeat the variables that are already present in the interaction term (it's just redundant), so you may want to try:
lm(pv1math ~ ggi*female + factor(year) + female*factor(country), data=f1)
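If you want to see exactly which columns that formula generates, model.matrix() on a small hypothetical data set (f1 itself is not shown in the question) makes the fixed-effect and interaction dummies explicit:
toy <- data.frame(female  = rep(0:1, 4),
                  ggi     = runif(8),
                  country = rep(c("A", "B"), each = 4),
                  year    = rep(c(2003, 2006), 4))
colnames(model.matrix(~ ggi*female + factor(year) + female*factor(country), toy))
# "(Intercept)", "ggi", "female", "factor(year)2006", "factor(country)B",
# "ggi:female", "female:factor(country)B"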
I am trying to make a model comparison (say, for hypothesis testing) of two GAMs (mgcv package), where both models include a random effects smooth term (s(bs = "re")) and the second model additionally includes a factor variable. So:
gm0 <- gam(y ~ s(subject, bs = "re"))
gm1 <- gam(y ~ fac + s(subject, bs = "re"))
For reference, I will use a corresponding pair of LMMs:
lmm0 <- lmer(y ~ (1 | subject))
lmm1 <- lmer(y ~ fac + (1 | subject))
As expected, the fixed and random effects parameters are very close between the GAMs and their LMM counterparts. The addition of fac improves the LMM fit, as it should, but seemingly not the GAM fit. Unlike in the LMM, there is barely any difference in the reported likelihoods (and hence AIC) between the two GAM models.
Therefore, my question is: how to compare the fit of such models?
Here's an example:
library(lme4)
library(mgcv)
# Data with one explanatory variable (2-level factor), and 'm' subjects:
n <- 100 # no. of observations
m <- 20 # no. of subjects
set.seed(666)
fac <- gl(2, n / 2, labels = LETTERS[1:2])
subject <- gl(m, n / m, labels = letters[1:m])
beta <- c(-1, 1)
y <- rnorm(n, beta[fac], .25) + rnorm(m, sd = 1.5)[subject]
##
# LMM for comparison:
lmm0 <- lmer(y ~ (1 | subject), REML = FALSE)
lmm1 <- lmer(y ~ fac + (1 | subject), REML = FALSE)
# The two LMM differ by one *df* and the likelihood of
# the second model is higher:
anova(lmm0, lmm1)
## Models:
## lmm0: y ~ (1 | subject)
## lmm1: y ~ fac + (1 | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## lmm0 3 126.49 134.30 -60.243 120.49
## lmm1 4 123.86 134.28 -57.928 115.86 4.6311 1 0.0314 *
# Now equivalent models with GAM and RE-smooth term:
gm0 <- gam(y ~ s(subject, bs = "re"), method = "ML")
gm1 <- update(gm0, . ~ . + fac)
# Fixed effect coefficients, their SE, as well as RE variance components are
# almost identical between GAM and LMM:
# FX:
cbind(coef(gm1)[1:2], fixef(lmm1))
# SE:
cbind(sqrt(diag(vcov(gm1)[1:2, 1:2])),
sqrt(diag(vcov(lmm1))))
# RE variance components:
VarCorr(lmm1)
gam.vcomp(gm1)
# In GAM, unlike in LMM, there is virtually no difference in likelihood of the
# two models in a pair.
AIC(lmm0, lmm1, gm0, gm1)
## df AIC
## lmm0 3.0 126.5
## lmm1 4.0 122.9
## gm0 20.9 30.7 <--
## gm1 20.9 30.7 <--
# Curiously, this happens when 'fac' spans 'subject's, i.e. no subject takes
# two different levels of 'fac'. The reference df in 'gm0' (the null model) is 19,
# i.e. the number of subjects - 1. But in 'gm1', where 'fac' is included, the
# reference df is 18.
table(fac, subject)
## subject
## fac a b c d e f g h i j k l m n o p q r s t
## A 5 5 5 5 5 5 5 5 5 5 0 0 0 0 0 0 0 0 0 0
## B 0 0 0 0 0 0 0 0 0 0 5 5 5 5 5 5 5 5 5 5
summary(gm0)
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(subject) 18.92 19 248.6 <2e-16 ***
summary(gm1)
## edf Ref.df F p-value
## s(subject) 17.91 18 208 <2e-16 ***
# However, when 'fac' intersects 'subject's, i.e. when single subject can encompass
# different levels of 'fac', ref. df is again m - 1 = 19.
gm1b <- gam(y ~ sample(fac) + s(subject, bs = "re"))
summary(gm1b)
## edf Ref.df F p-value
## s(subject) 18.92 19 246.3 <2e-16 ***
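For reference, the direct analysis-of-deviance comparison of the two gam fits can be set up as below (a minimal sketch, not part of the original question); given the near-identical likelihoods reported above, it shows essentially no improvement from fac, which is precisely the puzzle:
anova(gm0, gm1, test = "Chisq")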
I want to run a regression in R using glm, but I get the contrasts error shown below. Is there a way to make it work?
mydf <- data.frame(Group=c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12),
WL=rep(c(1,0),12),
New.Runner=c("N","N","N","N","N","N","Y","N","N","N","N","N","N","Y","N","N","N","Y","N","N","N","N","N","Y"),
Last.Run=c(1,5,2,6,5,4,NA,3,7,2,4,9,8,NA,3,5,1,NA,6,10,7,9,2,NA))
mod <- glm(formula = WL~New.Runner+Last.Run, family = binomial, data = mydf)
#Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
# contrasts can be applied only to factors with 2 or more levels
Using the debug_contr_error and debug_contr_error2 functions defined in How to debug “contrasts can be applied only to factors with 2 or more levels” error?, we can easily locate the problem: only a single level is left in the variable New.Runner.
info <- debug_contr_error2(WL ~ New.Runner + Last.Run, mydf)
info[c(2, 3)]
#$nlevels
#New.Runner
# 1
#
#$levels
#$levels$New.Runner
#[1] "N"
## the data frame that is actually used by `glm`
dat <- info$mf
Contrasts cannot be applied to a single-level factor, because any kind of contrasts reduces the number of levels by 1. Since 1 - 1 = 0, such a variable would simply be dropped from the model matrix.
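For comparison, this is what contrasts do to a two-level factor: the default treatment contrasts reduce it to a single 0/1 dummy column (a quick illustration, not part of the original answer):
contr.treatment(n = 2)
#   2
# 1 0
# 2 1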
Well then, can we simply require that no contrasts be applied to a single-level factor? No. All contrasts methods forbid this:
contr.helmert(n = 1, contrasts = FALSE)
#Error in contr.helmert(n = 1, contrasts = FALSE) :
# not enough degrees of freedom to define contrasts
contr.poly(n = 1, contrasts = FALSE)
#Error in contr.poly(n = 1, contrasts = FALSE) :
# contrasts not defined for 0 degrees of freedom
contr.sum(n = 1, contrasts = FALSE)
#Error in contr.sum(n = 1, contrasts = FALSE) :
# not enough degrees of freedom to define contrasts
contr.treatment(n = 1, contrasts = FALSE)
#Error in contr.treatment(n = 1, contrasts = FALSE) :
# not enough degrees of freedom to define contrasts
contr.SAS(n = 1, contrasts = FALSE)
#Error in contr.treatment(n, base = if (is.numeric(n) && length(n) == 1L) n else length(n), :
# not enough degrees of freedom to define contrasts
Actually, if you think about it carefully, you will conclude that without contrasts, a factor with a single level is just a dummy variable of all 1s, i.e., the intercept. So you can definitely do the following:
dat$New.Runner <- 1 ## set it to 1, as if no contrasts is applied
mod <- glm(formula = WL ~ New.Runner + Last.Run, family = binomial, data = dat)
#(Intercept) New.Runner Last.Run
# 1.4582 NA -0.2507
You get an NA coefficient for New.Runner due to rank-deficiency. In fact, applying contrasts is a fundamental way to avoid rank-deficiency. It is just that when a factor has only one level, application of contrasts becomes a paradox.
Let's also have a look at the model matrix:
model.matrix(mod)
# (Intercept) New.Runner Last.Run
#1 1 1 1
#2 1 1 5
#3 1 1 2
#4 1 1 6
#5 1 1 5
#6 1 1 4
#8 1 1 3
#9 1 1 7
#10 1 1 2
#11 1 1 4
#12 1 1 9
#13 1 1 8
#15 1 1 3
#16 1 1 5
#17 1 1 1
#19 1 1 6
#20 1 1 10
#21 1 1 7
#22 1 1 9
#23 1 1 2
The (Intercept) and New.Runner columns are identical, so only one of them can be estimated. If you want to estimate New.Runner, drop the intercept:
glm(formula = WL ~ 0 + New.Runner + Last.Run, family = binomial, data = dat)
#New.Runner Last.Run
# 1.4582 -0.2507
Make sure you digest the rank-deficiency issue thoroughly. If you have more than one single-level factor and you replace all of them with 1, dropping the single intercept still results in rank-deficiency.
dat$foo.factor <- 1
glm(formula = WL ~ 0 + New.Runner + foo.factor + Last.Run, family = binomial, data = dat)
#New.Runner foo.factor Last.Run
# 1.4582 NA -0.2507
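A practical alternative (a sketch, not covered in the answer above) is to drop constant variables from the model frame before fitting, so that no single-level factor ever reaches the contrasts machinery:
mf <- info$mf
constant <- vapply(mf, function(x) length(unique(x[!is.na(x)])) < 2, logical(1))
names(mf)[constant]
# [1] "New.Runner"
glm(WL ~ Last.Run, family = binomial, data = mf)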
I am having a seemingly simple but very frustrating problem. When you run a model with an interaction term in R, R names the parameter generated "var1:var2" etc. Unfortunately, this naming convention prevents me from calculating predicted values and CI's where newdata is required, because ":" is not a character that can be included in a column header, and the names in the original data frame must exactly match those in newdata. Has anyone else had this problem?
Here is a sample of my code:
wemedist2.exp = glm(survive/trials ~ sitedist + type + sitedist*type + roaddist, family = binomial(logexp(wemedata$expos)), data=wemedata)
summary(wemedist2.exp)
wemepredict3 = with(wemedata, data.frame(sitedist=mean(sitedist),roaddist=mean(roaddist), type=factor(1:2)))
wemepredict3 = cbind(wemepredict3, predict(wemedist2.exp, newdata = wemepredict3, type = "link", se = TRUE))
This produces a table with predicted values for each of the variables at the specified levels, but not interaction.
For your newdata data frame, you shouldn't include columns for the interactions. The product of the interactive variables will be calculated for you (and multiplied by the estimated coefficient) when calling predict.
For example:
Create some dummy data:
set.seed(1)
n <- 10000
X <- data.frame(x1=runif(n), x2=runif(n))
X$x1x2 <- X$x1 * X$x2
head(X)
# x1 x2 x1x2
# 1 0.2655087 0.06471249 0.017181728
# 2 0.3721239 0.67661240 0.251783646
# 3 0.5728534 0.73537169 0.421260147
# 4 0.9082078 0.11129967 0.101083225
# 5 0.2016819 0.04665462 0.009409393
# 6 0.8983897 0.13091031 0.117608474
b <- runif(4)
y <- b[1] + c(as.matrix(X) %*% b[-1]) + rnorm(n, sd=0.1)
Fit the model and compare the estimated vs. true coefficients:
M <- lm(y ~ x1 * x2, X)
summary(M)
# Call:
# lm(formula = y ~ x1 * x2, data = X)
#
# Residuals:
# Min 1Q Median 3Q Max
# -0.43208 -0.06743 -0.00170 0.06601 0.37197
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.202040 0.003906 51.72 <2e-16 ***
# x1 0.128237 0.006809 18.83 <2e-16 ***
# x2 0.156942 0.006763 23.21 <2e-16 ***
# x1:x2 0.292582 0.011773 24.85 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.09906 on 9996 degrees of freedom
# Multiple R-squared: 0.5997, Adjusted R-squared: 0.5996
# F-statistic: 4992 on 3 and 9996 DF, p-value: < 2.2e-16
b
# [1] 0.2106027 0.1147864 0.1453641 0.3099322
Create example data to predict to, and do prediction. Note that we only create x1 and x2, and do not create x1:x2:
X.predict <- data.frame(x1=runif(10), x2=runif(10))
head(X.predict)
# x1 x2
# 1 0.26037592 0.7652155
# 2 0.73988333 0.3352932
# 3 0.02650689 0.9788743
# 4 0.84083874 0.1446228
# 5 0.85052685 0.7674547
# 6 0.13568509 0.9612156
predict(M, newdata=X.predict)
# 1 2 3 4 5 6 7
# 0.4138194 0.4221251 0.3666572 0.3681432 0.6225354 0.4084543 0.4711018
# 8 9 10
# 0.7092744 0.3401867 0.2320834
Or...
An alternative approach is to include the interactions in your model-fitting data by calculating the product of the interacting terms, and then to include this product in your new data as well. We've already done the first step above, where we created a column called x1x2.
Then we would fit the model with: lm(y ~ x1 + x2 + x1x2, X)
And predict to the following data:
X.predict <- data.frame(x1=runif(10), x2=runif(10))
X.predict$x1x2 <- X.predict$x1 * X.predict$x2
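Putting the two steps of this alternative approach together (a short sketch continuing the example above):
M.alt <- lm(y ~ x1 + x2 + x1x2, X)
predict(M.alt, newdata = X.predict)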
If you have categorical variables involved in interactions...
When you have interactions involving categorical variables, the model estimates coefficients describing the effect of belonging to each level relative to belonging to a reference level. So for instance if we have one continuous predictor (x1) and one categorical predictor (x2, with levels a, b, and c), then the model y ~ x1 * x2 will estimate six coefficients, describing:
the intercept (i.e. the predicted y when x1 is zero and the observation belongs to the reference level of x2);
the effect of varying x1 when the observation belongs to the reference level of x2 (i.e. the slope, for the reference level of x2);
the effect of belonging to the second level (i.e. the change in intercept due to belonging to the second level, relative to belonging to the reference level);
the effect of belonging to the third level (i.e. the change in intercept due to belonging to the third level, relative to belonging to the reference level);
the change in the effect of x1 (i.e. change in slope) due to belonging to the second level, relative to belonging to the reference level; and
the change in the effect of x1 (i.e. change in slope) due to belonging to the third level, relative to belonging to the reference level.
If you want to fit and predict the model with/to pre-calculated data describing the interaction, you can create a dataframe that includes columns: x1; x2b (binary, indicating whether the observation belongs to level b); x2c (binary, indicating whether the observation belongs to level c); x1x2b (the product of x1 and x2b); and x1x2c (the product of x1 and x2c).
A quick way to do this is with model.matrix:
set.seed(1)
n <- 1000
d <- data.frame(x1=runif(n), x2=sample(letters[1:3], n, replace=TRUE))
head(d)
# x1 x2
# 1 0.2655087 b
# 2 0.3721239 c
# 3 0.5728534 b
# 4 0.9082078 c
# 5 0.2016819 a
# 6 0.8983897 a
X <- model.matrix(~x1*x2, d)
head(X)
# (Intercept) x1 x2b x2c x1:x2b x1:x2c
# 1 1 0.2655087 1 0 0.2655087 0.0000000
# 2 1 0.3721239 0 1 0.0000000 0.3721239
# 3 1 0.5728534 1 0 0.5728534 0.0000000
# 4 1 0.9082078 0 1 0.0000000 0.9082078
# 5 1 0.2016819 0 0 0.0000000 0.0000000
# 6 1 0.8983897 0 0 0.0000000 0.0000000
b <- rnorm(6) # coefficients
y <- X %*% b + rnorm(n, sd=0.1)
You can rename the columns of X to whatever you want, as long as you use consistent naming when predicting the model to new data later.
Now fit the model. Here I tell lm not to calculate an intercept (with -1), since the variable (Intercept) already exists in X and will have a coefficient calculated for it. We could have also done this by fitting to data as.data.frame(X[, -1]):
(M <- lm(y ~ . - 1, as.data.frame(X)))
# Call:
# lm(formula = y ~ . - 1, data = as.data.frame(X))
#
# Coefficients:
# `(Intercept)` x1 x2b x2c `x1:x2b` `x1:x2c`
# 1.14389 1.09168 -0.88879 0.20405 0.09085 -1.63769
Create some new data to predict to, and carry out the prediction:
d.predict <- expand.grid(x1=seq(0, 1, 0.1), x2=letters[1:3])
X.predict <- model.matrix(~x1*x2, d.predict)
y.predict <- predict(M, as.data.frame(X.predict))
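As a sanity check (a sketch, not part of the original answer), fitting the same interaction model directly from the formula gives the same predictions as the explicit model.matrix route:
M.direct <- lm(y ~ x1 * x2, data = cbind(d, y = as.numeric(y)))
all.equal(unname(y.predict), unname(predict(M.direct, newdata = d.predict)))
# expected: TRUE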
Let's say that I have a numeric data matrix with columns w, x, y, z and I also want to add in the columns that are equivalent to w*x, w*y, w*z, x*y, x*z, y*z since I want my covariate matrix to include all pairwise interactions.
Is there a clean and effective way to do this?
If you mean in a model formula, then the ^ operator does this.
## dummy data
set.seed(1)
dat <- data.frame(Y = rnorm(10), x = rnorm(10), y = rnorm(10), z = rnorm(10))
The formula is
form <- Y ~ (x + y + z)^2
which gives (using model.matrix() - which is used internally by the standard model fitting functions)
model.matrix(form, data = dat)
R> form <- Y ~ (x + y + z)^2
R> form
Y ~ (x + y + z)^2
R> model.matrix(form, data = dat)
(Intercept) x y z x:y x:z y:z
1 1 1.51178 0.91898 1.35868 1.389293 2.054026 1.24860
2 1 0.38984 0.78214 -0.10279 0.304911 -0.040071 -0.08039
3 1 -0.62124 0.07456 0.38767 -0.046323 -0.240837 0.02891
4 1 -2.21470 -1.98935 -0.05381 4.405817 0.119162 0.10704
5 1 1.12493 0.61983 -1.37706 0.697261 -1.549097 -0.85354
6 1 -0.04493 -0.05613 -0.41499 0.002522 0.018647 0.02329
7 1 -0.01619 -0.15580 -0.39429 0.002522 0.006384 0.06143
8 1 0.94384 -1.47075 -0.05931 -1.388149 -0.055982 0.08724
9 1 0.82122 -0.47815 1.10003 -0.392667 0.903364 -0.52598
10 1 0.59390 0.41794 0.76318 0.248216 0.453251 0.31896
attr(,"assign")
[1] 0 1 2 3 4 5 6
If you don't know how many variables you have, or it is tedious to write out all of them, use the . notation too
R> form <- Y ~ .^2
R> model.matrix(form, data = dat)
(Intercept) x y z x:y x:z y:z
1 1 1.51178 0.91898 1.35868 1.389293 2.054026 1.24860
2 1 0.38984 0.78214 -0.10279 0.304911 -0.040071 -0.08039
3 1 -0.62124 0.07456 0.38767 -0.046323 -0.240837 0.02891
4 1 -2.21470 -1.98935 -0.05381 4.405817 0.119162 0.10704
5 1 1.12493 0.61983 -1.37706 0.697261 -1.549097 -0.85354
6 1 -0.04493 -0.05613 -0.41499 0.002522 0.018647 0.02329
7 1 -0.01619 -0.15580 -0.39429 0.002522 0.006384 0.06143
8 1 0.94384 -1.47075 -0.05931 -1.388149 -0.055982 0.08724
9 1 0.82122 -0.47815 1.10003 -0.392667 0.903364 -0.52598
10 1 0.59390 0.41794 0.76318 0.248216 0.453251 0.31896
attr(,"assign")
[1] 0 1 2 3 4 5 6
The "power" in the ^ operator, here 2, controls the order of interactions. With ^2 we get second order interactions of all pairs of variables considered by the ^ operator. If you want up to 3rd-order interactions, then use ^3.
R> form <- Y ~ .^3
R> head(model.matrix(form, data = dat))
(Intercept) x y z x:y x:z y:z x:y:z
1 1 1.51178 0.91898 1.35868 1.389293 2.05403 1.24860 1.887604
2 1 0.38984 0.78214 -0.10279 0.304911 -0.04007 -0.08039 -0.031341
3 1 -0.62124 0.07456 0.38767 -0.046323 -0.24084 0.02891 -0.017958
4 1 -2.21470 -1.98935 -0.05381 4.405817 0.11916 0.10704 -0.237055
5 1 1.12493 0.61983 -1.37706 0.697261 -1.54910 -0.85354 -0.960170
6 1 -0.04493 -0.05613 -0.41499 0.002522 0.01865 0.02329 -0.001047
If you are doing a regression, you can just do something like
reg <- lm(w ~ (x + y + z)^2)
and it will figure things out for you. For example,
lm(Petal.Width ~ (Sepal.Length + Sepal.Width + Petal.Length)^2, iris)
# Call:
# lm(formula = Petal.Width ~ (Sepal.Length + Sepal.Width + Petal.Length)^2,
# data = iris)
#
# Coefficients:
# (Intercept) Sepal.Length Sepal.Width
# -1.05768 0.07628 0.22983
# Petal.Length Sepal.Length:Sepal.Width Sepal.Length:Petal.Length
# 0.47586 -0.03863 -0.03083
# Sepal.Width:Petal.Length
# 0.06493
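If you literally need the pairwise products as extra columns of a numeric matrix, rather than inside a model formula, a combn()-based sketch (not part of either answer above) does the same thing by hand:
Xmat <- as.matrix(dat[, c("x", "y", "z")])
pairs <- combn(colnames(Xmat), 2)
prods <- apply(pairs, 2, function(p) Xmat[, p[1]] * Xmat[, p[2]])
colnames(prods) <- apply(pairs, 2, paste, collapse = ":")
head(cbind(Xmat, prods))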
I'm trying to use the mice package in R for a project and discovered that the pooled results seemed to change the dummy coding I had for one of the variables in the output.
To elaborate, let's say I have a factor, foo, with two levels: 0 and 1. Using a regular lm would typically yield an estimate for foo1. Using mice and the pool function, however, yields an estimate for foo2. I included a reproducible example below using the nhanes dataset from the mice package. Any ideas why this might be occurring?
require(mice)
# Create age as: 0, 1, 2
nhanes$age <- as.factor(nhanes$age - 1)
head(nhanes)
# age bmi hyp chl
# 1 0 NA NA NA
# 2 1 22.7 1 187
# 3 0 NA 1 187
# 4 2 NA NA NA
# 5 0 20.4 1 113
# 6 2 NA NA 184
# Use a regular lm with missing data just to see output
# age1 and age2 come up as expected
lm(chl ~ age + bmi, data = nhanes)
# Call:
# lm(formula = chl ~ age + bmi, data = nhanes)
# Coefficients:
# (Intercept) age1 age2 bmi
# -28.948 55.810 104.724 6.921
imp <- mice(nhanes)
str(complete(imp)) # still the same coding
fit <- with(imp, lm(chl ~ age + bmi))
pool(fit)
# Now the estimates are for age2 and age3
# Call: pool(object = fit)
# Pooled coefficients:
# (Intercept) age2 age3 bmi
# 29.88431 43.76159 56.57606 5.05537
Apparently the mice function sets contrasts for factors. So you get the following (check out the column names):
contrasts(nhanes$age)
## 1 2
## 0 0 0
## 1 1 0
## 2 0 1
contrasts(imp$data$age)
## 2 3
## 0 0 0
## 1 1 0
## 2 0 1
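Note that the two contrast matrices are numerically identical; only the column names, and hence the coefficient labels, differ (a quick check, not in the original answer):
all.equal(unname(contrasts(nhanes$age)), unname(contrasts(imp$data$age)))
# TRUE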
You can just change the contrasts of the imputed data, then you get the same dummy coding:
imp <- mice(nhanes)
contrasts(imp$data$age) <- contrasts(nhanes$age)
fit <- with(imp, lm(chl ~ age + bmi))
pool(fit)
## Call: pool(object = fit)
##
## Pooled coefficients:
## (Intercept) age1 age2 bmi
## 0.9771566 47.6351257 63.1332336 6.2589887
##
## Fraction of information about the coefficients missing due to nonresponse:
## (Intercept) age1 age2 bmi
## 0.3210118 0.5554399 0.6421063 0.3036489
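With more recent versions of mice, the pooled results are usually inspected with summary() on the pooled object; the coefficient labels follow the same contrasts logic (a usage sketch, not part of the original answer):
summary(pool(fit))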