Using mice in R changes dummy coding - r

I'm trying to use the mice package in R for a project and discovered that the pooled results seemed to change the dummy code I had for one of the variables in the output.
To elaborate, let's say I have a factor, foo, with two levels: 0 and 1. Using a regular lm would typically yield an estimate for foo1. Using mice and the pool function, however, yields an estimate for foo2. I included a reproducible example below using the nhanes dataset from the mice package. Any ideas why this might be occurring?
require(mice)
# Create age as: 0, 1, 2
nhanes$age <- as.factor(nhanes$age - 1)
head(nhanes)
# age bmi hyp chl
# 1 0 NA NA NA
# 2 1 22.7 1 187
# 3 0 NA 1 187
# 4 2 NA NA NA
# 5 0 20.4 1 113
# 6 2 NA NA 184
# Use a regular lm with missing data just to see output
# age1 and age2 come up as expected
lm(chl ~ age + bmi, data = nhanes)
# Call:
# lm(formula = chl ~ age + bmi, data = nhanes)
# Coefficients:
# (Intercept) age1 age2 bmi
# -28.948 55.810 104.724 6.921
imp <- mice(nhanes)
str(complete(imp)) # still the same coding
fit <- with(imp, lm(chl ~ age + bmi))
pool(fit)
# Now the estimates are for age2 and age3
# Call: pool(object = fit)
# Pooled coefficients:
# (Intercept) age2 age3 bmi
# 29.88431 43.76159 56.57606 5.05537

Apparently the mice function sets contrasts for factors. So you get the following (check out the column names):
contrasts(nhanes$age)
## 1 2
## 0 0 0
## 1 1 0
## 2 0 1
contrasts(imp$data$age)
## 2 3
## 0 0 0
## 1 1 0
## 2 0 1
You can just change the contrasts of the imputed data, then you get the same dummy coding:
imp <- mice(nhanes)
contrasts(imp$data$age) <- contrasts(nhanes$age)
fit <- with(imp, lm(chl ~ age + bmi))
pool(fit)
## Call: pool(object = fit)
##
## Pooled coefficients:
## (Intercept) age1 age2 bmi
## 0.9771566 47.6351257 63.1332336 6.2589887
##
## Fraction of information about the coefficients missing due to nonresponse:
## (Intercept) age1 age2 bmi
## 0.3210118 0.5554399 0.6421063 0.3036489
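To see that only the column labels of the contrast matrix change, and not the dummy coding itself, here is a small base-R sketch that does not involve mice. The factor f, its copy g and the response y are made up for illustration; contr.treatment(3) is used only because it yields columns named 2 and 3, the same labels shown for imp$data above.

```r
# Hypothetical toy data: a three-level factor coded 0, 1, 2
f <- factor(c(0, 1, 2, 0, 1, 2))
y <- c(1, 2, 4, 1.5, 2.5, 3.5)

# Default treatment contrasts: columns are named after the levels ("1", "2")
fit_f <- lm(y ~ f)

# Same factor, but with a contrast matrix whose columns are named "2" and "3"
g <- f
contrasts(g) <- contr.treatment(3)
fit_g <- lm(y ~ g)

names(coef(fit_f))
names(coef(fit_g))

# The estimates themselves are identical; only the labels differ
all.equal(unname(coef(fit_f)), unname(coef(fit_g)))
```

Note that the pooled estimates in the two mice runs above also differ numerically. That is most likely because each call to mice(nhanes) without a seed draws new imputations, not because of the relabelling.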

Related

How to compare GAMs that include random effects and factors?

I am trying to compare two GAMs (mgcv package), say for hypothesis testing,
where both models include a random-effect smooth term (s(bs="re")) and the
second model additionally includes a factor variable. So:
gm0 <- gam(y ~ s(subject, bs = "re"))
gm1 <- gam(y ~ fac + s(subject, bs = "re"))
For reference, I will use a corresponding pair of LMMs:
lmm0 <- lmer(y ~ (1 | subject))
lmm1 <- lmer(y ~ fac + (1 | subject))
As expected, the fixed and random effects parameters are very close between
GAM and the LMM counterparts. The addition of fac improves the LMM fit,
as it should be, but seemingly not for GAMs. Unlike in LMM, there is barely any
difference in the reported likelihoods (and so AIC) between the two GAM models.
Therefore, my question is: how to compare the fit of such models?
Here's an example:
library(lme4)
library(mgcv)
# Data with one explanatory variable (2-level factor), and 'm' subjects:
n <- 100 # no. of observations
m <- 20 # no. of subjects
set.seed(666)
fac <- gl(2, n / 2, labels = LETTERS[1:2])
subject <- gl(m, n / m, labels = letters[1:m])
beta <- c(-1, 1)
y <- rnorm(n, beta[fac], .25) + rnorm(m, sd = 1.5)[subject]
##
# LMM for comparison:
lmm0 <- lmer(y ~ (1 | subject), REML = FALSE)
lmm1 <- lmer(y ~ fac + (1 | subject), REML = FALSE)
# The two LMM differ by one *df* and the likelihood of
# the second model is higher:
anova(lmm0, lmm1)
## Models:
## lmm0: y ~ (1 | subject)
## lmm1: y ~ fac + (1 | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## lmm0 3 126.49 134.30 -60.243 120.49
## lmm1 4 123.86 134.28 -57.928 115.86 4.6311 1 0.0314 *
# Now equivalent models with GAM and RE-smooth term:
gm0 <- gam(y ~ s(subject, bs = "re"), method = "ML")
gm1 <- update(gm0, . ~ . + fac)
# Fixed effect coefficients, their SE, as well as RE variance components are
# almost identical between GAM and LMM:
# FX:
cbind(coef(gm1)[1:2], fixef(lmm1))
# SE:
cbind(sqrt(diag(vcov(gm1)[1:2, 1:2])),
sqrt(diag(vcov(lmm1))))
# RE variance components:
VarCorr(lmm1)
gam.vcomp(gm1)
# In GAM, unlike in LMM, there is virtually no difference in likelihood of the
# two models in a pair.
AIC(lmm0, lmm1, gm0, gm1)
## df AIC
## lmm0 3.0 126.5
## lmm1 4.0 122.9
## gm0 20.9 30.7 <--
## gm1 20.9 30.7 <--
# Curiously, this happens when 'fac' spans 'subject's, i.e. no subject takes
# two different levels of 'fac'. Reference dfs in 'gm0' (the null model) is 19,
# i.e. number of subjects - 1. But in 'gm1', where 'fac' is included, ref. df.
# is 18.
table(fac, subject)
## subject
## fac a b c d e f g h i j k l m n o p q r s t
## A 5 5 5 5 5 5 5 5 5 5 0 0 0 0 0 0 0 0 0 0
## B 0 0 0 0 0 0 0 0 0 0 5 5 5 5 5 5 5 5 5 5
summary(gm0)
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(subject) 18.92 19 248.6 <2e-16 ***
summary(gm1)
## edf Ref.df F p-value
## s(subject) 17.91 18 208 <2e-16 ***
# However, when 'fac' intersects 'subject's, i.e. when single subject can encompass
# different levels of 'fac', ref. df is again m - 1 = 19.
gm1b <- gam(y ~ sample(fac) + s(subject, bs = "re"))
summary(gm1b)
## edf Ref.df F p-value
## s(subject) 18.92 19 246.3 <2e-16 ***

Multiplying a categorical variable with a dummy in regression

I am trying to run a regression of scores on a female dummy (taking a value of 0 or 1), and I also have the country for each respondent. I want to add a fixed effect in which female is interacted with country, but every method I try fails because I am multiplying a numeric variable by a factor.
I have tried using fastDummies, but that did not work. I also tried the country-1 approach and multiplying the result by female, with no success.
#first wrong
olss1= lm(pv1math ~ female + I(ggi*female) + factor(country) + factor(year) + I(female * factor(country)), data = f1)
# second wrong
olss1= lm(pv1math ~ female + I(ggi*female) + factor(country) + factor(year) + factor( female * country ), data = f1)
Error messages are that I cannot multiply factor with numeric
The * operator in the formula will give interactions as well as lower order terms. Here is an example:
country <- c("A", "A", "A", "B", "B", "B")
female <- c(1, 1, 0, 1, 0, 1)
y <- 1:6
fm <- lm(y ~ country * female)
fm
giving:
Call:
lm(formula = y ~ country * female)
Coefficients:
(Intercept) countryB female countryB:female
3.0 2.0 -1.5 1.5
Also we can check the model matrix
model.matrix(fm)
giving
(Intercept) countryB female countryB:female
1 1 0 1 0
2 1 0 1 0
3 1 0 0 0
4 1 1 1 1
5 1 1 0 0
6 1 1 1 1
attr(,"assign")
[1] 0 1 2 3
attr(,"contrasts")
attr(,"contrasts")$country
[1] "contr.treatment"
You won't need the I() here. * alone will perform an interaction, whereas I() will execute an arithmetic operation before the regression.
Compare:
lm(pv1math ~ ggi*female, data=dat)$coefficients
# (Intercept) ggi female ggi:female
# ... ... ... ...
lm(pv1math ~ I(ggi*female), data=dat)$coefficients
# (Intercept) I(ggi * female)
# ... ...
I() is useful e.g. for polynomials, where age is a popular candidate: pv1math ~ age + I(age^2) + I(age^3), or to binarize a dependent variable in a GLM: glm(I(pv1math > 0.75) ~ ggi*female, family=binomial).
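A short sketch of why I() matters for the polynomial case, using made-up age and score vectors (both hypothetical). Inside a formula, ^ is a formula operator rather than arithmetic, so age^2 without I() collapses back to age.

```r
# Hypothetical toy data
age   <- c(20, 25, 30, 35, 40, 45, 50, 55)
score <- c(1.0, 1.8, 2.9, 3.1, 3.6, 3.4, 3.0, 2.2)

# Without I(), ^ is a formula operator, so age^2 collapses to age
fit_plain <- lm(score ~ age + age^2)
names(coef(fit_plain))

# With I(), the square is computed arithmetically and becomes its own column
fit_quad <- lm(score ~ age + I(age^2))
names(coef(fit_quad))
```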
And, as @G.Grothendieck already wrote, you don't need to repeat the variables that are already present in the interaction term (it's just redundant), so you may want to try:
lm(pv1math ~ ggi*female + factor(year) + female*factor(country), data=f1)
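The redundancy can be checked directly. The sketch below uses a made-up data frame d (the names female, country and score are placeholders, not the asker's data) and shows that a * b expands to a + b + a:b, so writing the main effects out again adds nothing.

```r
# Hypothetical toy data: a 0/1 dummy and a two-level factor
set.seed(1)
d <- data.frame(
  female  = rep(c(0, 1, 1, 0), times = 10),
  country = factor(rep(c("A", "B"), each = 20))
)
d$score <- rnorm(40)

# a * b is shorthand for a + b + a:b
labels_star <- attr(terms(score ~ female * country), "term.labels")
labels_star

fit_star <- lm(score ~ female * country, data = d)
fit_long <- lm(score ~ female + country + female:country, data = d)

# Same coefficients, same names
all.equal(coef(fit_star), coef(fit_long))
```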

How to present accuracy of different models using caret package in the same list

I'm trying to test model performance using the caret package. I got the results for each of the models, but I would like to get a list that contains the accuracy and ROC of all the models together. How can I do it?
Here is my toy data and two models:
dat <- read.table(text = " target birds wolfs snakes
0 3 9 7
1 3 8 4
1 1 2 8
0 1 2 3
0 1 8 3
1 6 1 2
0 6 7 1
1 6 1 5
0 5 9 7
1 3 8 7
1 4 2 7
0 1 2 3
0 7 6 3
1 6 1 1
0 6 3 9
1 6 1 1 ",header = TRUE)
Here are the two models:
svmRadial <- train(target ~ ., data = dat, method='svmRadial')
glm <- train(target ~ ., data = dat, method='glm')
I would like to get a table like this as output:
ModelName Accuracy ROC
svmRadial 0.95 0.74
glm 0.93 0.7
This is essentially a question on customizing the summaryFunction. You can see a similar question here. Here is a function that is a combination of the defaultSummary and twoClassSummary functions.
mySummary <- function(data, lev = NULL, model = NULL)
{
requireNamespace("pROC")
if (!all(levels(data[, "pred"]) == levels(data[, "obs"])))
stop("levels of observed and predicted data do not match")
rocObject <- try(pROC::roc.default(data$obs, data[, lev[1]]),
silent = TRUE)
rocAUC <- if (class(rocObject)[1] == "try-error"){
NA
}else{rocObject$auc}
if (!is.factor(data$obs))
data$obs <- factor(data$obs, levels = lev)
Acc <- postResample(data[, "pred"], data[, "obs"])[1]
out <- c(Acc, rocAUC)
names(out) <- c("Accuracy","ROC")
out
}
fitControl <- trainControl(classProbs = TRUE,
summaryFunction = mySummary)
set.seed(123)
svmRadial_acc_roc <- train(as.factor(target) ~ ., data = dat, method='svmRadial', trControl=fitControl)
glm_acc_roc <- train(as.factor(target) ~ ., data = dat, method='glm', trControl=fitControl)
I believe it is considered better practice to look at the distribution of the results. To do so, you would use the resamples function.
results <- resamples(list(svm=svmRadial_acc_roc, glm=glm_acc_roc))
summary(results)
Call:
summary.resamples(object = results)
Models: svm, glm
Number of resamples: 25
Accuracy
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
svm 0.2500 0.5000 0.625 0.6034 0.6667 1.0000 0
glm 0.1667 0.4286 0.500 0.4993 0.6000 0.7143 0
ROC
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
svm 0.4444 0.5608 0.6667 0.7422 1.0 1 1
glm 0.4444 0.6250 0.6667 0.7108 0.8 1 0
That said, if you really want that simple table:
# svm had some cross-validation so pull 'best tune'
svm_result <- svmRadial_acc_roc$results[
svmRadial_acc_roc$results$C == svmRadial_acc_roc$bestTune$C,
c("Accuracy", "ROC")]
glm_result <- glm_acc_roc$results[,c("Accuracy", "ROC")]
# make data.frame
data.frame(ModelName = c("svmRadial", "glm"),
Accuracy = c(svm_result$Accuracy, glm_result$Accuracy),
ROC = c(svm_result$ROC, glm_result$ROC)
)
ModelName Accuracy ROC
1 svmRadial 0.6034444 0.7421875
2 glm 0.4993333 0.7107778
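For readers unsure what the two columns measure, here is a base-R sketch that computes both by hand on four made-up predictions. It is not how caret or pROC compute them internally; obs and score are hypothetical, the 0.5 cut-off is an assumption, and the area under the ROC curve is obtained from the rank-sum (Mann-Whitney) identity.

```r
# Hypothetical observed classes (1 = positive) and predicted probabilities
obs   <- c(0, 0, 1, 1)
score <- c(0.1, 0.4, 0.35, 0.8)

# Accuracy: share of correct class predictions at an assumed 0.5 cut-off
pred     <- as.integer(score > 0.5)
accuracy <- mean(pred == obs)

# AUC via the rank-sum (Mann-Whitney) identity
n_pos <- sum(obs == 1)
n_neg <- sum(obs == 0)
auc   <- (sum(rank(score)[obs == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

c(Accuracy = accuracy, ROC = auc)
```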

Is there a way to change the way R labels the interaction parameter in model output?

I am having a seemingly simple but very frustrating problem. When you run a model with an interaction term in R, R names the parameter generated "var1:var2" etc. Unfortunately, this naming convention prevents me from calculating predicted values and CI's where newdata is required, because ":" is not a character that can be included in a column header, and the names in the original data frame must exactly match those in newdata. Has anyone else had this problem?
Here is a sample of my code:
wemedist2.exp = glm(survive/trials ~ sitedist + type + sitedist*type + roaddist, family = binomial(logexp(wemedata$expos)), data=wemedata)
summary(wemedist2.exp)
wemepredict3 = with(wemedata, data.frame(sitedist=mean(sitedist),roaddist=mean(roaddist), type=factor(1:2)))
wemepredict3 = cbind(wemepredict3, predict(wemedist2.exp, newdata = wemepredict3, type = "link", se = TRUE))
This produces a table with predicted values for each of the variables at the specified levels, but not for the interaction.
For your newdata data frame, you shouldn't include columns for the interactions. The product of the interactive variables will be calculated for you (and multiplied by the estimated coefficient) when calling predict.
For example:
Create some dummy data:
set.seed(1)
n <- 10000
X <- data.frame(x1=runif(n), x2=runif(n))
X$x1x2 <- X$x1 * X$x2
head(X)
# x1 x2 x1x2
# 1 0.2655087 0.06471249 0.017181728
# 2 0.3721239 0.67661240 0.251783646
# 3 0.5728534 0.73537169 0.421260147
# 4 0.9082078 0.11129967 0.101083225
# 5 0.2016819 0.04665462 0.009409393
# 6 0.8983897 0.13091031 0.117608474
b <- runif(4)
y <- b[1] + c(as.matrix(X) %*% b[-1]) + rnorm(n, sd=0.1)
Fit the model and compare the estimated vs. true coefficients:
M <- lm(y ~ x1 * x2, X)
summary(M)
# Call:
# lm(formula = y ~ x1 * x2, data = X)
#
# Residuals:
# Min 1Q Median 3Q Max
# -0.43208 -0.06743 -0.00170 0.06601 0.37197
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.202040 0.003906 51.72 <2e-16 ***
# x1 0.128237 0.006809 18.83 <2e-16 ***
# x2 0.156942 0.006763 23.21 <2e-16 ***
# x1:x2 0.292582 0.011773 24.85 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.09906 on 9996 degrees of freedom
# Multiple R-squared: 0.5997, Adjusted R-squared: 0.5996
# F-statistic: 4992 on 3 and 9996 DF, p-value: < 2.2e-16
b
# [1] 0.2106027 0.1147864 0.1453641 0.3099322
Create example data to predict to, and do prediction. Note that we only create x1 and x2, and do not create x1:x2:
X.predict <- data.frame(x1=runif(10), x2=runif(10))
head(X.predict)
# x1 x2
# 1 0.26037592 0.7652155
# 2 0.73988333 0.3352932
# 3 0.02650689 0.9788743
# 4 0.84083874 0.1446228
# 5 0.85052685 0.7674547
# 6 0.13568509 0.9612156
predict(M, newdata=X.predict)
# 1 2 3 4 5 6 7
# 0.4138194 0.4221251 0.3666572 0.3681432 0.6225354 0.4084543 0.4711018
# 8 9 10
# 0.7092744 0.3401867 0.2320834
Or...
An alternative approach is to include the interaction in your model-fitting data by calculating the product of the interacting terms, and then include this product in your new data as well. We did the first step when creating the dummy data above, where we added a column called x1x2.
Then we would fit the model with: lm(y ~ x1 + x2 + x1x2, X)
And predict to the following data:
X.predict <- data.frame(x1=runif(10), x2=runif(10))
X.predict$x1x2 <- X.predict$x1 * X.predict$x2 # product of the two terms, as in the training data
If you have categorical variables involved in interactions...
When you have interactions involving categorical variables, the model estimates coefficients describing the effect of belonging to each level relative to belonging to a reference level. So for instance if we have one continuous predictor (x1) and one categorical predictor (x2, with levels a, b, and c), then the model y ~ x1 * x2 will estimate six coefficients, describing:
the intercept (i.e. the predicted y when x1 is zero and the observation belongs to the reference level of x2);
the effect of varying x1 when the observation belongs to the reference level of x2 (i.e. the slope, for the reference level of x2);
the effect of belonging to the second level (i.e. the change in intercept due to belonging to the second level, relative to belonging to the reference level);
the effect of belonging to the third level (i.e. the change in intercept due to belonging to the third level, relative to belonging to the reference level);
the change in the effect of x1 (i.e. change in slope) due to belonging to the second level, relative to belonging to the reference level; and
the change in the effect of x1 (i.e. change in slope) due to belonging to the third level, relative to belonging to the reference level.
If you want to fit and predict the model with/to pre-calculated data describing the interaction, you can create a dataframe that includes columns: x1; x2b (binary, indicating whether the observation belongs to level b); x2c (binary, indicating whether the observation belongs to level c); x1x2b (the product of x1 and x2b); and x1x2c (the product of x1 and x2c).
A quick way to do this is with model.matrix:
set.seed(1)
n <- 1000
d <- data.frame(x1=runif(n), x2=sample(letters[1:3], n, replace=TRUE))
head(d)
# x1 x2
# 1 0.2655087 b
# 2 0.3721239 c
# 3 0.5728534 b
# 4 0.9082078 c
# 5 0.2016819 a
# 6 0.8983897 a
X <- model.matrix(~x1*x2, d)
head(X)
# (Intercept) x1 x2b x2c x1:x2b x1:x2c
# 1 1 0.2655087 1 0 0.2655087 0.0000000
# 2 1 0.3721239 0 1 0.0000000 0.3721239
# 3 1 0.5728534 1 0 0.5728534 0.0000000
# 4 1 0.9082078 0 1 0.0000000 0.9082078
# 5 1 0.2016819 0 0 0.0000000 0.0000000
# 6 1 0.8983897 0 0 0.0000000 0.0000000
b <- rnorm(6) # coefficients
y <- X %*% b + rnorm(n, sd=0.1)
You can rename the columns of X to whatever you want, as long as you use consistent naming when predicting the model to new data later.
Now fit the model. Here I tell lm not to calculate an intercept (with -1), since the variable (Intercept) already exists in X and will have a coefficient calculated for it. We could have also done this by fitting to data as.data.frame(X[, -1]):
(M <- lm(y ~ . - 1, as.data.frame(X)))
# Call:
# lm(formula = y ~ . - 1, data = as.data.frame(X))
#
# Coefficients:
# `(Intercept)` x1 x2b x2c `x1:x2b` `x1:x2c`
# 1.14389 1.09168 -0.88879 0.20405 0.09085 -1.63769
Create some new data to predict to, and carry out the prediction:
d.predict <- expand.grid(x1=seq(0, 1, 0.1), x2=letters[1:3])
X.predict <- model.matrix(~x1*x2, d.predict)
y.predict <- predict(M, as.data.frame(X.predict))
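As a sanity check on the two routes described above, the following sketch fits the same model once from the formula and once from a pre-built design matrix, then compares predictions. All data here are simulated and the coefficient values are arbitrary assumptions; the point is only that predict gives the same answer either way, so pre-computing interaction columns is optional.

```r
# Hypothetical simulated data: one continuous and one three-level predictor
set.seed(1)
d <- data.frame(x1 = runif(200),
                x2 = factor(sample(letters[1:3], 200, replace = TRUE)))
d$y <- 1 + 2 * d$x1 + (d$x2 == "b") - d$x1 * (d$x2 == "c") + rnorm(200, sd = 0.1)

# Route 1: let the formula build the interaction columns
M1 <- lm(y ~ x1 * x2, data = d)

# Route 2: build the design matrix by hand and fit without an extra intercept
X <- as.data.frame(model.matrix(~ x1 * x2, d))
X$y <- d$y
M2 <- lm(y ~ . - 1, data = X)

# New data: only x1 and x2 are needed for route 1
new_d <- expand.grid(x1 = seq(0, 1, 0.25), x2 = factor(letters[1:3]))
p1 <- predict(M1, newdata = new_d)
p2 <- predict(M2, newdata = as.data.frame(model.matrix(~ x1 * x2, new_d)))

all.equal(unname(p1), unname(p2))
```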

lm options, do regression of each category [duplicate]

This question already has an answer here:
Fitting linear model / ANOVA by group [duplicate]
(1 answer)
Closed 6 years ago.
Data:
Y X levels
y1 x1 2
...
lm(Y~X,I(levels==1))
Does I(levels==1) mean that the regression is fitted only where levels==1? If not, how can I regress Y on X using only the rows where levels equals 1?
Have a look at lmList from the nlme package
set.seed(12345)
dataset <- data.frame(x = rnorm(100), y = rnorm(100), levels = gl(2, 50))
dataset$y <- with(dataset,
y + (0.1 + as.numeric(levels)) * x + 5 * as.numeric(levels)
)
library(nlme)
models <- lmList(y ~ x|levels, data = dataset)
The output is a list of lm models, one per level:
models
Call:
Model: y ~ x | levels
Data: dataset
Coefficients:
(Intercept) x
1 4.964104 1.227478
2 10.085231 2.158683
Degrees of freedom: 100 total; 96 residual
Residual standard error: 1.019202
Here is the summary of the first model:
summary(models[[1]])
Call:
lm(formula = form, data = dat, na.action = na.action)
Residuals:
Min 1Q Median 3Q Max
-2.16569 -1.04457 -0.00318 0.78667 2.65927
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.9641 0.1617 30.703 < 2e-16 ***
x 1.2275 0.1469 8.354 6.47e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.128 on 48 degrees of freedom
Multiple R-squared: 0.5925, Adjusted R-squared: 0.584
F-statistic: 69.78 on 1 and 48 DF, p-value: 6.469e-11
lm has a subset argument for this. Here is an example:
x <- rnorm(100)
y <- rnorm(100, sd=0.1)
y[1:50] <- y[1:50] + 3*x[1:50] + 10 # line y = 3x+10
y[51:100] <- y[51:100] + 8*x[51:100] - 5 # line y = 8x-5
levels <- rep(1:2, each=50, len=100)
data = data.frame(x=x, y=y, levels=levels)
lm(y ~ x, data=data, subset=levels==1) # regression for the first part
Coefficients: (Intercept) x
10.015 2.996
lm(y ~ x, data=data, subset=levels==2) # second part
Coefficients: (Intercept) x
-4.986 8.000
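When data is passed by name, an unnamed I(levels==1) is matched positionally to subset, the next formal argument of lm after formula and data. In lm(Y~X,I(levels==1)) as written in the question, with no data argument, it would be matched to data instead, so it is safer to write subset= explicitly.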
I was not sure. But this code seems to suggest that you are correct.
my.data <- "x y level
1 2 1
2 4 2
3 4 1
4 3 2
5 5 1
6 5 2
7 7 1
8 6 2
9 10 1
10 5 2"
my.data2 <- read.table(textConnection(my.data), header = T)
my.data2
lm(x ~ y,I(level==1), data=my.data2)
my.data <- "x y level
1 2 1
3 4 1
5 5 1
7 7 1
9 10 1"
my.data2 <- read.table(textConnection(my.data), header = T)
my.data2
lm(x ~ y, data=my.data2)
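The same comparison can be made programmatically. This sketch uses simulated data (dat, level and the two lines are made-up assumptions) and checks that the explicit subset argument, the positional form with data named, and fitting on pre-filtered rows all give the same coefficients. It also shows a base-R way to fit one model per level with split and lapply.

```r
# Hypothetical simulated data with two groups following different lines
set.seed(42)
dat <- data.frame(x = rnorm(60), level = rep(1:2, each = 30))
dat$y <- ifelse(dat$level == 1, 10 + 3 * dat$x, -5 + 8 * dat$x) +
  rnorm(60, sd = 0.1)

# Three ways to fit only the rows where level == 1
fit_subset     <- lm(y ~ x, data = dat, subset = level == 1)
fit_positional <- lm(y ~ x, level == 1, data = dat)  # unnamed arg fills 'subset'
fit_filtered   <- lm(y ~ x, data = dat[dat$level == 1, ])

all.equal(coef(fit_subset), coef(fit_positional))
all.equal(coef(fit_subset), coef(fit_filtered))

# One regression per level, without extra packages
fits <- lapply(split(dat, dat$level), function(d) lm(y ~ x, data = d))
sapply(fits, coef)
```

Writing subset = explicitly is the least surprising of the three, since it does not depend on which other arguments happen to be named.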
