I have a multiply-imputed model from R's mice package in which there are lots of factor variables. For example:
library(mice)
library(Hmisc)
# turn all the variables into factors
fake = nhanes
fake$age = as.factor(nhanes$age)
fake$bmi = cut2(nhanes$bmi, g=3)
fake$chl = cut2(nhanes$chl, g=3)
head(fake)
age bmi hyp chl
1 1 <NA> NA <NA>
2 2 [20.4,25.5) 1 [187,206)
3 1 <NA> 1 [187,206)
4 3 <NA> NA <NA>
5 1 [20.4,25.5) 1 [113,187)
6 3 <NA> NA [113,187)
imput = mice(fake) # impute the factor version of the data
# big model
fit1 = glm.mids((hyp==2) ~ age + bmi + chl, data=imput, family = binomial)
I want to test the significance of each entire factor variable in the model (not the indicator variables for each level) by testing the full model against each possible nested model that drops one variable at a time. Manually, I can do:
# small model (no chl)
fit2 = glm.mids((hyp==2) ~ age + bmi, data=imput, family = binomial)
# extract p-value from pool.compare
pool.compare(fit1, fit2)$pvalue
How can I do this automatically for all the factor variables in my model? The very helpful function drop1 was suggested to me for a previous question -- now I want to do something exactly like that except for the mice case.
Possibly helpful note: An annoying feature of pool.compare is that it appears to want the "extra" variables in the larger model to be placed after the ones that are shared with the smaller model.
You can use a loop to iterate through the different combinations of predictors, after arranging them in the order required by pool.compare.
Using your fake data from above, with the number of categories reduced to two:
library(mice)
library(Hmisc)
# turn all the variables into factors
fake <- nhanes
fake$age <- as.factor(nhanes$age)
fake$bmi <- cut2(nhanes$bmi, g=2)
fake$chl <- cut2(nhanes$chl, g=2)
# Impute
imput <- mice(fake, seed=1)
# Create models
# - reduced models with one variable removed
# - full models with extra variables at end of expression
vars <- c("age", "bmi", "chl")
red <- combn(vars, length(vars)-1 , simplify=FALSE)
diffs <- lapply(red, function(i) setdiff(vars, i) )
(full <- lapply(1:length(red), function(i)
paste(c(red[[i]], diffs[[i]]), collapse=" + ")))
#[[1]]
#[1] "age + bmi + chl"
#[[2]]
#[1] "age + chl + bmi"
#[[3]]
#[1] "bmi + chl + age"
(red <- combn(vars, length(vars)-1 , FUN=paste, collapse=" + "))
#[1] "age + bmi" "age + chl" "bmi + chl"
The terms are now in the order pool.compare needs, so the formulas can be passed to the glm call. I've also swapped glm.mids for with.mids, which supersedes it - see ?glm.mids
out <- vector("list", length(red))
for( i in 1:length(red)) {
redMod <- with(imput,
glm(formula(paste("(hyp==2) ~ ", red[[i]])), family = binomial))
fullMod <- with(imput,
glm(formula(paste("(hyp==2) ~ ", full[[i]])), family = binomial))
out[[i]] <- list(predictors = diffs[[i]],
pval = c(pool.compare(fullMod, redMod)$pvalue))
}
do.call(rbind.data.frame, out)
# predictors pval
#2 chl 0.9976629
#21 bmi 0.9985028
#3 age 0.9815831
# Check manually by leaving out chl
mod1 <- with(imput, glm((hyp==2) ~ age + bmi + chl , family = binomial))
mod2 <- with(imput, glm((hyp==2) ~ age + bmi , family = binomial))
pool.compare(mod1, mod2)$pvalue
# [,1]
#[1,] 0.9976629
You will get a lot of warnings with this dataset.
EDIT
You could wrap this in a function:
impGlmDrop1 <- function(vars, outcome, Data=imput, Family="binomial")
{
red <- combn(vars, length(vars)-1 , simplify=FALSE)
diffs <- lapply(red, function(i) setdiff(vars, i))
full <- lapply(1:length(red), function(i)
paste(c(red[[i]], diffs[[i]]), collapse=" + "))
red <- combn(vars, length(vars)-1 , FUN=paste, collapse=" + ")
out <- vector("list", length(red))
for( i in 1:length(red)) {
redMod <- with(Data,
glm(formula(paste(outcome, red[[i]], sep="~")), family = Family))
fullMod <- with(Data,
glm(formula(paste(outcome, full[[i]], sep="~")), family = Family))
out[[i]] <- list(predictors = diffs[[i]],
pval = c(pool.compare(fullMod, redMod)$pvalue) )
}
do.call(rbind.data.frame, out)
}
# Run
impGlmDrop1(c("age", "bmi", "chl"), "(hyp==2)")
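The formula bookkeeping in the function above can also be kept separate from the model fitting. The following is only a sketch in base R: drop1_formulas is a hypothetical helper name, not part of mice, and it assumes the terms are plain additive main effects. It returns, for each term, the reduced formula and a full formula with the dropped term placed last, which is the ordering pool.compare appears to want.

```r
# Hypothetical helper (not part of mice): for each term, build the reduced
# formula and a full formula with the dropped term moved to the end.
drop1_formulas <- function(vars, outcome) {
  # Right-hand side as text; fall back to an intercept-only model
  rhs <- function(x) if (length(x)) paste(x, collapse = " + ") else "1"
  pairs <- lapply(vars, function(v) {
    kept <- setdiff(vars, v)
    list(
      dropped = v,
      reduced = as.formula(paste(outcome, "~", rhs(kept))),
      full    = as.formula(paste(outcome, "~", rhs(c(kept, v))))
    )
  })
  names(pairs) <- vars
  pairs
}

pairs <- drop1_formulas(c("age", "bmi", "chl"), "(hyp == 2)")
pairs$bmi$reduced  # model without bmi
pairs$bmi$full     # same terms plus bmi, with bmi last
```

Each pair could then be passed to with(imput, glm(...)) as in the loop above; that step is left out here so the sketch runs without mice installed.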
Related
I have trouble looping through a regression model dropping one observation each time to estimate the effect of influential observations.
I would like to run the model several times, each time dropping the ith observation and extracting the relevant coefficient estimate and store it in a vector. I think this could quite easily be done with a fairly straight forward loop, however, I'm stuck at the specifics.
I want to be left with a vector containing n coefficient estimates from n iterations of the same model. Any help would be beneficial!
Below I provide some dummy data and example code.
library(dplyr) # for %>% and mutate()
#Dummy data:
set.seed(489)
patientn <- 1:400
gender <- rbinom(400, 1, 0.5)
productid <- rep(c("Product A","Product B"), times=200)
country <- rep(c("USA","UK","Canada","Mexico"), each=50)
baselarea <- rnorm(400,400,60) #baseline area
baselarea2 <- rnorm(400,400,65) #baseline area2
sfactor <- c(
rep(c(0.3,0.9), times = 25),
rep(c(0.4,0.5), times = 25),
rep(c(0.2,0.4), times = 25),
rep(c(0.3,0.7), times = 25)
)
rashdummy2a <- data.frame(patientn,gender,productid,country,baselarea,baselarea2,sfactor)
data <- rashdummy2a %>% mutate(rashleft = baselarea2*sfactor/baselarea*100)
## Example of how this can be done manually:
# model
m1<-lm(rashleft ~ gender + baselarea + sfactor, data = data)
# extracting relevant coefficient estimates, each time dropping a different "patient" ("patientn")
betas <- c(lm(rashleft ~ gender + baselarea + sfactor, data = rashdummy2b, patientn !=1)$coefficients[2],
lm(rashleft ~ gender + baselarea + sfactor, data = rashdummy2b, patientn !=2)$coefficients[2],
lm(rashleft ~ gender + baselarea + sfactor, data = rashdummy2b, patientn !=3)$coefficients[2])
# the betas vector now stores the relevant coefficient estimates (coefficient nr 2, for gender) for three different variations of the model.
We can use a for loop. Your question uses an object, rashdummy2b, that is never defined, so I used data instead; replace it with whichever object you need.
#create list to bind results to
result <- list()
#loop through patients and extract betas
for(i in unique(data$patientn)){
#construct linear model
lm.model <- lm(rashleft ~ gender + baselarea + sfactor, data = subset(data, data$patientn != i))
#create data.frame containing patient left out and coefficient
result.dt <- data.frame(beta = lm.model$coefficients[[2]],
patient_left_out = i)
#bind to list
result[[i]] <- result.dt
}
#bind to data.frame
result <- do.call(rbind, result)
Result
head(result)
beta patient_left_out
1 1.381248 1
2 1.345188 2
3 1.427784 3
4 1.361674 4
5 1.420417 5
6 1.454196 6
You can drop a particular row (or column) by using a negative index. In your case, you proceed as follows:
betas <- numeric(nrow(rashdummy2b)) # memory preallocation
for (i in 1:nrow(rashdummy2b)) {
betas[i] <- lm(rashleft ~ gender + baselarea + sfactor, data=rashdummy2b[-i,])$coefficients[2]
}
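The negative-index loop above can also be written as a small reusable function. This is a minimal sketch on simulated data: loo_coef is a hypothetical name, and the data frame d stands in for the undefined rashdummy2b.

```r
# Simulated stand-in data (assumption: not the data from the question)
set.seed(1)
d <- data.frame(x = rnorm(30), g = rbinom(30, 1, 0.5))
d$y <- 1 + 2 * d$x + 0.5 * d$g + rnorm(30)

# Hypothetical helper: refit the model n times, each time leaving out
# row i, and return the coefficient named `term` from every refit.
loo_coef <- function(formula, data, term) {
  vapply(seq_len(nrow(data)), function(i) {
    fit <- lm(formula, data = data[-i, , drop = FALSE])
    coef(fit)[[term]]
  }, numeric(1))
}

betas <- loo_coef(y ~ x + g, d, "g")
length(betas)  # one estimate per left-out observation
range(betas)   # spread of the estimate across the refits
```

Selecting the coefficient by name rather than by position keeps the result correct if the order of terms in the formula changes.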
I'm fitting a y ~ v + m + s + m:s + (1|subunit) model with lmer(). s is a continuous variable interacting with m, a categorical factor with 3 levels: A, B, and C.
Fitting the model uses A as the reference level for factor m:
fit_ref_A <- lmer(y ~ 1 + v + m*s + (1|subunit), data=df)
The parameter estimates for B and C can just be calculated using the estimate for A and the differences for B and C. I'm interested in extracting the confidence intervals.
Running confint() gives the confidence interval for the slope of variable s at A. I'm also interested in the confidence intervals of the slopes of s at B and C, not just the confidence intervals for the differences from the slope at A. Is there a way to extract this from fit_ref_A? So far the only approach I've found is to relevel with B as the reference and refit, then relevel with C as the reference and refit again.
Question: Is there a way to extract everything (especially the confidence intervals) from fit_ref_A?
Code:
library(lme4)
# create the dataset, unbalanced at the lowest stratum ( 2 repeats for m==A instead of 3)
set.seed(2)
s_levels <- 1:5
m_levels <- c("A", "B", "C")
v_levels <- c("L2", "L3", "L4")
reps <- 1:3
df <- expand.grid(rep=reps, s=s_levels, m=m_levels, v=v_levels)
df$subunit <- as.factor(paste(df$v,"-",df$m,"-",df$s, sep=""))
df$y <- rnorm(nrow(df), 0, 1)
df <- subset(df, !(rep==3 & m=="A")) # drop the 3rd repeat for m=="A"
table(df$m) # shows 30 for A, 45 for B, 45 for C as expected
# fit 3 different models, with three different reference levels for 'm'
fit_ref_A <- lmer(y ~ 1 + v + m*s + (1|subunit), data=df)
df$m <- relevel(df$m, ref = "B")
fit_ref_B <- lmer(y ~ 1 + v + m*s + (1|subunit), data=df)
df$m <- relevel(df$m, ref = "C")
fit_ref_C <- lmer(y ~ 1 + v + m*s + (1|subunit), data=df)
# Calculate the confidence intervals for the continuous variable s at the three
# different levels for categorical factor m. Must use 3 separate fits.
cis_at_m_reference_A <- confint(fit_ref_A)
cis_at_m_reference_B <- confint(fit_ref_B)
cis_at_m_reference_C <- confint(fit_ref_C)
cis_at_m_reference_A["s",]
cis_at_m_reference_B["s",]
cis_at_m_reference_C["s",]
# Any way to just extract all three from fit_ref_A?
You can get approximate CIs with Gaussian error propagation:
sum(fixef(fit_ref_A)[c("s", "mB:s")]) +
c(-1.96, 1.96) * sqrt(sum(vcov(fit_ref_A)[c("s", "mB:s"), c("s", "mB:s")]))
#[1] -0.3346310 0.1863014
Or you could bootstrap:
myboot <- bootMer(fit_ref_A, function(x) {
cf <- fixef(x)
c(sA = cf[["s"]], sB = cf[["s"]] + cf[["mB:s"]], sC = cf[["s"]] + cf[["mC:s"]])
}, nsim = 1e4, seed = 42)
apply(myboot$t, 2, quantile, probs = c(0.025, 0.975))
# sA sB sC
#2.5% -0.4022927 -0.3415690 -0.3969831
#97.5% 0.2041610 0.1858731 0.1266355
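The error-propagation formula above extends to all three slopes at once: each slope is a linear combination of the fixed effects, so its variance is the matching quadratic form in the coefficient covariance matrix. The sketch below is an illustration rather than an lme4 feature. slope_cis is a hypothetical helper, the coefficient names assume treatment contrasts with A as reference, and the demo uses lm on simulated data so that it runs without lme4.

```r
# Hypothetical helper: Wald-type intervals for the slope of s at each level
# of m, given a named coefficient vector and its covariance matrix.
slope_cis <- function(beta, V, level = 0.95) {
  terms_used <- c("s", "mB:s", "mC:s")
  # Rows are the linear combinations: s, s + mB:s, s + mC:s
  L <- rbind(sA = c(1, 0, 0), sB = c(1, 1, 0), sC = c(1, 0, 1))
  est <- drop(L %*% beta[terms_used])
  se  <- sqrt(diag(L %*% V[terms_used, terms_used] %*% t(L)))
  z   <- qnorm(1 - (1 - level) / 2)
  cbind(estimate = est, lower = est - z * se, upper = est + z * se)
}

# Demo on simulated data with an ordinary linear model
set.seed(2)
d <- expand.grid(rep = 1:3, s = 1:5, m = c("A", "B", "C"))
d$m <- factor(d$m, levels = c("A", "B", "C"))
d$y <- rnorm(nrow(d))
fit <- lm(y ~ m * s, data = d)
res <- slope_cis(coef(fit), vcov(fit))
res
```

For the mixed model in the question, the equivalent call would presumably be slope_cis(fixef(fit_ref_A), as.matrix(vcov(fit_ref_A))). Like the formula above, these are normal-approximation intervals, so they will not match profile-likelihood intervals from confint() exactly.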
I currently have the following code with two functions that calculate the model fit for two distinct models. The only difference is in the lm call, where + log(v2) has been added in model 2.
R code
dat <- data.frame(clicks = runif(30, 1, 100), v1 = runif(30, 1, 100), v2 = runif(30, 1, 100))
p0 <- 1 # number of parameters in lm()
p1 <- 2 # number of parameters in lm()
n <- nrow(dat) - 1
## Model 1 Loop
model1 <- function(x) {
fit <- lm(log(clicks) ~ log(v1), data = dat, subset = 1:x, model = FALSE)
pred <- predict(fit, newdata = dat[x+1, ])
c(summary(fit)$r.squared)
}
## Model 1 Regression
result_m1 <- t(sapply(p0:n, model1))
data.frame(result_m1)
## Model 2 Loop
model2 <- function(x) {
fit <- lm(log(clicks) ~ log(v1) + log(v2), data = dat, subset = 1:x, model = FALSE)
pred <- predict(fit, newdata = dat[x+1, ])
c(summary(fit)$r.squared)
}
## Model 2 Regression
result_m2 <- t(sapply(p1:n, model2))
data.frame(result_m2)
Question: Can I somehow create a single function that loops over the different models, instead of repeating the calculation for every model?
I have something like this in mind but wasn't able to implement it: http://www.ats.ucla.edu/stat/r/pages/looping_strings.htm
I don't see the point of re-implementing something that the model-selection functions in existing packages already do.
library(leaps)
library(dplyr)
b <- regsubsets(clicks ~ ., data=dat, nbest=10, nvmax=2) # carries out exhaustive model selection (10 best models; 2 maximum predictors)
coef(b, 1:3) # returns coefficient for the 3 models in this case
[[1]]
(Intercept) v1
60.8067570 -0.2665699
[[2]]
(Intercept) v2
49.96974177 -0.05227489
[[3]]
(Intercept) v1 v2
62.02323816 -0.26422966 -0.02676747
summary(b)$rsq #provide r.squared value for 3 models
[1] 0.067952759 0.002366681 0.068568059
Running predictions is a bit more involved.
all.mods <- summary(b)$which[,-1] # gives logic output of predictors combination
all.mods
v1 v2
1 TRUE FALSE
1 FALSE TRUE
2 TRUE TRUE
RHS <- lapply(seq(nrow(all.mods)), function(m) summary(b)$which[m,-1] %>% which %>% names %>% paste(., collapse="+"))
RHS
[[1]]
[1] "v1"
[[2]]
[1] "v2"
[[3]]
[1] "v1+v2"
lm.form <- lapply(RHS, function(m)parse(text=paste("lm(clicks ~", m, ", data=dat)")))
lm.mods <- lapply(lm.form, eval) # return list of all lm.mods generated
The models in the lm.mods list can subsequently be passed to predict with a newdata argument.
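If the goal is simply to avoid one hand-written function per model, the formulas themselves can be kept in a list and passed to a single fitting function. This is a sketch in base R under stated assumptions: expanding_r2 is a hypothetical name, and the first window is started one observation above the number of variables in the formula so that every fit has at least one residual degree of freedom.

```r
set.seed(1)
dat <- data.frame(clicks = runif(30, 1, 100),
                  v1 = runif(30, 1, 100),
                  v2 = runif(30, 1, 100))

# One entry per model; add further formulas here
forms <- list(
  m1 = log(clicks) ~ log(v1),
  m2 = log(clicks) ~ log(v1) + log(v2)
)

# Hypothetical helper: R-squared of the model fitted to rows 1..x,
# for x running from `start` to nrow(data) - 1.
expanding_r2 <- function(form, data, start) {
  vapply(start:(nrow(data) - 1), function(x) {
    fit <- lm(form, data = data[seq_len(x), , drop = FALSE])
    summary(fit)$r.squared
  }, numeric(1))
}

r2 <- lapply(forms, function(f) {
  expanding_r2(f, dat, start = length(all.vars(f)) + 1)
})
str(r2)
```

A prediction for row x + 1 could be returned alongside the R-squared by calling predict(fit, newdata = data[x + 1, ]) inside the same anonymous function.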
Using R I would like to compare the RMSE (root mean square error) from two prediction models. The first model uses estimates from 1966 to 2000 to predict 2001 and then uses estimates from 1966 to 2001 to predict 2002 and so on up to 2015. The second model uses estimates from 1991 to 2000 to predict 2001 and then uses estimates from 1992 to 2001 to predict 2002 and so on up to 2015. This problem has me really stumped and I truly appreciate any help.
DF <- data.frame(YEAR=1966:2015, TEMP=rnorm(50), PRESSURE=rnorm(50), RAINFALL=rnorm(50))
lmod <- lm(TEMP ~ PRESSURE + RAINFALL, data = DF)
rmse <- function(error) sqrt(mean(error^2))
rmse(lmod$residuals)
You can loop it:
Method 1:
pred1<-numeric(0)
rmse1<-numeric(0)
for(i in 1:15){
DF.train1<-DF[DF$YEAR < 2000+i,]
DF.test1<-DF[DF$YEAR == 2000+i,]
lmod1 <- lm(TEMP ~ PRESSURE + RAINFALL, data = DF.train1)
pred1[i]<- predict(lmod1, newdata = DF.test1)
rmse1[i]<-sqrt(mean((DF.test1$TEMP-pred1[i])^2))
}
pred1
rmse1
mean(rmse1)
Method 2:
pred2<-numeric(0)
rmse2<-numeric(0)
for(i in 1:15){
DF.train2<-DF[DF$YEAR < 2000+i & DF$YEAR > 1989+i,]
DF.test2<-DF[DF$YEAR == 2000+i,]
lmod2 <- lm(TEMP ~ PRESSURE + RAINFALL, data = DF.train2)
pred2[i]<- predict(lmod2, newdata = DF.test2)
rmse2[i]<-sqrt(mean((DF.test2$TEMP-pred2[i])^2))
}
pred2
rmse2
mean(rmse2)
Comparing the individual components of rmse1 and rmse2, as well as their respective means should be useful. The vectors pred1 and pred2 contain the individual TEMP predictions for each year (2001-2015) for their respective methods.
Edit: this should be working now, and Method 2 trains on a rolling 10-year window. Also, I take RMSE to be the square root of the MSE as defined for predictors in this article.
Here is another solution, where the simulations are wrapped in a function.
The advantage of this solution is that the model specification is easy to modify.
For example, if you want to try model 2 with a window of 15 years instead of 10, just change the function input (range = 15). This also lets you run a light sensitivity analysis.
compare_models <- function(DF, start = 1966, end = 2000, range = 10)
{
require(hydroGOF)
for (i in (end+1):max(DF$YEAR))
{
# model1
lmod_1 = lm(TEMP ~ PRESSURE + RAINFALL, data = DF[DF$YEAR >= start & DF$YEAR < i,])
DF$model1_sim[DF$YEAR == i] <- predict(lmod_1, newdata = DF[DF$YEAR == i,])
# model2
lmod_2 = lm(TEMP ~ PRESSURE + RAINFALL, data = DF[DF$YEAR >= i-range & DF$YEAR < i,])
DF$model2_sim[DF$YEAR == i] <- predict(lmod_2, newdata = DF[DF$YEAR == i,])
}
return(DF)
}
I used the hydroGOF package to compute the RMSE and the NSE (Nash-Sutcliffe efficiency), which is a common indicator of model efficiency (see Nash and Sutcliffe, 1970, 11528 citations at the moment).
output = compare_models(DF)
require(hydroGOF) # compute RMSE and NSE
# RMSE
rmse(output$model1_sim,output$TEMP)
rmse(output$model2_sim,output$TEMP)
# Nash-Sutcliffe efficiency
NSE(output$model1_sim,output$TEMP, na.rm = TRUE)
NSE(output$model2_sim,output$TEMP, na.rm = TRUE)
And a simple simulated/observed plot to inspect the model predictions:
library(reshape2) # for melt()
library(ggplot2)
# melting data for plot
output_melt = melt(output[,c("TEMP", "model1_sim", "model2_sim")], id = "TEMP")
# Plot
ggplot(output_melt, aes(x = TEMP, y = value, color = variable)) +
theme_bw() + geom_point() + geom_abline(slope = 1, intercept = 0) +
xlim(-2,2) + ylim(-2,2) + xlab("Measured") + ylab("Simulated")
Here's yet another solution:
year <- 2000
time.frame <- 35
train.models <- function(year, time.frame) {
predictions <- sapply(year:(max(DF$YEAR)-1),
function(year) {
lmod <- lm(TEMP ~ PRESSURE + RAINFALL, DF,
subset = with(DF, YEAR %in% (year - time.frame + 1):year))
pred <- predict(lmod, newdata = DF[DF$YEAR == (year + 1),])
names(pred) <- year + 1
return (pred)
})
return (predictions)
}
models1 <- train.models(2000, 35)
models2 <- train.models(2000, 10) # last training year 2000, so the first prediction is 2001
rmse(models1 - DF$TEMP[DF$YEAR %in% names(models1)])
rmse(models2 - DF$TEMP[DF$YEAR %in% names(models2)])
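The two schemes in the question differ only in where the training window starts, so one function with an optional window width can cover both. The following is a sketch under assumptions: forecast_errors is a hypothetical name, window = NULL is my own convention for "use every earlier year", and the data are simulated as in the question. Unlike Method 1 and Method 2 above, the RMSE here is pooled over all forecast years rather than averaged from per-year values.

```r
set.seed(42)
DF <- data.frame(YEAR = 1966:2015, TEMP = rnorm(50),
                 PRESSURE = rnorm(50), RAINFALL = rnorm(50))

# Hypothetical helper: one-step-ahead forecast errors (observed - predicted).
# window = NULL gives an expanding window from the first year in DF;
# window = k trains on the k years immediately before the target year.
forecast_errors <- function(DF, first_target = 2001, window = NULL) {
  targets <- first_target:max(DF$YEAR)
  vapply(targets, function(yr) {
    lo    <- if (is.null(window)) min(DF$YEAR) else yr - window
    train <- DF[DF$YEAR >= lo & DF$YEAR < yr, ]
    fit   <- lm(TEMP ~ PRESSURE + RAINFALL, data = train)
    DF$TEMP[DF$YEAR == yr] - predict(fit, newdata = DF[DF$YEAR == yr, ])
  }, numeric(1))
}

rmse <- function(error) sqrt(mean(error^2))

err_expanding <- forecast_errors(DF)               # 1966..(yr - 1)
err_rolling   <- forecast_errors(DF, window = 10)  # (yr - 10)..(yr - 1)
c(expanding = rmse(err_expanding), rolling = rmse(err_rolling))
```

Because the data are pure noise, neither scheme should be expected to win here; the point is only the structure of the comparison.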
Drawing on different sources, I wrote a little function that, after a linear regression, creates a table of standard errors, t statistics and p-values, with the standard errors clustered according to a group variable "cluster". The code is as follows:
cl1 <- function(modl,clust) {
# modl is the fitted regression model
# clust is the cluster variable
# id is a unique identifier in ids
library(plm)
library(lmtest)
# Get Formula
form <- formula(modl$call)
# Get Data frame
dat <- eval(modl$call$data)
dat$row <- rownames(dat)
dat$id <- ave(dat$row, dat[[deparse(substitute(clust))]], FUN =seq_along)
pdat <- pdata.frame(dat,
index=c("id", deparse(substitute(clust)))
, drop.index= F, row.names= T)
# # Regression
reg <- plm(form, data=pdat, model="pooling")
# # Adjustments
G <- length(unique(dat[, deparse(substitute(clust))]))
N <- length(dat[,deparse(substitute(clust))])
# # Resid degrees of freedom, adjusted
dfa <- (G/(G-1))*(N-1)/reg$df.residual
d.vcov <- dfa* vcovHC(reg, type="HC0", cluster="group", adjust=T)
table <- coeftest(reg, vcov=d.vcov)
# # Output: se, t-stat and p-val
cl1out <- data.frame(table[, 2:4])
names(cl1out) <- c("se", "tstat", "pval")
# # Cluster VCE
return(cl1out)
}
For a regression like reg1 <- lm (y ~ x1 + x2 , data= df), calling the function cl1(reg1, cluster) will work just fine.
However, if I use a model like reg2 <- lm(y ~ . , data=df), I will get the error message:
Error in terms.formula(object) : '.' in formula and no 'data' argument
After some tests, I am guessing that I can't use "." to signal "use all variables in the data frame" for {plm}. Is there a way I can do this with {plm}? Otherwise, any ideas on how I could improve my function in a way that does not use {plm} and that accepts all possible specifications of a linear model?
Indeed, you can't use the . notation in a formula within the plm package.
data("Produc", package = "plm")
plm(gsp ~ .,data=Produc)
Error in terms.formula(object) : '.' in formula and no 'data' argument
One idea is to expand the formula when it contains a dot. Here is a custom function that does the job (other packages surely do this already):
expand_formula <-
function(form="A ~.",varNames=c("A","B","C")){
has_dot <- any(grepl('.',form,fixed=TRUE))
if(has_dot){
ii <- intersect(as.character(as.formula(form)),
varNames)
varNames <- setdiff(varNames, ii) # exact match, so names that merely contain the response are kept
exp <- paste0(varNames,collapse='+')
as.formula(gsub('.',exp,form,fixed=TRUE))
}
else as.formula(form)
}
Now test it:
(eform = expand_formula("gsp ~ .",names(Produc)))
# gsp ~ state + year + pcap + hwy + water + util + pc + emp + unemp
plm(eform,data=Produc)
# Model Formula: gsp ~ state + year + pcap + hwy + water + util + pc + emp + unemp
# <environment: 0x0000000014c3f3c0>
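Base R can also do the expansion itself: terms() replaces the dot when it is given a data argument, and formula() turns the result back into an ordinary formula. The sketch below wraps that in a hypothetical helper, expand_dot, and uses the built-in mtcars data so that it runs without plm; passing the result on to plm is assumed to work as with the hand-rolled version above but is not shown here.

```r
# Hypothetical helper: let terms() expand "." against the columns of `data`
expand_dot <- function(form, data) {
  formula(terms(as.formula(form), data = data))
}

f <- expand_dot("mpg ~ .", mtcars)
f                               # every other column of mtcars on the right
attr(terms(f), "term.labels")   # the expanded predictor names

# A formula without a dot passes through with its terms unchanged
expand_dot("mpg ~ wt + hp", mtcars)
```

Because the expansion is done by the formula machinery rather than by text substitution, variable names that themselves contain a dot are left alone.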