Description of columns in `augment.merMod`() - r

I appreciate broom.mixed ability to capture mixed-effects modeling in nice tidy formats. In assessing assumptions for the linear mixed effects model, I am finding that the augment function is particularly useful. However, the documentation fails to state what all the columns are for augment.merMod().
library(lme4)
library(broom.mixed)
set.seed(101)
dd <- expand.grid(f1 = factor(1:3),
f2 = LETTERS[1:2], g=1:9, rep=1:15,
KEEP.OUT.ATTRS=FALSE)
summary(mu <- 5*(-4 + with(dd, as.integer(f1) + 4*as.numeric(f2))))
dd$y <- rnbinom(nrow(dd), mu = mu, size = 0.5)
m.nb <- glmer.nb(y ~ f1*f2 + (1|g), data=dd, verbose=FALSE)
head(augment(m.nb))
Here is what the documentation says:
augment returns one row for each original observation, with columns (each prepended by a .) added. Included are the columns
.fitted predicted values
.resid residuals
.fixed predicted values with no random effects
Also added for "merMod" objects, but not for "mer" objects, are values from the response object within the model (of type lmResp, glmResp, nlsResp, etc). These include ".mu", ".offset", ".sqrtXwt", ".sqrtrwt", ".eta".
What are these columns: ".mu", ".sqrtXwt", ".sqrtrwt", ".eta" ? Is .fitted the predicted values on the model scale? And .mu on the response scale (in other words, the inverse link function is applied to predicted values)?

Related

How to handle zero-inflated (semi-)continuous data in R?

I would like to model / fit Value on explanatory variables Type and Material (Value ~ Material + Type). Having a look at the sample test data provided here, one could see that Material X has all zero Values except for one, which makes the distribution of Value zero-inflated, across all observations. Given the model diagnostics, linear assumptions do not hold here.
Value is a numeric variable, and all observations are independent from each other.
I would like to know how can I find a proper distribution for this data, or transform it in a way that I could handle these zeros.
I read about gamlss and pscl packages, but I struggled applying them to my data.
ID <- seq(from = 1, to = 36)
Type <- rep(c("A", "B"),each=18)
Material <- rep (c("X","Y","Z","X","Y","Z"), each = 6)
Value <- c(0,0,0,2,0,0,27,50,30,103,104,223,147,
127,115,78,148,297,0,0,0,0,0,0,84,
59,56,53,64,86,90,75,95,111,215,191)
test.data <- data.frame(ID,Type,Material,Value)
test.data$ID <- factor(test.data$ID)
test.data$Type <- factor(test.data$Type)
test.data$Material <- factor(test.data$Material)
You could try:
m1 <- gamlss(Value ~ Material + Type, sigma.fo =~ Material + Type,
family=ZIP)
ZIP(mu, sigma) is a zero inflated Poisson distribution,
which is a mixture of zero with probability sigma,
and a Poisson distribution PO(mu) with probability (1-sigma).
You could then look at the residuals using
plot(m1) or
wp(m1)
The model may not be adequate and may need a
zero inflated negative binomial distribution, ZINBI(mu,sigma,nu)
which is a mixture of zero with probability nu,
and a negative binomial distribution NBI(mu,sigma) with probability (1-nu):
m2 <- gamlss(Value ~ Material + Type, sigma.fo =~ Material + Type,
nu.fo =~ Material + Type,family=ZIPBNI)
Alternatively an interaction term may be needed for mu, (and/or sigma or nu), e.g.
m3 <- gamlss(Value ~ Material*Type, sigma.fo =~ Material + Type,
family=ZIP)

How to repeat univariate regression and extract P values?

I am using lapply to perform several glm regressions on one dependent variable by one independent variable at a time. but I'm not sure how to extract the P values at a time.
There are 200 features in my dataset, but the code below only gave me the P value of feature#1. How can I get a matrix of all P values of the 200 features?
valName<- as.data.frame(colnames(repeatData))
featureName<-valName[3,]
lapply(featureName,
function(var) {
formula <- as.formula(paste("outcome ~", var))
fit.logist <- glm(formula, data = repeatData, family = binomial)
summary(fit.logist)
Pvalue<-coef(summary(fit.logist))[,'Pr(>|z|)']
})
I
I simplified your code a little bit; (1) used reformulate() (not really different, just prettier) (2) returned only the p-value for the focal variable (not the intercept p-value). (If you leave out the 2, you'll get a 2-row matrix with intercept and focal-variable p-values.)
My example uses the built-in mtcars data set, with an added (fake) binomial response.
repeatData <- data.frame(outcome=rbinom(nrow(mtcars), size=1, prob=0.5), mtcars)
ff <- function(var) {
formula <- reformulate(var, response="outcome")
fit.logist <- glm(formula, data = repeatData, family = binomial)
coef(summary(fit.logist))[2, 'Pr(>|z|)']
}
## skip first column (response variable).
sapply(names(repeatData)[-1], ff)

Getting standardized coefficients for a glmer model?

I've been asked to provide standardized coefficients for a glmer model, but am not sure how to obtain them. Unfortunately, the beta function does not work on glmer models:
Error in UseMethod("beta") :
no applicable method for 'beta' applied to an object of class "c('glmerMod', 'merMod')"
Are there other functions I could use, or would I have to write one myself?
Another problem is that the model contains several continuous predictors (which operate on similar scales) and 2 categorical predictors (one with 4 levels, one with six levels). The purpose of using the standardized coefficients would be to compare the impact of the categorical predictors to those of the continuous ones, and I'm not sure that standardized coefficients are the appropriate way to do so. Are standardized coefficients an acceptable approach?
The model is as follows:
model=glmer(cbind(nr_corr,maximum-nr_corr) ~ (condition|SUBJECT) + categorical_1 + categorical_2 + continuous_1 + continuous_2 + continuous_3 + continuous_4 + categorical_1:categorical_2 + categorical_1:continuous_3, data, control=glmerControl(optimizer="bobyqa", optCtrl=list(maxfun=100000)), family = binomial)
reghelper::beta simply standardizes the numeric variables in our dataset. So assuming your catagorical variables are factors rather than numeric dummy variables or other contrast encodings we can fairly simply standardize the numeric variables in our dataset
vars <- grep('^continuous(.*)?', all.vars(formula(model)))
f <- function(var, data)
scale(data[[var]])
data[, vars] <- lapply(vars, f, data = data)
update(model, data = data)
Now for the more general case we can more or less just as easily create our own beta.merMod function. However we will need to take into account whether or not it makes sense to standardize y. For example if we have a poisson model only positive integer values makes sense. In addition a question becomes whether or not to scale the random slope effects or not, and whether it makes sense to ask this question in the first place. In it I assume that categorical variables are encoded as character or factor and not numeric or integer.
beta.merMod <- function(model,
x = TRUE,
y = !family(model) %in% c('binomial', 'poisson'),
ran_eff = FALSE,
skip = NULL,
...){
# Extract all names from the model formula
vars <- all.vars(form <- formula(model))
lhs <- all.vars(form[[2]])
# Get random effects from the
ranef <- names(ranef(model))
# Remove ranef and lhs from vars
rhs <- vars[!vars %in% c(lhs, ranef)]
# extract the data used for the model
env <- environment(form)
call <- getCall(model)
data <- get(dname <- as.character(call$data), envir = env)
# standardize the dataset
vars <- character()
if(isTRUE(x))
vars <- c(vars, rhs)
if(isTRUE(y))
vars <- c(vars, lhs)
if(isTRUE(ran_eff))
vars <- c(vars, ranef)
data[, vars] <- lapply(vars, function(var){
if(is.numeric(data[[var]]))
data[[var]] <- scale(data[[var]])
data[[var]]
})
# Update the model and change the data into the new data.
update(model, data = data)
}
The function works for both linear and generalized linear mixed effect models (not tested for nonlinear models), and is used just like other beta functions from reghelper
library(reghelper)
library(lme4)
# Linear mixed effect model
fm1 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
fm2 <- beta(fm1)
fixef(fm1) - fixef(fm2)
(Intercept) Days
-47.10279 -19.68157
# Generalized mixed effect model
data(cbpp)
# create numeric variable correlated with period
cbpp$nv <-
rnorm(nrow(cbpp), mean = as.numeric(levels(cbpp$period))[as.numeric(cbpp$period)])
gm1 <- glmer(cbind(incidence, size - incidence) ~ nv + (1 | herd),
family = binomial, data = cbpp)
gm2 <- beta(gm1)
fixef(gm1) - fixef(gm2)
(Intercept) nv
0.5946322 0.1401114
Note however that unlike beta the function returns the updated model not a summary of the model.
Another problem is that the model contains several continuous predictors (which operate on similar scales) and 2 categorical predictors (one with 4 levels, one with six levels). The purpose of using the standardized coefficients would be to compare the impact of the categorical predictors to those of the continuous ones, and I'm not sure that standardized coefficients are the appropriate way to do so. Are standardized coefficients an acceptable approach?
Now that is a great question and one better suited for stats.stackexchange, and not one I'm certain of the answer to.
Again, thank you so much, Oliver! For anybody who is interested in the answer regarding the last part of my question,
Another problem is that the model contains several continuous
predictors (which operate on similar scales) and 2 categorical
predictors (one with 4 levels, one with six levels). The purpose of
using the standardized coefficients would be to compare the impact of
the categorical predictors to those of the continuous ones, and I'm
not sure that standardized coefficients are the appropriate way to do
so. Are standardized coefficients an acceptable approach?
you can find the answer here. The tl;dr is that using standardized regression coefficients is not the best approach for mixed models anyways, let alone one such as mine...

Generating predictive simulations from a multilevel model with random intercepts

I am struggling to understand how, in R, to generate predictive simulations for new data using a multilevel linear regression model with a single set of random intercepts. Following the example on pp. 146-147 of this text, I can execute this task for a simple linear model with no random effects. What I can't wrap my head around is how to extend the set-up to accommodate random intercepts for a factor added to that model.
I'll use iris and some fake data to show where I'm getting stuck. I'll start with a simple linear model:
mod0 <- lm(Sepal.Length ~ Sepal.Width, data = iris)
Now let's use that model to generate 1,000 predictive simulations for 250 new cases. I'll start by making up those cases:
set.seed(20912)
fakeiris <- data.frame(Sepal.Length = rnorm(250, mean(iris$Sepal.Length), sd(iris$Sepal.Length)),
Sepal.Width = rnorm(250, mean(iris$Sepal.Length), sd(iris$Sepal.Length)),
Species = sample(as.character(unique(iris$Species)), 250, replace = TRUE),
stringsAsFactors=FALSE)
Following the example in the aforementioned text, here's what I do to get 1,000 predictive simulations for each of those 250 new cases:
library(arm)
n.sims = 1000 # set number of simulations
n.tilde = nrow(fakeiris) # set number of cases to simulate
X.tilde <- cbind(rep(1, n.tilde), fakeiris[,"Sepal.Width"]) # create matrix of predictors describing those cases; need column of 1s to multiply by intercept
sim.fakeiris <- sim(mod0, n.sims) # draw the simulated coefficients
y.tilde <- array(NA, c(n.sims, n.tilde)) # build an array to hold results
for (s in 1:n.sims) { y.tilde[s,] <- rnorm(n.tilde, X.tilde %*% sim.fakeiris#coef[s,], sim.fakeiris#sigma[s]) } # use matrix multiplication to fill that array
That works fine, and now we can do things like colMeans(y.tilde) to inspect the central tendencies of those simulations, and cor(colMeans(y.tilde), fakeiris$Sepal.Length) to compare them to the (fake) observed values of Sepal.Length.
Now let's try an extension of that simple model in which we assume that the intercept varies across groups of observations --- here, species. I'll use lmer() from the lme4 package to estimate a simple multilevel/hierarchical model that matches that description:
library(lme4)
mod1 <- lmer(Sepal.Length ~ Sepal.Width + (1 | Species), data = iris)
Okay, that works, but now what? I run:
sim.fakeiris.lmer <- sim(mod1, n.sims)
When I use str() to inspect the result, I see that it is an object of class sim.merMod with three components:
#fixedef, a 1,000 x 2 matrix with simulated coefficients for the fixed effects (the intercept and Sepal.Width)
#ranef, a 1,000 x 3 matrix with simulated coefficients for the random effects (the three species)
#sigma, a vector of length 1,000 containing the sigmas associated with each of those simulations
I can't wrap my head around how to extend the matrix construction and multiplication used for the simple linear model to this situation, which adds another dimension. I looked in the text, but I could only find an example (pp. 272-275) for a single case in a single group (here, species). The real-world task I'm aiming to perform involves running simulations like these for 256 new cases (pro football games) evenly distributed across 32 groups (home teams). I'd greatly appreciate any assistance you can offer.
Addendum. Stupidly, I hadn't looked at the details on simulate.merMod() in lme4 before posting this. I have now. It seems like it should do the trick, but when I run simulate(mod0, nsim = 1000, newdata = fakeiris), the result has only 150 rows. The values look sensible, but there are 250 rows (cases) in fakeiris. Where is that 150 coming from?
One possibility is to use the predictInterval function from the merTools package. The package is about to be submitted to CRAN, but the current developmental release is available for download from GitHub,
install.packages("devtools")
devtools::install_github("jknowles/merTools")
To get the median and a 95% credible interval of 100 simulations:
mod1 <- lmer(Sepal.Length ~ Sepal.Width + (1 | Species), data = iris)
out <- predictInterval(mod1, newdata=fakeiris, level=0.95,
n.sims=100, stat="median")
By default, predictInterval includes the residual variation, but you can
turn that feature off with:
out2 <- predictInterval(mod1, newdata=fakeiris, level=0.95,
n.sims=100, stat="median",
include.resid.var=FALSE)
Hope this helps!
This might help: it doesn't use sim(), but instead uses mvrnorm() to draw the new coefficients from the sampling distribution of the fixed-effect parameters, uses a bit of internal machinery (setBeta0) to reassign the internal values of the fixed-effect coefficients. The internal values of the random effect coefficients are automatically resampled by simulate.merMod using the default argument re.form=NA. However, the residual variance is not resampled -- it is held fixed across the simulations, which isn't 100% realistic.
In your use case, you would specify newdata=fakeiris.
library(lme4)
mod1 <- lmer(Sepal.Length ~ Sepal.Width + (1 | Species), data = iris)
simfun <- function(object,n=1,newdata=NULL,...) {
v <- vcov(object)
b <- fixef(object)
betapars <- MASS::mvrnorm(n,mu=b,Sigma=v)
npred <- if (is.null(newdata)) {
length(predict(object))
} else nrow(newdata)
res <- matrix(NA,npred,n)
for (i in 1:n) {
mod1#pp$setBeta0(betapars[i,])
res[,i] <- simulate(mod1,newdata=newdata,...)[[1]]
}
return(res)
}
ss <- simfun(mod1,100)

Is there a predict function for plm in R?

I have a small N large T panel which I am estimating via plm::plm (panel linear regression model), with fixed effects.
Is there any way to get predicted values for a new dataset? (I want to
estimate parameters on a subset of my sample, and then use these to
calculate model-implied values for the whole sample).
There are (at least) two methods in the package to produce estimates from plm objects:
-- fixef.plm: Extract the Fixed Effects
-- pmodel.response: A function to extract the model.response
It appears to me that the author(s) are not interested in providing estimates for the "random effects". It may be a matter of "if you don't know how to do it on your own, then we don't want to give you a sharp knife to cut yourself too deeply."
I wrote a function called predict.out.plm that can create predictions for the original data and for a manipulated data set (with equal column names).
The predict.out.plm calculates a) the predicted (fitted) outcome of the transformed data and b) constructs the according to level outcome. The function works for First Difference (FD) estimations and Fixed Effects (FE) estimations using plm. For FD it creates the differenced outcome over time and for FE it creates the time-demeaned outcome.
The function is largely untested, and probably only works with strongly balanced data frames.
Any suggestions and corrections are very welcome. Help to develop a small R package would be very appreciated.
The function predict.out.plm
predict.out.plm<-function(
estimate,
formula,
data,
model="fd",
pname="y",
pindex=NULL,
levelconstr=T
){
# estimate=e.fe
# formula=f
# data=d
# model="within"
# pname="y"
# pindex=NULL
# levelconstr=T
#get index of panel data
if (is.null(pindex) && class(data)[1]=="pdata.frame") {
pindex<-names(attributes(data)$index)
} else {
pindex<-names(data)[1:2]
}
if (class(data)[1]!="pdata.frame") {
data<-pdata.frame(data)
}
#model frame
mf<-model.frame(formula,data=data)
#model matrix - transformed data
mn<-model.matrix(formula,mf,model)
#define variable names
y.t.hat<-paste0(pname,".t.hat")
y.l.hat<-paste0(pname,".l.hat")
y.l<-names(mf)[1]
#transformed data of explanatory variables
#exclude variables that were droped in estimation
n<-names(estimate$aliased[estimate$aliased==F])
i<-match(n,colnames(mn))
X<-mn[,i]
#predict transformed outcome with X * beta
# p<- X %*% coef(estimate)
p<-crossprod(t(X),coef(estimate))
colnames(p)<-y.t.hat
if (levelconstr==T){
#old dataset with original outcome
od<-data.frame(
attributes(mf)$index,
data.frame(mf)[,1]
)
rownames(od)<-rownames(mf) #preserve row names from model.frame
names(od)[3]<-y.l
#merge old dataset with prediciton
nd<-merge(
od,
p,
by="row.names",
all.x=T,
sort=F
)
nd$Row.names<-as.integer(nd$Row.names)
nd<-nd[order(nd$Row.names),]
#construct predicted level outcome for FD estiamtions
if (model=="fd"){
#first observation from real data
i<-which(is.na(nd[,y.t.hat]))
nd[i,y.l.hat]<-NA
nd[i,y.l.hat]<-nd[i,y.l]
#fill values over all years
ylist<-unique(nd[,pindex[2]])[-1]
ylist<-as.integer(as.character(ylist))
for (y in ylist){
nd[nd[,pindex[2]]==y,y.l.hat]<-
nd[nd[,pindex[2]]==(y-1),y.l.hat] +
nd[nd[,pindex[2]]==y,y.t.hat]
}
}
if (model=="within"){
#group means of outcome
gm<-aggregate(nd[, pname], list(nd[,pindex[1]]), mean)
gl<-aggregate(nd[, pname], list(nd[,pindex[1]]), length)
nd<-cbind(nd,groupmeans=rep(gm$x,gl$x))
#predicted values + group means
nd[,y.l.hat]<-nd[,y.t.hat] + nd[,"groupmeans"]
}
if (model!="fd" && model!="within") {
stop('funciton works only for FD and FE estimations')
}
}
#results
results<-p
if (levelconstr==T){
results<-list(results,nd)
names(results)<-c("p","df")
}
return(results)
}
Testing the the function:
##packages
library(plm)
##test dataframe
#data structure
N<-4
G<-2
M<-5
d<-data.frame(
id=rep(1:N,each=M),
year=rep(1:M,N)+2000,
gid=rep(1:G,each=M*2)
)
#explanatory variable
d[,"x"]=runif(N*M,0,1)
#outcome
d[,"y"] = 2 * d[,"x"] + runif(N*M,0,1)
#panel data frame
d<-pdata.frame(d,index=c("id","year"))
##new data frame for out of sample prediction
dn<-d
dn$x<-rnorm(nrow(dn),0,2)
##estimate
#formula
f<- pFormula(y ~ x + factor(year))
#fixed effects or first difffernce estimation
e<-plm(f,data=d,model="within",index=c("id","year"))
e<-plm(f,data=d,model="fd",index=c("id","year"))
summary(e)
##fitted values of estimation
#transformed outcome prediction
predict(e)
c(pmodel.response(e)-residuals(e))
predict.out.plm(e,f,d,"fd")$p
# "level" outcome prediciton
predict.out.plm(e,f,d,"fd")$df$y.l.hat
#both
predict.out.plm(e,f,d,"fd")
##out of sampel prediciton
predict(e,newdata=d)
predict(e,newdata=dn)
# Error in crossprod(beta, t(X)) : non-conformable arguments
# if plm omits variables specified in the formula (e.g. one year in factor(year))
# it tries to multiply two matrices with different length of columns than regressors
# the new funciton avoids this and therefore is able to do out of sample predicitons
predict.out.plm(e,f,dn,"fd")
plm has now a predict.plm() function, although it is not documented/exported.
Note also that predict works on the transformed model (i.e. after doing the within/between/fd transformation), not the original one. I speculate that the reason for this is that it is more difficult to do prediction in a panel data framework. Indeed, you need to consider whether you are predicting:
new time periods, for existing individual and you used a individual-FE? Then you can add the prediction to the existing individual mean
new time periods, for new individual? Then you need to figure out which individual mean you are going to use?
the same is even more complicated is you use a random-effect model, as the effects are not easily derived
In the code below, I illustrate how to use fitted values, on the existing sample:
library(plm)
#> Loading required package: Formula
library(tidyverse)
data("Produc", package = "plm")
zz <- plm(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp,
data = Produc, index = c("state","year"))
## produce a dataset of prediction, added to the group means
Produc_means <- Produc %>%
mutate(y = log(gsp)) %>%
group_by(state) %>%
transmute(y_mean = mean(y),
y = y,
year = year) %>%
ungroup() %>%
mutate(y_pred = predict(zz) + y_mean) %>%
select(-y_mean)
## plot it
Produc_means %>%
gather(type, value, y, y_pred) %>%
filter(state %in% toupper(state.name[1:5])) %>%
ggplot(aes(x = year, y = value, linetype = type))+
geom_line() +
facet_wrap(~state) +
ggtitle("Visualising in-sample prediction, for 4 states")
#> Warning: attributes are not identical across measure variables;
#> they will be dropped
Created on 2018-11-20 by the reprex package (v0.2.1)
Looks like there is a new package to do in-sample predictions for a variety of models including plm
https://cran.r-project.org/web/packages/prediction/prediction.pdf
You can calculate the residuals via residuals(reg_name). From here, you can subtract them from your response variable and get the predicted values.

Resources