Partial residual plots for a linear model including an interaction term in R

My model includes one response variable, five predictors, and one interaction term between predictor_1 and predictor_2. I would like to plot partial residual plots for every predictor variable, which I would normally do using the crPlots function from the car package. Unfortunately, the function complains that it doesn't work with models that include interaction terms.
Is there another way of doing what I want?
EDIT: I created a small example illustrating the problem
require(car)
R <- c(0.53,0.60,0.64,0.52,0.75,0.66,0.71,0.49,0.52,0.59)
P1 <- c(3.1,1.8,1.8,1.8,1.8,3.2,3.2,2.8,3.1,3.3)
P2 <- c(2.1,0.8,0.3,0.5,0.4,1.3,0.5,1.2,1.6,2.1)
lm.fit1 <- lm(R ~ P1 + P2)
summary(lm.fit1)
crPlots(lm.fit1) # works fine
lm.fit2 <- lm(R ~ P1*P2)
summary(lm.fit2)
crPlots(lm.fit2) # not available

Another way to do this is to put the interaction term in as a separate variable (which avoids hacking the code for crPlot(...)).
df <- data.frame(R,P1,P2,P1.P2=P1*P2)
lm.fit1 <- lm(R ~ ., df)
summary(lm.fit1)
crPlots(lm.fit1)
Note that summary(lm.fit1) yields exactly the same result as summary(lm(R~P1*P2,df)).
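As a quick sanity check of that claim (not part of the original answer), you can compare the two fits directly; only the coefficient names differ (P1.P2 vs P1:P2):
lm.int <- lm(R ~ P1*P2, df)
all.equal(unname(coef(lm.fit1)), unname(coef(lm.int)))   # TRUE
all.equal(fitted(lm.fit1), fitted(lm.int))               # TRUE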

I must admit I'm not that familiar with partial residual plots, so I'm not entirely sure what their proper interpretation should be given an interaction term. But basically, the equivalent of
crPlot(lm.fit1, "P1")
is
x <- predict(lm.fit1, type="terms", terms="P1")   # partial contribution of P1 to the fit
y <- residuals(lm.fit1, type="partial")[,"P1"]    # partial residuals for P1
plot(x, y)
abline(lm(y~x), col="red", lty=2)                 # least-squares line, as in crPlot
loessLine(x, y, col="green3", log.x = FALSE, log.y = FALSE, smoother.args=list())  # loess smoother
So really, there's no reason the same idea couldn't work with an interaction term as well. We just leave the partial contribution a variable makes through the interaction as a separate entity and focus on the non-interaction contribution. So what I'm going to do is take out the check for the interaction term, and then we can use the function. Assuming that
body(car:::crPlot.lm)[[11]]
# if (any(attr(terms(model), "order") > 1)) {
# stop("C+R plots not available for models with interactions.")
# }
we can copy and modify it to create a new function without the check
crPlot2 <- car:::crPlot.lm
body(crPlot2) <- body(crPlot2)[-11]
environment(crPlot2) <- asNamespace("car")
And then we can run
layout(matrix(1:2, ncol=2))
crPlot2(lm.fit2, "P1")
crPlot2(lm.fit2, "P2")
to get the two partial residual plots.
I'm sure the authors had a good reason for not incorporating models with interaction terms, so use this hack at your own risk. It's just unclear to me what should happen to the residual from the interaction term when making the plot.

Related

How to normalize an lmer model?

lmer:
mixed.lmer6 <- lmer(Size ~ (Time+I(Time^2))*Country*STemperature +
(1|Country:Locality)+ (1|Locality:Individual)+(1|Batch)+
(1|Egg_masses), REML = FALSE, data = data_NoNA)
residuals:
plot_model(mixed.lmer6, type = "diag")
I tried manual log, power, and sqrt transformations in my formula but saw no improvement, and I also cannot find a suitable automatic transformation function in R such as Box-Cox (which does not work for lmer models).
Any help or tips would be appreciated.
This might be better suited for CrossValidated ("what should I do?" is appropriate for CV; "how should I do it?" is best for Stack Overflow), but I'll take a crack.
The Q-Q plot is generally the last/least important diagnostic you should look at (the order should be approximately (1) check for significant bias/missed patterns in the mean [fitted vs. residual, residual vs. covariates]; (2) check for outliers/influential points [leverage, Cook's distance]; (3) check for heteroscedasticity [scale-location plot]; (4) check distributional assumptions [Q-Q plot]). The reason is that any of the "upstream" failures (e.g. missed patterns) will show up in the Q-Q plot as well; resolving them will often resolve the apparent non-Normality.
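For concreteness, here is a rough sketch of that sequence for a model like yours (the influence.ME step is my assumption; it isn't part of your code):
plot(mixed.lmer6)                                    # (1) fitted vs residuals; also plot residuals against covariates
# library(influence.ME)                              # (2) influential groups, e.g.
# plot(influence(mixed.lmer6, group = "Batch"), which = "cook")
plot(mixed.lmer6, sqrt(abs(resid(.))) ~ fitted(.))   # (3) scale-location
qqnorm(residuals(mixed.lmer6)); qqline(residuals(mixed.lmer6))   # (4) Q-Q plot, last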
If you can fix the distributional assumptions by fixing something else about the model (adding covariates/adding interactions/adding polynomial or spline terms/removing outliers), then do that.
You could code your own brute-force Box-Cox, something like
fitted_model <- lmer(..., data = mydata)
bcfun <- function(lambda, resp = "y") {
  y <- mydata[[resp]]
  ## Box-Cox transform the response
  mydata$newy <- if (lambda == 0) log(y) else (y^lambda - 1)/lambda
  ## https://stats.stackexchange.com/questions/261380/how-do-i-get-the-box-cox-log-likelihood-using-the-jacobian
  log_jac <- sum((lambda - 1)*log(y))
  newfit <- update(fitted_model, newy ~ ., data = mydata)
  ## deviance-scale profile value: lower is better
  return(-2*(c(logLik(newfit)) + log_jac))
}
lambdavec <- seq(-2, 2, by = 0.2)
boxcox <- vapply(lambdavec, bcfun, FUN.VALUE = numeric(1))
plot(lambdavec, boxcox - min(boxcox))
(lightly tested! but feel free to let me know if it doesn't work)
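If you want a numeric estimate of lambda rather than eyeballing the plot, something along these lines should work on top of the sketch above (again, only lightly thought through):
lambdavec[which.min(boxcox)]                        # best lambda on the coarse grid
opt <- optimize(bcfun, interval = range(lambdavec)) # refine within the same range
opt$minimum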
If you do need to fit a mixed model with a heavy-tailed residual distribution (e.g. Student t), the options are fairly limited. The brms package can fit such models (but takes you down the Bayesian/MCMC rabbit hole), and the heavy package (currently archived on CRAN) will work, but doesn't appear to handle crossed random effects.
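For the brms route, a minimal sketch using your formula (the object name is made up; family = student() is the key change, and fitting will be much slower than lmer):
library(brms)
mixed.brm6 <- brm(Size ~ (Time + I(Time^2))*Country*STemperature +
                    (1|Country:Locality) + (1|Locality:Individual) + (1|Batch) +
                    (1|Egg_masses),
                  family = student(),   # heavy-tailed (Student-t) residuals
                  data = data_NoNA)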

Using predict() function to fit line of predicted values onto visualized model

I've created a linear mixed effects model where the explanatory variable is mother age and the response variable is partner age, along with several random effects.
I've had no problem creating the model, but now that I'm using it to generate predicted values to fit to the graph I am having some difficulties!
As far as I can tell the code is correct and I cannot tell why this isn't working.
First I tried the following
pred <- expand.grid(Mother.age=EPPs$Mother.age)
head(pred)
tail(pred)
pred$fit <- predict(m6, newdata=EPPs, type='response', re.form=~0)
head(pred) # here I have predicted and fitted values, but they are not in order, so I tried the following
pred[order(pred$fit),] ## this said that it worked but changed nothing
plot(jitter(Genetic.father.age) ~ jitter(Mother.age), data=EPPs)
lines(fit ~ Mother.age, data=pred, col='red')
This essentially ends up as a massive back-and-forth red scribble instead of a line, so I decided to manually order the data points in Excel and try again:
pred1 <- read.csv("predictions1.csv", header=TRUE) ## These are the ordered points
head(pred1)
tail(pred1)
pred$fit <- predict(m6, newdata=EPPs, type='response', re.form=~0)
plot(jitter(Genetic.father.age) ~ jitter(Mother.age), data=EPPs)
lines(fit ~ Mother.age, data=pred1, col='red')
Now this gets me very close to a straight line; however, it still has little 'step-ups' at each year on the x-axis, and what I really want is a smooth, flat line!
Any help would be appreciated - I'm not sure what else I can do.
Oh, whilst I'm here - any recommendations for a post hoc test for comparing two lme models to see if there is a significant difference? In this case it would be the exact same model, except that the response variable in the first is genetic father age and in the second is social father age (I want to show that extra-pair males are consistently older than the cuckolded social males on average).
Thanks!!
Whole example
EPPs <- read.csv("EPPs.csv", header=TRUE) #data i am using
Turn these bad boys to factors to be used as random effects
EPPs$Mother <- as.factor(EPPs$Mother)
EPPs$Mother.Cohort <- as.factor(EPPs$Mother.Cohort)
EPPs$Brood.year <- as.factor(EPPs$Brood.year)
EPPs$Social.Father <- as.factor(EPPs$Social.Father)
Within subject centering
AveByInd <- function(x) mean(x)
d2 <- do.call("rbind", as.list(
by(EPPs, EPPs["Mother"], transform, AveMAge=AveByInd(Mother.age))))
par(mfrow=c(1,1))
hist(d2$AveMAge, xlab="Average mother age", ylab="Frequency", main="")
WithinIndCentr <- function(x) x-mean(x)
d2 <- do.call("rbind", as.list(
by(d2, d2["Mother"], transform, WithinMAge=WithinIndCentr(Mother.age))))
par(mfrow=c(1,1))
hist(d2$WithinMAge, xlab="Within-female centered age", ylab="Frequency", main="")
m6<-lmer(d2$Genetic.father.age~d2$WithinMAge+d2$AveMAge+(1|Mother)+
(1|Mother.Cohort)+(1|Brood.year)+(1|Social.Father), data=d2)
summary(m6)
The model I have selected
Above this point everything is working fine.
Now I want to use the predictions generated by my model to add a line of best fit to my graph.
I should mention I decided to try it with the Mother.age variable rather than the within-subject centered variables, as I thought it would be more straightforward.
Originally I was trying this
pred <- expand.grid(Mother.age=EPPs$Mother.age)
pred$fit <- predict(m6, newdata=EPPs, type='response', re.form=~0)
head(pred)
pred[order(pred$fit),]
I was told I had to order the matrix in order for it to work; however, this doesn't seem to work.
plot(jitter(Genetic.father.age) ~ jitter(Mother.age), data=EPPs)
lines(fit ~ Mother.age, data=pred, col='red')
This results in a horrible back-and-forth mess of red lines.
I was told by a friend that I needed to order the matrix to get it to work, so I tried doing it manually in Excel and re-uploading:
pred1 <- read.csv("predictions1.csv", header=TRUE) ##almost works
head(pred1)##predictions are in order
dev.off()
plot(jitter(Genetic.father.age) ~ jitter(Mother.age), data=EPPs)
Using jitter because these are discrete data points and I want to give an idea of areas with lots of overlapping points
lines(fit ~ Mother.age, data=pred1, col='red')
Now I get a very nearly straight line, but it still has little 'steps' and I want it to be totally smooth.
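A sketch of one way to get a clean line (my own suggestion, not code from the post): order the prediction data frame by Mother.age rather than by the fitted values, and remember to assign the result back; for a single smooth population-level line, predict on an evenly spaced grid from a model whose fixed effect is Mother.age itself, with re.form = NA to drop the random effects (the model name m7 below is hypothetical):
pred <- pred[order(pred$Mother.age), ]   # order by the x variable and keep the result
lines(fit ~ Mother.age, data = pred, col = "red")
# Or refit with Mother.age as the fixed effect and predict on a grid:
m7 <- lmer(Genetic.father.age ~ Mother.age + (1|Mother) + (1|Mother.Cohort) +
             (1|Brood.year) + (1|Social.Father), data = EPPs)
grid <- data.frame(Mother.age = seq(min(EPPs$Mother.age), max(EPPs$Mother.age), length.out = 100))
grid$fit <- predict(m7, newdata = grid, re.form = NA)
plot(jitter(Genetic.father.age) ~ jitter(Mother.age), data = EPPs)
lines(fit ~ Mother.age, data = grid, col = "red")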

plm-package ID-YEAR Clustered Standard Errors

I was looking for a way to do clustered standard errors based on ID-Year clusters (each ID-Year combination gets treated like a new cluster). I found that no such functions exist for plm objects, but I had an idea and I would like to know whether it makes sense:
In my plm formula, let's say I have
p <- plm(y~x+factor(year), df, model="within", index=c("ID","Date"), effect="individual")
pce <- coeftest(p, vcov=vcovHC(p, method = "arellano", type="sss",cluster="group"))
Could I simply assign a LSDV model with an index which simply represents ID-Year combinations like this:
df$IDYEAR <- paste(df$ID,df$YEAR)
p1 <- plm(y~x+factor(year)+factor(ID), df, model="pooling", index="IDYEAR")
p1ce <- coeftest(p1, vcov=vcovHC(p1, method = "arellano", type="sss",cluster="group"))
This should estimate almost exactly the same model while tricking my plm function into thinking that the group level is IDYEAR so that I get the right standard errors. Is my thinking correct here?
I think a minor adjustment to vcovDC should do:
vcovDC <- function(x, ...){
  vcovHC(x, cluster="group", ...) + vcovHC(x, cluster="time", ...) -
    vcovHC(x, method="white1", ...)
}
Pretty neat explanation here.
This should work for your LSDV example, too.
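Usage is then just your coeftest calls with the double-clustered covariance swapped in (a sketch, assuming lmtest is loaded as in your snippet; the object names are mine):
pce.dc  <- coeftest(p,  vcov = vcovDC(p,  type = "sss"))
p1ce.dc <- coeftest(p1, vcov = vcovDC(p1, type = "sss"))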

How to write a function to check model assumptions for a linear model in R?

I'm making a lot of models in R and trying to check the model assumptions for all of them. It would be awesome if I could write a function to do it all in one go, but it doesn't seem to be working.
I have:
assumptionfunction <- function(y, modelobject){
plot(x)
plot(y, x$residuals)
qqnorm(x$residuals)
}
And I'm getting lots of errors.
Instead of creating your own function, you can use an existing one. The beautiful check_model() function from the performance package does just that:
library(performance)
library(see)
model <- lm(mpg ~ wt * cyl + gear, data = mtcars)
check_model(model)
If you insist on using some objective tests, there is the gvlma package.
library(gvlma)
gvlma(model)
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05

                      Value p-value                Decision
Global Stat        1.770046  0.7780 Assumptions acceptable.
Skewness           0.746520  0.3876 Assumptions acceptable.
Kurtosis           0.003654  0.9518 Assumptions acceptable.
Link Function      0.927065  0.3356 Assumptions acceptable.
Heteroscedasticity 0.092807  0.7606 Assumptions acceptable.
Now if you don't like gvlma because it doesn't explicitly name the tests used, and gives Skewness and Kurtosis but not overall normality from, say, Shapiro-Wilk, I made a convenience function. It gets all the test names and assumptions at once, with the total number of assumptions that are not respected. You can take it and modify it to suit your needs.
# Load the function:
source("https://raw.githubusercontent.com/RemPsyc/niceplots/master/niceAssFunction.R")
View(niceAss(model))
Interpretation: p values < .05 imply the assumptions are not respected.
The Diagnostic column is how many assumptions are not respected for a given model or variable.
Applied to a list of models:
# Define our dependent variables
(DV <- names(mtcars[-1]))
# Make list of all formulas
(formulas <- paste(DV, "~ mpg"))
# Make list of all models
models.list <- sapply(X = formulas, FUN = lm, data = mtcars, simplify = FALSE, USE.NAMES = TRUE)
# Make diagnostic table
(ass.table <- do.call("rbind", lapply(models.list, niceAss)))
# Use the Viewer for better results
View(ass.table)
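As for the function sketched in the question itself, a corrected version (my guess at the intended behaviour: pass the fitted model plus the variable you want plotted against the residuals) might look like:
assumptionfunction <- function(y, modelobject){
  op <- par(mfrow = c(2, 3)); on.exit(par(op))
  plot(modelobject)                  # the four standard lm diagnostic plots
  plot(y, residuals(modelobject))    # chosen variable vs residuals
  qqnorm(residuals(modelobject))
}
assumptionfunction(mtcars$wt, model)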

"Vectorizing" this for-loop in R? (suppressing interaction main effects in lm)

When interactions are specified in lm, R includes main effects by default, with no option to suppress them. This is usually appropriate and convenient, but there are certain instances (within estimators, ratio LHS variables, among others) where this isn't appropriate.
I've got this code that fits a log-transformed variable to a response variable, independently within subsets of the data.
Here is a silly yet reproducible example:
id = as.factor(c(1,2,2,3,3,3,4,4,4,4,5,5,5,5,6,7,7,8,8,8,9,9,9,9,10))
x = rexp(length(id))
y = rnorm(length(id))
logx = log(x)
data = data.frame(id, y, logx)
for (i in data$id){
  sub = subset(data, id==i)               # This splits the data by id
  m = lm(y~logx-1, data=sub)              # This gives me the linear (log) fit for one of my id's
  sub$x.tilde = log(1+3)*m$coef           # This linearizes it and gives me the expected value for x=3
  data$x.tilde[data$id==i] = sub$x.tilde  # This puts it back into the main dataset
  data$tildecoeff[data$id==i] = m$coef    # This saves the coefficient (I use it elsewhere for plotting)
}
I want to fit a model like the following:
Y = B(X*id) + e
with no intercept and no main effect of id. As you can see from the loop, I'm interested in the expectation of Y when X=3, constraining the fit through the origin (because Y is a (logged) ratio of Y[X=something]/Y[X=0]).
But if I specify
m = lm(Y~X*as.factor(id)-1)
there is no means of suppressing the main effects of id. I need to run this loop several hundred times in an iterative algorithm, and as a loop it is far too slow.
The other upside of de-looping this code is that it'll be much more convenient to get prediction intervals.
(Please, I don't need pious comments about how leaving out main effects and intercepts is improper -- it usually is, but I can promise that it isn't in this instance).
Thanks in advance for any ideas!
I think you want
m <- lm(y ~ 0 + logx : as.factor(id))
see R-intro, section 11.1 'Defining statistical models; formulae'
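Applied to the example data from the question, that gives one slope per id with no intercept and no id main effect, and the per-row expected value at x = 3 can then be computed without the loop (the coefficient names assumed below, "logx:id1" etc., are the ones lm generates for this parameterization):
m <- lm(y ~ 0 + logx:id, data = data)            # id is already a factor in the example
coef(m)                                          # one slope per level of id
data$tildecoeff2 <- coef(m)[paste0("logx:id", data$id)]
data$x.tilde2 <- log(1 + 3) * data$tildecoeff2   # matches x.tilde from the loop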
