plm-package ID-YEAR Clustered Standard Errors - r

I was looking for a way to do clustered standard errors based on ID-Year clusters (each ID-Year combination gets treated like a new cluster). I found that no such functions exist for plm objects, but I had an idea and I would like to know whether it makes sense:
In my plm formula, let's say I have
p <- plm(y~x+factor(year), df, model="within", index=c("ID","Date"), effect="individual")
pce <- coeftest(p, vcov=vcovHC(p, method = "arellano", type="sss",cluster="group"))
Could I simply assign a LSDV model with an index which simply represents ID-Year combinations like this:
df$IDYEAR <- paste(df$ID,df$YEAR)
p1 <- plm(y~x+factor(year)+factor(ID), df, model="pooling", index=("IDYEAR"))
p1ce <- coeftest(p1, vcov=vcovHC(p1, method = "arellano", type="sss",cluster="group"))
This should estimate almost exactly the same model while tricking my plm function into thinking that the group level is IDYEAR so that I get the right standard errors. Is my thinking correct here?

I think a minor adjustment to vcovDC should do:
vcovDC <- function(x, ...){
  vcovHC(x, cluster="group", ...) + vcovHC(x, cluster="time", ...) -
    vcovHC(x, method="white1", ...)
}
Pretty neat explanation here.
This should work for your LSDV example, too.
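As a rough sketch of how the double-clustering function above could be applied, here is a run on the Grunfeld data that ships with plm. The data set, the model formula, and the name vcovDC_manual are illustrative assumptions, not part of the question. plm also exports a vcovDC() of its own, so the function is renamed here to avoid masking it.

```r
library(plm)
library(lmtest)

# Example panel shipped with plm (assumption: stands in for the asker's df)
data("Grunfeld", package = "plm")

p <- plm(inv ~ value + capital, data = Grunfeld,
         model = "within", index = c("firm", "year"))

# Group-clustered plus time-clustered minus plain White covariance
vcovDC_manual <- function(x, ...) {
  vcovHC(x, cluster = "group", ...) +
    vcovHC(x, cluster = "time", ...) -
    vcovHC(x, method = "white1", ...)
}

V <- vcovDC_manual(p, type = "sss")
coeftest(p, vcov. = V)
```

Passing the matrix V (rather than the function) to coeftest() keeps the extra arguments such as type in one place.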

Related

How to normalize a Lmer model?

lmer:
mixed.lmer6 <- lmer(Size ~ (Time+I(Time^2))*Country*STemperature +
(1|Country:Locality)+ (1|Locality:Individual)+(1|Batch)+
(1|Egg_masses), REML = FALSE, data = data_NoNA)
residuals:
plot_model(mixed.lmer6, type = "diag")
I tried manual log, power, and sqrt transformations in my formula, but there was no improvement. I also cannot find a suitable automatic transformation function in R such as BoxCox (which does not work for lmer models).
Any help or tips would be appreciated.
This might be better suited for CrossValidated ("what should I do?" is appropriate for CV; "how should I do it?" is best for Stack Overflow), but I'll take a crack.
The Q-Q plot is generally the last/least important diagnostic you should look at (the order should be approximately (1) check for significant bias/missed patterns in the mean [fitted vs. residual, residual vs. covariates]; (2) check for outliers/influential points [leverage, Cook's distance]; (3) check for heteroscedasticity [scale-location plot]; (4) check distributional assumptions [Q-Q plot]). The reason is that any of the "upstream" failures (e.g. missed patterns) will show up in the Q-Q plot as well; resolving them will often resolve the apparent non-Normality.
If you can fix the distributional assumptions by fixing something else about the model (adding covariates/adding interactions/adding polynomial or spline terms/removing outliers), then do that.
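A minimal sketch of that diagnostic order for a mixed model, using the sleepstudy data from lme4 as a stand-in (the data set and model are assumptions for illustration, and the influence check in step 2 is left out for brevity):

```r
library(lme4)

# Stand-in model; replace with your own fit
fit <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)

res <- residuals(fit, scaled = TRUE)  # scaled residuals
fv  <- fitted(fit)

op <- par(mfrow = c(1, 3))
# (1) bias / missed patterns in the mean
plot(fv, res, xlab = "Fitted", ylab = "Scaled residual")
abline(h = 0, lty = 2)
# (3) heteroscedasticity: scale-location plot
plot(fv, sqrt(abs(res)), xlab = "Fitted", ylab = "sqrt(|scaled residual|)")
# (4) distributional assumptions: Q-Q plot, looked at last
qqnorm(res)
qqline(res)
par(op)
```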
You could code your own brute-force Box-Cox, something like:
fitted_model <- lmer(..., data = mydata)
bcfun <- function(lambda, resp = "y") {
y <- mydata[[resp]]
mydata$newy <- if (lambda==0) log(y) else (y^lambda -1)/lambda
## https://stats.stackexchange.com/questions/261380/how-do-i-get-the-box-cox-log-likelihood-using-the-jacobian
log_jac <- sum((lambda-1)*log(y))
newfit <- update(fitted_model, newy ~ ., data = mydata)
return(-2*(c(logLik(newfit))+ log_jac))
}
lambdavec <- seq(-2, 2, by = 0.2)
boxcox <- vapply(lambdavec, bcfun, FUN.VALUE = numeric(1))
plot(lambdavec, boxcox - min(boxcox))
(lightly tested! but feel free to let me know if it doesn't work)
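For reading the resulting curve, here is a self-contained variant of the same idea on the sleepstudy data from lme4 (the data set, the random-intercept model, and the names bc_dev, lambdas, and devs are assumptions for illustration). The grid value with the smallest deviance is the approximate estimate of lambda, and values whose deviance lies within qchisq(0.95, 1) of the minimum form an approximate 95% profile interval.

```r
library(lme4)

fit0 <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy, REML = FALSE)

# Deviance on the original response scale for a given Box-Cox lambda
bc_dev <- function(lambda) {
  d <- sleepstudy
  y <- d$Reaction
  d$newy <- if (lambda == 0) log(y) else (y^lambda - 1) / lambda
  newfit <- lmer(newy ~ Days + (1 | Subject), data = d, REML = FALSE)
  log_jac <- (lambda - 1) * sum(log(y))  # Jacobian of the transformation
  -2 * (c(logLik(newfit)) + log_jac)
}

lambdas <- seq(-1, 2, by = 0.25)
devs <- vapply(lambdas, bc_dev, FUN.VALUE = numeric(1))

best <- lambdas[which.min(devs)]                      # approximate best lambda
ci <- range(lambdas[devs - min(devs) < qchisq(0.95, 1)])  # rough 95% interval on the grid
```

If lambda = 1 falls inside the interval, the transformation is unlikely to buy much.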
If you do need to fit a mixed model with a heavy-tailed residual distribution (e.g. Student t), the options are fairly limited. The brms package can fit such models (but takes you down the Bayesian/MCMC rabbit hole), and the heavy package (currently archived on CRAN) will work, but doesn't appear to handle crossed random effects.

Is it possible to use lqmm with a mira object?

I am using the package lqmm, to run a linear quantile mixed model on an imputed object of class mira from the package mice. I tried to make a reproducible example:
library(lqmm)
library(mice)
summary(airquality)
imputed<-mice(airquality,m=5)
summary(imputed)
fit1<-lqmm(Ozone~Solar.R+Wind+Temp+Day,random=~1,
tau=0.5, group= Month, data=airquality,na.action=na.omit)
fit1
summary(fit1)
fit2<-with(imputed, lqmm(Ozone~Solar.R+Wind+Temp+Day,random=~1,
tau=0.5, group= Month, na.action=na.omit))
"Error in lqmm(Ozone ~ Solar.R + Wind + Temp + Day, random = ~1, tau = 0.5, :
`data' must be a data frame"
Yes, it is possible to get lqmm() to work in mice. Viewing the code for lqmm(), it turns out that it's a picky function. It requires that the data argument is supplied, and although it appears to check if the data exists in another environment, it doesn't seem to work in this context. Fortunately, all we have to do to get this to work is capture the data supplied from mice and give it to lqmm().
fit2 <- with(imputed,
lqmm(Ozone ~ Solar.R + Wind + Temp + Day,
data = data.frame(mget(ls())),
random = ~1, tau = 0.5, group = Month, na.action = na.omit))
The explanation is that ls() gets the names of the variables available, mget() gets those variables as a list, and data.frame() converts them into a data frame.
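To see that mechanism in isolation, here is a small base R sketch with a toy data frame standing in for one completed data set (toy, vars, and rebuilt are made-up names). with() evaluates its expression in an environment built from the data, which is the same situation lqmm() finds itself in when called through mice.

```r
# Hypothetical toy data standing in for one imputed data set
toy <- data.frame(b = c(2.5, 3.5, 4.5), a = 1:3)

# Inside with(), ls() lists the columns visible in the evaluation environment
vars <- with(toy, ls())

# mget() collects those columns as a named list; data.frame() reassembles them
rebuilt <- with(toy, data.frame(mget(ls())))
str(rebuilt)
```

Since ls() returns names in sorted order, the rebuilt data frame may have its columns in a different order than the original, which does not matter for a formula interface.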
The next problem you're going to find is that mice::pool() requires tidy() and glance() methods to properly pool the multiple imputations. It looks like neither broom nor broom.mixed has those defined for lqmm. I threw together a very quick and dirty implementation, which you could use if you can't find anything else.
To get pool(fit2) to run you'll need to create the function tidy.lqmm() as below. Then pool() will assume the sample size is infinite and perform the calculations accordingly. You can also create the glance.lqmm() function before running pool(fit2), which will tell pool() the residual degrees of freedom. Afterwards you can use summary(pooled) to find the p-values.
tidy.lqmm <- function(x, conf.int = FALSE, conf.level = 0.95, ...) {
broom:::as_tidy_tibble(data.frame(
estimate = coef(x),
std.error = sqrt(
diag(summary(x, covariance = TRUE,
R = 50)$Cov[names(coef(x)),
names(coef(x))]))))
}
glance.lqmm <- function(x, ...) {
broom:::as_glance_tibble(
logLik = as.numeric(stats::logLik(x)),
df.residual = summary(x, R = 2)$rdf,
nobs = stats::nobs(x),
na_types = "rii")
}
Note: lqmm uses bootstrapping to estimate the standard error. By default it uses R = 50 bootstrapping replicates, which I've copied in the tidy.lqmm() function. You can change that line to increase the number of replicates if you like.
WARNING: Use these functions and the results with caution. I know just enough to be dangerous. To me it looks like these functions work to give sensible results, but there are probably intricacies that I'm not aware of. If you can find a more authoritative source for similar functions that work, or someone who is familiar with lqmm or pooling mixed models, I'd trust them more than me.

Time-dependent covariates- is there something wrong with this code? (R program)

I am checking a few of my Cox multivariate regression analyses' proportional hazard assumptions using time-dependent co-variates, using the survival package. The question is looking at survival in groups with different ADAMTS13 levels (a type of enzyme).
Could I check if something is wrong with my code itself? It keeps saying Error in tt(TMAdata$ADAMTS13level.f) : could not find function "tt" . Why?
Notably, ADAMTS13level.f is a factor variable.
cox_multivariate_survival_ADAMTS13 <- coxph(Surv(TMAdata$Daysalive, TMAdata$'Dead=1')
~TMAdata$ADAMTS13level.f
+TMAdata$`Age at diagnosis`
+TMAdata$CCIwithoutage
+TMAdata$Gender.f
+TMAdata$`Peak Creatinine`
+TMAdata$DICorcrit.f,
tt(TMAdata$ADAMTS13level.f),
tt = function(x, t, ...)
{mtrx <- model.matrix(~x)[,-1]
mtrx * log(t)})
Thanks- starting with the fundamentals of my actual code or typos- I have tried different permutations to no avail yet.
@Limey was on the right track!
The time-transformed version of ADAMTS13level.f needs to be added to the model, instead of being separated into a separate argument of coxph(...).
The form of coxph call when testing the time-dependent categorical variables is described in How to use the timeSplitter by Max Gordon.
Other helpful documentation:
coxph - fit proportional hazards regression model
cox_multivariate_survival_ADAMTS13 <-
coxph(
Surv(
Daysalive,
`Dead=1`
) ~
ADAMTS13level.f
+ `Age at diagnosis`
+ CCIwithoutage
+ Gender.f
+ `Peak Creatinine`
+ DICorcrit.f
+ tt(ADAMTS13level.f),
tt = function(x, t, ...) {
mtrx <- model.matrix(~x)[,-1]
mtrx * log(t)
},
data = TMAdata
)
p.s. with the original data, there was also a problem because Daysalive included a zero (0) value, which eventually resulted in an 'infinite predictor' error from coxph, probably because tt transformed the data using a log(t). (https://rdrr.io/github/therneau/survival/src/R/coxph.R)
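For a version that runs without the original data, here is a minimal sketch on the veteran data set shipped with survival (the data set, the covariates, and the offset of 20 are assumptions for illustration). It shows the tt() term sitting inside the formula, and an offset inside log() so that a follow-up time of zero would not produce an infinite predictor.

```r
library(survival)

fit <- coxph(
  Surv(time, status) ~ celltype + karno + tt(karno),  # tt() term is part of the formula
  data = veteran,
  # time transform: covariate times log of (shifted) time; the shift avoids log(0)
  tt = function(x, t, ...) x * log(t + 20)
)

summary(fit)
```

A clearly non-zero coefficient on the tt(karno) term would suggest that the effect of karno changes over time, that is, that proportional hazards does not hold for it.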

Partial residual plots for linear model including an interaction term

My model includes one response variable, five predictors and one interaction term for predictor_1 and predictor_2. I would like to plot partial residual plots for every predictor variable which I would normally realize using the crPlots function from the package car. Unfortunately the function complains that it doesn't work with models that include interaction terms.
Is there another way of doing what I want?
EDIT: I created a small example illustrating the problem
require(car)
R <- c(0.53,0.60,0.64,0.52,0.75,0.66,0.71,0.49,0.52,0.59)
P1 <- c(3.1,1.8,1.8,1.8,1.8,3.2,3.2,2.8,3.1,3.3)
P2 <- c(2.1,0.8,0.3,0.5,0.4,1.3,0.5,1.2,1.6,2.1)
lm.fit1 <- lm(R ~ P1 + P2)
summary(lm.fit1)
crPlots(lm.fit1) # works fine
lm.fit2 <- lm(R ~ P1*P2)
summary(lm.fit2)
crPlots(lm.fit2) # not available
Another way to do this is to put the interaction term in as a separate variable (which avoids hacking the code for crPlot(...)).
df <- data.frame(R,P1,P2,P1.P2=P1*P2)
lm.fit1 <- lm(R ~ ., df)
summary(lm.fit1)
crPlots(lm.fit1)
Note that summary(lm.fit1) yields exactly the same result as summary(lm(R~P1*P2,df)).
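That equivalence is easy to check directly. A short sketch with the example data from the question (fit_sep and fit_int are names made up here); only the label of the interaction coefficient differs:

```r
R  <- c(0.53, 0.60, 0.64, 0.52, 0.75, 0.66, 0.71, 0.49, 0.52, 0.59)
P1 <- c(3.1, 1.8, 1.8, 1.8, 1.8, 3.2, 3.2, 2.8, 3.1, 3.3)
P2 <- c(2.1, 0.8, 0.3, 0.5, 0.4, 1.3, 0.5, 1.2, 1.6, 2.1)
df <- data.frame(R, P1, P2, P1.P2 = P1 * P2)

fit_sep <- lm(R ~ ., df)        # interaction supplied as its own column
fit_int <- lm(R ~ P1 * P2, df)  # interaction built by the formula

# Same estimates; the names are "P1.P2" versus "P1:P2"
all.equal(unname(coef(fit_sep)), unname(coef(fit_int)))
```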
I must admit I'm not that familiar with partial residual plots, so I'm not entirely sure what the proper interpretation of them should be given an interaction term. But basically, the equivalent of
crPlot(lm.fit1, "P1")
is
x <- predict(lm.fit1, type="terms", terms="P1")
y <- residuals(lm.fit1, type="partial")[,"P1"]
plot(x, y)
abline(lm(y~x), col="red", lty=2)
loessLine(x,y,col="green3",log.x = FALSE, log.y = FALSE, smoother.args=list())
so really, there's no reason the same idea couldn't work with an interaction term as well. We just leave the partial contribution from a variable due to the interaction as a separate entity and focus on the non-interaction contribution. So what I'm going to do is take out the check for the interaction term, and then we can use the function. Assuming that
body(car:::crPlot.lm)[[11]]
# if (any(attr(terms(model), "order") > 1)) {
# stop("C+R plots not available for models with interactions.")
# }
we can copy and modify the function to create a new one without the check
crPlot2 <- car:::crPlot.lm
body(crPlot2) <- body(crPlot2)[-11]
environment(crPlot2) <- asNamespace("car")
And then we can run
layout(matrix(1:2, ncol=2))
crPlot2(lm.fit2, "P1")
crPlot2(lm.fit2, "P2")
to get the component-plus-residual plots for P1 and P2 side by side.
I'm sure the authors had a good reason for not incorporating models with interaction terms so use this hack at your own risk. It's just unclear to me what should happen to the residual from the interaction term when making the plot.
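To make concrete what happens to the interaction, here is a small base R sketch using the example data from the question (fit, pr, and manual are names made up here). For an lm, the partial residuals are the ordinary residuals plus each term's centered contribution, and with P1*P2 the interaction simply gets its own column next to P1 and P2.

```r
R  <- c(0.53, 0.60, 0.64, 0.52, 0.75, 0.66, 0.71, 0.49, 0.52, 0.59)
P1 <- c(3.1, 1.8, 1.8, 1.8, 1.8, 3.2, 3.2, 2.8, 3.1, 3.3)
P2 <- c(2.1, 0.8, 0.3, 0.5, 0.4, 1.3, 0.5, 1.2, 1.6, 2.1)

fit <- lm(R ~ P1 * P2)

# One column of partial residuals per model term
pr <- residuals(fit, type = "partial")
colnames(pr)

# Same thing by hand: residuals plus the per-term contributions
manual <- residuals(fit) + predict(fit, type = "terms")
all.equal(pr, manual, check.attributes = FALSE)
```

So a plot for "P1" in the interaction model shows only the main-effect part of P1; whatever P1 contributes through the interaction stays in the separate "P1:P2" column.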

R: How to make column of predictions for logistic regression model?

So I have a data set called x. The contents are simple enough to just write out so I'll just outline it here:
the dependent variable, Report, in the first column is binary yes/no (0 = no, 1 = yes)
the subsequent 3 columns are all categorical variables (race.f, sex.f, gender.f) that have all been converted to factors, and they're designated by numbers (e.g. 1= white, 2 = black, etc.)
I have run a logistic regression on x as follows:
glm <- glm(Report ~ race.f + sex.f + gender.f, data=x,
family = binomial(link="logit"))
And I can check the fitted probabilities by looking at summary(glm$fitted).
My question: How do I create a fifth column on the right side of this data set x that will include the predictions (i.e. fitted probabilities) for Report? Of course, I could just insert the glm$fitted as a column, but I'd like to try to write a code that predicts it based on whatever is in the race, sex, gender columns for a more generalized use.
Right now I have the following code, which I hope will create a predicted column as well as lower and upper bounds for the confidence interval.
xnew <- cbind(xnew, predict(glm5, newdata = xnew, type = "link", se = TRUE))
xnew <- within(xnew, {
PredictedProb <- plogis(fit)
LL <- plogis(fit - (1.96 * se.fit))
UL <- plogis(fit + (1.96 * se.fit))
})
Unfortunately I get the error:
Error in eval(expr, envir, enclos) : object 'race.f' not found
after the cbind code.
Anyone have any idea?
There appear to be a few typos in your code. First, the xnew line calls glm5, but your model, as far as I can see, is named glm (by the way, using glm as the name of your fitted object is probably not a good idea, since it shadows the function). Second, make sure the variable race.f is actually in the data set you want to predict from. My guess is that R can't find that variable in xnew, hence the error.
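Once the names line up, the approach in the question works. A minimal sketch on simulated data (the data frame, its values, and the object names fit and pred are made up for illustration; only the column names follow the question):

```r
set.seed(1)

# Simulated stand-in for the asker's data set x
x <- data.frame(
  Report = rbinom(200, 1, 0.4),
  race.f = factor(sample(1:3, 200, replace = TRUE)),
  sex.f  = factor(sample(1:2, 200, replace = TRUE))
)

fit <- glm(Report ~ race.f + sex.f, data = x,
           family = binomial(link = "logit"))

# Predict on the link (log-odds) scale so the interval can be built there,
# then map back to probabilities with plogis()
pred <- predict(fit, newdata = x, type = "link", se.fit = TRUE)
x$PredictedProb <- plogis(pred$fit)
x$LL <- plogis(pred$fit - 1.96 * pred$se.fit)
x$UL <- plogis(pred$fit + 1.96 * pred$se.fit)

head(x)
```

The newdata argument can be any data frame that contains race.f and sex.f with the same factor levels as the data used for fitting.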
