Why can't I include country dummies in my fixed effects model? - r

First of all I build the following dataframe (country_Id as factor variable and year as numeric):
mydata = pdata.frame(mydata, index = c("country_Id","year"),row.names = TRUE)
Then I check it with:
index(mydata)
pdim(mydata)
is.pconsecutive(mydata)
class(mydata)
Everything seems to be fine, but when I try to include country dummies in the fixed-effects model, it does not work:
femodel_1 <- plm(y~x + ldvx + factor(country_Id) , data= mydata, model = "within")
Another problem is that my random-effects model reports an individual variance of 0:
remodel_1 <- plm(y~x + ldvx , data= mydata, model = "random")
Unfortunately, I cannot find the problem.

Your one-way fixed effects model already takes care of the country fixed effects. This is why you cannot add them to the model's formula again: the within transformation removes everything that is constant within a country, so the dummies simply drop out. So you are fine with:
femodel_1 <- plm(y~x + ldvx, data= mydata, model = "within")
If you want the values of the country fixed effects, use fixef(femodel_1).
About your random effect model:
The Swamy-Arora RE estimator does not guarantee positive variance estimates (read up on this in a good econometrics textbook or look here: https://stats.stackexchange.com/questions/176827/error-in-plm-random-effects-swamy-arora-swar-estimator-with-lagged-dependent/181444#181444). You can try to change your model (if it makes sense) and/or change your data a bit (e.g., add more observations). You can also try switching to a different RE estimator; plm offers a few.
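To make the point about the absorbed dummies concrete, here is a minimal sketch on simulated data (the data frame d, the variable names and the true slope of 1.5 are all made up for illustration, they are not the asker's data). It shows in base R that demeaning by country gives the same slope as a regression with explicit country dummies, and that a country dummy becomes a column of zeros after demeaning. The last part only runs if plm is installed and illustrates fixef() and the random.method argument, assuming the simulated panel is well behaved enough for the chosen estimator.

```r
set.seed(1)

# Hypothetical balanced panel: 5 countries observed over 10 years
d <- data.frame(country = factor(rep(1:5, each = 10)),
                year    = rep(1:10, times = 5))
alpha <- c(-2, -1, 0, 1, 2)                     # made-up country effects
d$x <- rnorm(50) + alpha[as.integer(d$country)]
d$y <- 1.5 * d$x + alpha[as.integer(d$country)] + rnorm(50)

# (1) Least-squares dummy variables: explicit country dummies
lsdv <- lm(y ~ x + country, data = d)

# (2) Within estimator by hand: subtract country means, then run OLS
d$y_dm <- d$y - ave(d$y, d$country)
d$x_dm <- d$x - ave(d$x, d$country)
within_fit <- lm(y_dm ~ x_dm - 1, data = d)

# Both approaches give the same slope on x
slope_lsdv   <- unname(coef(lsdv)["x"])
slope_within <- unname(coef(within_fit)["x_dm"])
print(all.equal(slope_lsdv, slope_within))

# A country dummy is constant within country, so demeaning wipes it out
dummy2    <- as.numeric(d$country == "2")
dummy2_dm <- dummy2 - ave(dummy2, d$country)
print(range(dummy2_dm))

# Optional: the same model in plm, plus an alternative RE variance estimator
if (requireNamespace("plm", quietly = TRUE)) {
  fe <- plm::plm(y ~ x, data = d, index = c("country", "year"),
                 model = "within")
  print(plm::fixef(fe))                         # estimated country effects
  re <- plm::plm(y ~ x, data = d, index = c("country", "year"),
                 model = "random", random.method = "amemiya")
  print(plm::ercomp(re))                        # estimated variance components
}
```

Other values accepted by random.method include "swar" (the default), "walhus" and "nerlove"; whether any of them suits a given data set is a modelling question this sketch does not settle.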

Related

Quasi-Poisson mixed-effect model on overdispersed count data from multiple imputed datasets in R

I'm dealing with problems of three parts that I can solve separately, but now I need to solve them together:
extremely skewed, over-dispersed dependent count variable (the number of incidents while doing something),
necessity to include random effects,
lots of missing values -> multiple imputation -> 10 imputed datasets.
To solve the first two parts, I chose a quasi-Poisson mixed-effect model. Since stats::glm isn't able to include random effects properly (or I haven't figured it out) and lme4::glmer doesn't support the quasi-families, I worked with glmer(family = "poisson") and then adjusted the std. errors, z statistics and p-values as recommended here and discussed here. So I basically turn Poisson mixed-effect regression into quasi-Poisson mixed-effect regression "by hand".
This is all good with one dataset. But I have 10 of them.
I roughly understand the procedure of analyzing multiple imputed datasets – 1. imputation, 2. model fitting, 3. pooling results (I'm using mice library). I can do these steps for a Poisson regression but not for a quasi-Poisson mixed-effect regression. Is it even possible to A) pool across models based on a quasi-distribution, B) get residuals from a pooled object (class "mipo")? I'm not sure. Also I'm not sure how to understand the pooled results for mixed models (I miss random effects in the pooled output; although I've found this page which I'm currently trying to go through).
Can I get some help, please? Any suggestions on how to complete the analysis (addressing all three issues above) would be highly appreciated.
Example of data is here (repre_d_v1 and repre_all_data are stored in there) and below is a crucial part of my code.
library(dplyr); library(tidyr); library(tidyverse); library(lme4); library(broom.mixed); library(mice)
# please download "qP_data.RData" from the last link above and load them
## ===========================================================================================
# quasi-Poisson mixed model from single data set (this is OK)
# first run Poisson regression on df "repre_d_v1", then turn it into quasi-Poisson
modelSingle = glmer(Y ~ Gender + Age + Xi + Age:Xi + (1|Country) + (1|Participant_ID),
family = "poisson",
data = repre_d_v1)
# I know there are some warnings but it's because I share only a modified subset of data with you (:
printCoefmat(coef(summary(modelSingle))) # unadjusted coefficient table
# define quasi-likelihood adjustment function
quasi_table = function(model, ctab = coef(summary(model))) {
phi = sum(residuals(model, type = "pearson")^2) / df.residual(model)
qctab = within(as.data.frame(ctab),
{`Std. Error` = `Std. Error`*sqrt(phi)
`z value` = Estimate/`Std. Error`
`Pr(>|z|)` = 2*pnorm(abs(`z value`), lower.tail = FALSE)
})
return(qctab)
}
printCoefmat(quasi_table(modelSingle)) # done, makes sense
## ===========================================================================================
# now let's work with more than one data set
# object "repre_all_data" of class "mids" contains 10 imputed data sets
# fit model using with() function, then pool()
modelMultiple = with(data = repre_all_data,
expr = glmer(Y ~ Gender + Age + Xi + Age:Xi + (1|Country) + (1|Participant_ID),
family = "poisson"))
summary(pool(modelMultiple)) # class "mipo" ("mipo.summary")
# this has quite similar structure as coef(summary(someGLM))
# but I don't see where are the random effects?
# and more importantly, I wanted a quasi-Poisson model, not just Poisson model...
# ...but here it is not possible to use quasi_table function (defined earlier)...
# ...and that's because I can't compute "phi"
This seems reasonable, with the caveat that I'm only thinking about the computation, not whether this makes statistical sense. What I'm doing here is computing the dispersion for each of the individual fits and then applying it to the summary table, using a variant of the machinery that you posted above.
## compute dispersion values
phivec <- vapply(modelMultiple$analyses,
function(model) sum(residuals(model, type = "pearson")^2) / df.residual(model),
FUN.VALUE = numeric(1))
phi_mean <- mean(phivec)
ss <- summary(pool(modelMultiple)) # class "mipo" ("mipo.summary")
## adjust
qctab <- within(as.data.frame(ss),
{ std.error <- std.error*sqrt(phi_mean)
statistic <- estimate/std.error
p.value <- 2*pnorm(abs(statistic), lower.tail = FALSE)
})
The results look weird (dispersion < 1, all model results identical), but I'm assuming that's because you gave us a weird subset as a reproducible example ...
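As a rough sanity check on the "by hand" adjustment, the sketch below uses plain glm() on simulated overdispersed counts (no random effects, so it is only an analogy for the glmer case, and all data and names here are made up). It computes the Pearson dispersion the same way as quasi_table() and confirms that Poisson standard errors scaled by sqrt(phi) match those reported by family = quasipoisson.

```r
set.seed(2024)

# Simulated overdispersed counts (negative binomial), purely illustrative
n <- 500
x <- rnorm(n)
y <- rnbinom(n, mu = exp(0.5 + 0.3 * x), size = 1)

fit_pois  <- glm(y ~ x, family = poisson)
fit_quasi <- glm(y ~ x, family = quasipoisson)

# Pearson-based dispersion, as in the quasi_table() helper from the question
phi <- sum(residuals(fit_pois, type = "pearson")^2) / df.residual(fit_pois)

# Scale the Poisson standard errors by sqrt(phi)
se_adjusted <- sqrt(diag(vcov(fit_pois))) * sqrt(phi)
se_quasi    <- sqrt(diag(vcov(fit_quasi)))

print(phi)
print(cbind(se_adjusted, se_quasi))
print(all.equal(se_adjusted, se_quasi, tolerance = 1e-6))
```

Note that summary() of a quasipoisson fit uses t rather than z reference distributions, so p-values can differ slightly from the pnorm-based ones even when the standard errors agree.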

Is there a way to both include PCSE and Prais-Winsten correction in a fixed effects model in R (similar to the xtpcse function in Stata)?

I want to estimate a fixed effects model while using panel-corrected standard errors as well as Prais-Winsten (AR1) transformation in order to solve panel heteroscedasticity, contemporaneous spatial correlation and autocorrelation.
I have time-series cross-section data and want to perform regression analysis. I was able to estimate a fixed effects model, panel corrected standard errors and Prais-winsten estimates individually. And I was able to include panel corrected standard errors in a fixed effects model. But I want them all at once.
# Basic ols model
ols1 <- lm(y ~ x1 + x2, data = data)
summary(ols1)
# Fixed effects model
library('plm')
plm1 <- plm(y ~ x1 + x2, data = data, model = 'within')
summary(plm1)
# Panel Corrected Standard Errors
library(pcse)
lm.pcse1 <- pcse(ols1, groupN = Country, groupT = Time)
summary(lm.pcse1)
# Prais-Winsten estimates
library(prais)
prais1 <- prais_winsten(y ~ x1 + x2, data = data)
summary(prais1)
# Combination of Fixed effects and Panel Corrected Standard Errors
ols.fe <- lm(y ~ x1 + x2 + factor(Country) - 1, data = data)
pcse.fe <- pcse(ols.fe, groupN = Country, groupT = Time)
summary(pcse.fe)
With the Stata command xtpcse it is possible to combine panel-corrected standard errors with Prais-Winsten corrected estimates, using something along the lines of the following code:
xtpcse y x x x i.cc, c(ar1)
I would like to achieve this in R as well.
I am not sure that my answer will completely address your concern; I have been trying to deal with the same problem myself these days.
In my case, I ran the prais_winsten function from the package prais and included the fixed effects in the model. Afterwards, I corrected for heteroskedasticity using the function vcovHC.prais, which is analogous to the vcovHC function from the package sandwich.
This gives you White's (sandwich) heteroskedasticity-consistent covariance matrix. If you then pass it to the function coeftest from the package lmtest, you get the coefficient table with the corrected standard errors. Taking your posted example, see below the code that I have used:
# Prais-Winsten estimates with Fixed Effects
library(prais)
prais.fe <- prais_winsten(y ~ x1 + x2 + factor(Country), data = data)
library(lmtest)
prais.fe.w <- coeftest(prais.fe, vcov = vcovHC.prais(prais.fe, "HC1"))
prais.fe.w # print the object to see the output with the corrected standard errors.
Alas, I am aware that the sandwich heteroskedasticity-consistent standard errors are not exactly the same as Beck and Katz's PCSEs, because PCSEs deal with panel heteroskedasticity while sandwich SEs address overall heteroskedasticity. I am not sure how much the two differ in practice, but it is a start.
I hope my answer was somehow helpful, this is actually my very first answer :D

Estimating variance attributed to a fixed effect

Disregarding how "important" it is, I am interested in trying to estimate how much of the variance is attributed to a single fixed effect (it being a main effect, or interaction term).
As a quick thought, I imagined that fitting a linear model to the predicted values of the mixed model (without the random effect) and assessing the ANOVA table would provide an estimate (yes, the residual variance will then be zero, but we know(?) this from the mixed model). However, from playing around, apparently not.
Where is the flaw in my reasoning? Or did I do something wrong along the way? Is there an alternative method?
Disclaimer: I know some people have suggested looking at the change in residual variance when removing/adding fixed effects, but as this does not take into account the correlation between fixed and random effects, I am not interested in that approach.
data(Orthodont,package="nlme")
Orthodont = na.omit(Orthodont)
#Fitting a linear mixed model
library(lme4)
mod = lmer(distance ~ age*Sex + (1|Subject) , data=Orthodont)
# Predicting across all observed values,
pred.frame = expand.grid(age = seq(min(Orthodont$age, na.rm = T),max(Orthodont$age, na.rm=T)),
Sex = unique(Orthodont$Sex))
# But not including random effects
pred.frame$fit = predict(mod, newdata = pred.frame, re.form=NA)
anova(lm(fit~age*Sex, data = pred.frame))
library(data.table)
Orthodont = data.table(Orthodont)
# to test the validity of the approach
# by estimating a linear model using a random observation
# per individual and look at the means
tmp = sapply(1:500, function(x){
print(x)
as.matrix(anova(lm(distance~age*Sex, data =Orthodont[,.SD[sample(1:.N,1)],"Subject"])))[,2]
}
)
# These are clearly not similar
prop.table(as.table(rowMeans(tmp)[-4]))
age Sex age:Sex
0.60895615 0.31874622 0.07229763
> prop.table(as.table(anova(lm(fit~age*Sex, data = pred.frame))[1:3,2]))
A B C
0.52597575 0.44342996 0.03059429

How to include a year fixed effect (in a year-quarter panel data) in R using plm function?

Thank you all in advance for your help. My question is essentially a "bump" of the following question: R: plm -- year fixed effects -- year and quarter data.
Basically, I was wondering if there is any way, using the plm function in R, to include a fixed effect that is not at the same level as the data. For example, suppose you have the following data:
library(plm)
id <- c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2)
year <- c(1999,1999,1999,1999,2000,2000,2000,2000,1999,1999,1999,1999,2000,2000,2000,2000)
qtr <- c(1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4)
y <- rnorm(16, mean=0, sd=1)
x <- rnorm(16, mean=0, sd=1)
data <- data.frame(id=id,year=year,qtr=qtr,y_q=paste(year,qtr,sep="_"),y=y,x=x)
This is a panel data set, with the cross sectional unit marked as "id" and the time unit at the year-quarter level. However, I only want to actually include a fixed effect for year, I do not want to include a fixed effect for year-quarter. However, if you try running this regression,
reg1 <- plm(y ~ x, data=data,index=c("id", "year"), model="within",effect="time")
I get the following error:
Error in pdim.default(index[[1]], index[[2]]) : duplicate couples (time-id)
Now, to add to the post I previously linked, if you are using a fixed effects model, one way to get around this is to manually put in the fixed effects as a vector of dummy variables, and just use pooled cross section regression. For example,
reg1 <- plm(y ~ x + factor(id) + factor(year), data=data,index=c("id", "year"), model="pooling",effect="time")
If that works for you, then great! However, this solution does not work for me because I definitely need to use the plm function. The reason why is because I actually want to put in a year random effect, and I'm not sure how to do that "manually". Is there a work around for this using the plm function?
Thanks!
Vincent
You will need to make the combination of year and quarter the time dimension of your data set, i.e., use y_q as the second index variable.
This model:
reg_q <- plm(y ~ x, data=data, index=c("id", "y_q"), model="within", effect="time")
will take care of quarterly effects only.
This model:
reg_ind_year <- plm(y ~ x + factor(year), data=data, index=c("id", "y_q"), model="within", effect="individual")
will take care of individual effects and yearly effects (note the inclusion of factor(year) in the model). It does not take quarterly effects into account.
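The duplicate-couples error and the fix above can be reproduced in a small sketch. The data frame d below mirrors the question's layout (2 ids, 2 years, 4 quarters) with freshly simulated y and x, so the numbers themselves are arbitrary. The base R part shows why (id, year) cannot serve as the panel index while (id, y_q) can; the plm part runs only if plm is installed and compares the slope with an equivalent dummy-variable regression.

```r
set.seed(1)

# Same layout as in the question: 2 ids x 2 years x 4 quarters
d <- data.frame(
  id   = rep(1:2, each = 8),
  year = rep(rep(1999:2000, each = 4), times = 2),
  qtr  = rep(1:4, times = 4)
)
d$y_q <- paste(d$year, d$qtr, sep = "_")
d$y   <- rnorm(16)
d$x   <- rnorm(16)

# (id, year) pairs repeat once per quarter, hence the "duplicate couples" error
dup_id_year <- anyDuplicated(d[, c("id", "year")])
# (id, y_q) pairs are unique, so they form a valid panel index
dup_id_yq   <- anyDuplicated(d[, c("id", "y_q")])
print(c(dup_id_year = dup_id_year, dup_id_yq = dup_id_yq))

if (requireNamespace("plm", quietly = TRUE)) {
  # Individual fixed effects via plm, year effects via explicit dummies
  fit <- plm::plm(y ~ x + factor(year), data = d, index = c("id", "y_q"),
                  model = "within", effect = "individual")
  # Equivalent dummy-variable regression in base R
  lsdv <- lm(y ~ x + factor(year) + factor(id), data = d)
  print(all.equal(unname(coef(fit)["x"]), unname(coef(lsdv)["x"])))
}
```

This covers the fixed-effects case only; it does not address the year random effect that the question ultimately asks about.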

How to Code Selection for Bootstrap Probit Models in R

This question regards how to code variable selection in a probit model with marginal effects (either directly or by calling some pre-existing package).
I'm conducting a little probit regression of the effects of free and commercial availability of films on the level of piracy of those films as a TLAPD-related blog post.
The easy way of running a probit in R is typically through glm, i.e.:
probit <- glm(y ~ x1 + x2, data=data, family =binomial(link = "probit"))
but that's problematic for interpretation because it doesn't supply marginal effects.
Typically, if I want marginal effects from a probit regression I define this function (I don't recall the original source, but it's a popular function that gets re-posted a lot):
mfxboot <- function(modform,dist,data,boot=500,digits=3){
x <- glm(modform, family=binomial(link=dist),data)
# get marginal effects
pdf <- ifelse(dist=="probit",
mean(dnorm(predict(x, type = "link"))),
mean(dlogis(predict(x, type = "link"))))
marginal.effects <- pdf*coef(x)
# start bootstrap
bootvals <- matrix(rep(NA,boot*length(coef(x))), nrow=boot)
set.seed(1111)
for(i in 1:boot){
samp1 <- data[sample(1:dim(data)[1],replace=T,dim(data)[1]),]
x1 <- glm(modform, family=binomial(link=dist),samp1)
# use the bootstrap fit x1 (not the original fit x) for the scale factor
pdf1 <- ifelse(dist=="probit",
mean(dnorm(predict(x1, type = "link"))),
mean(dlogis(predict(x1, type = "link"))))
bootvals[i,] <- pdf1*coef(x1)
}
res <- cbind(marginal.effects,apply(bootvals,2,sd),marginal.effects/apply(bootvals,2,sd))
if(names(x$coefficients[1])=="(Intercept)"){
res1 <- res[2:nrow(res),]
res2 <- matrix(as.numeric(sprintf(paste("%.",paste(digits,"f",sep=""),sep=""),res1)),nrow=dim(res1)[1])
rownames(res2) <- rownames(res1)
} else {
res2 <- matrix(as.numeric(sprintf(paste("%.",paste(digits,"f",sep=""),sep=""),res)),nrow=dim(res)[1])
rownames(res2) <- rownames(res)
}
colnames(res2) <- c("marginal.effect","standard.error","z.ratio")
return(res2)
}
Then run the regression like this:
mfxboot(modform = "y ~ x1 + x2",
dist = "probit",
data = piracy)
but using that approach I don't know that I can run any variable selection algorithms like forward, backward, stepwise, etc.
What's the best way to solve this problem? Is there a better way of running probits in R that reports marginal effects and also allows for automated model selection? Or should I focus on using mfxboot and doing variable selection with that function?
Thanks!
It is not clear why there is a problem. Model (variable) selection and computing of the marginal effects for a given model are sequential, and there is no reason to try to combine the two.
Here is how you might go about computing marginal effects and their bootstrapped standard errors after model (variable) selection:
Perform variable selection using your preferred model selection procedure (including bootstrap model selection techniques as discussed below, not to be confused with the bootstrap you will use to compute the standard errors of the marginal effects for the final model).
Here is an example on the dataset supplied in this question. Note also that this is in no way an endorsement of the use of stepwise variable selection methods.
#================================================
# read in data, and perform variable selection for
# a probit model
#================================================
dfE = read.csv("ENAE_Probit.csv")
formE = emploi ~ genre +
filiere + satisfaction + competence + anglais
glmE = glm(formula = formE,
family = binomial(link = "probit"),
data = dfE)
# perform model (variable) selection
glmStepE = step(object = glmE)
Now pass the selected model to a function that computes the marginal effects.
#================================================
# function: compute marginal effects for logit and probit models
# NOTE: this assumes that an intercept has been included by default
#================================================
fnMargEffBin = function(objBinGLM) {
stopifnot(objBinGLM$family$family == "binomial")
vMargEff = switch(objBinGLM$family$link,
probit = colMeans(outer(dnorm(predict(objBinGLM,
type = "link")),
coef(objBinGLM))[, -1]),
logit = colMeans(outer(dlogis(predict(objBinGLM,
type = "link")),
coef(objBinGLM))[, -1])
)
return(vMargEff)
}
# test the function
fnMargEffBin(glmStepE)
Here is the output:
> fnMargEffBin(glmStepE)
genre filiere
0.06951617 0.04571239
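The formula used in fnMargEffBin (average density of the linear predictor times the coefficient) can be checked numerically. The sketch below uses simulated data with made-up names (d, x1, x2) and compares the analytic average marginal effect of x1 in a probit model with a finite-difference approximation of the change in predicted probability. It assumes continuous regressors; for dummy variables such as those in the example above, a discrete change in probability is often reported instead.

```r
set.seed(42)

# Simulated probit data, purely illustrative
n  <- 2000
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, pnorm(-0.3 + 0.8 * x1 - 0.5 * x2))
d  <- data.frame(y, x1, x2)

fit <- glm(y ~ x1 + x2, family = binomial(link = "probit"), data = d)

# Analytic average marginal effects: mean of dnorm(x'b) times each slope
ame <- mean(dnorm(predict(fit, type = "link"))) * coef(fit)[-1]

# Numerical check for x1: central finite difference of predicted probabilities
h    <- 1e-5
d_up <- transform(d, x1 = x1 + h)
d_dn <- transform(d, x1 = x1 - h)
ame_x1_numeric <- mean(predict(fit, newdata = d_up, type = "response") -
                       predict(fit, newdata = d_dn, type = "response")) / (2 * h)

print(ame)
print(all.equal(unname(ame["x1"]), ame_x1_numeric, tolerance = 1e-6))
```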
To get at the standard errors of the marginal effects, you could bootstrap the marginal effects, using, for example, the Boot function from the car package, since it provides a neat interface for bootstrapping statistics derived from glm estimates.
#================================================
# compute bootstrap std. err. for the marginal effects
#================================================
library(car)
margEffBootE = Boot(object = glmStepE, f = fnMargEffBin,
labels = names(coef(glmStepE))[-1], R = 100)
summary(margEffBootE)
Here is the output:
> summary(margEffBootE)
R original bootBias bootSE bootMed
genre 100 0.069516 0.0049706 0.045032 0.065125
filiere 100 0.045712 0.0013197 0.011714 0.048900
Appendix:
As a matter of theoretical interest, there are two ways to interpret your bootstrapped variable selection ask.
You can perform model selection (variable selection) by using a bootstrap model fit criterion as the measure of fit. The theory for this is outlined in Shao (1996), and requires a subsampling approach.
You then compute marginal effects and their bootstrap standard errors conditional on the best model selected above.
You can perform variable selection on multiple bootstrap samples, and arrive at either one best model by looking at the variables retained across the multiple bootstrap model selections, or use a model averaging estimator. The theory for this is discussed in Austin and Tu (2004).
You then compute marginal effects and their bootstrap standard errors conditional on the best model selected above.
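The second interpretation (variable selection repeated across bootstrap samples) can be sketched roughly as follows. This is only an illustration on simulated data with made-up variable names, it uses AIC-based step() as the selection procedure, and it is not a faithful implementation of any specific procedure from Austin and Tu (2004). It records how often each candidate variable survives selection across the bootstrap samples.

```r
set.seed(123)

# Simulated data: only x1 truly affects the outcome
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- rbinom(n, 1, pnorm(0.2 + 1.0 * d$x1))

vars <- c("x1", "x2", "x3")
B    <- 50                                   # number of bootstrap samples
kept <- matrix(FALSE, nrow = B, ncol = length(vars),
               dimnames = list(NULL, vars))

for (b in seq_len(B)) {
  # Resample rows with replacement and refit the full probit model
  samp <- d[sample(nrow(d), replace = TRUE), ]
  full <- glm(y ~ x1 + x2 + x3, family = binomial(link = "probit"),
              data = samp)
  # Backward selection by AIC on this bootstrap sample
  sel <- step(full, trace = 0)
  kept[b, ] <- vars %in% attr(terms(sel), "term.labels")
}

# Share of bootstrap samples in which each variable was retained
incl_freq <- colMeans(kept)
print(incl_freq)
```

Variables with a high inclusion frequency would be candidates for the final model, for which the marginal effects and their bootstrap standard errors are then computed as described above; the cut-off is a judgment call.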

Resources