Performing VIF Test in R

I started using RStudio a few days ago and I am struggling a bit to compute a VIF. Here is the situation:
I have panel data and ran fixed effects and random effects regressions. I have one dependent variable (New_biz_density) and two independent variables (Cost_to_start, Capital_requirements). I would like to check whether my two independent variables present multicollinearity by computing their variance inflation factors (VIFs), for both the fixed effects and the random effects model.
I already installed some packages that provide VIF functions (faraway, car) but did not manage to make it work. Does anybody know how to do it?
Here is my script:
# install.packages("plm")
library(plm)
mydata <- read.csv("/Users/juliantabone/Downloads/DATAweakoutliers.csv")
attach(mydata)  # make the columns below visible by name
Y <- cbind(new_biz_density)
X <- cbind(capital_requirements, cost_to_start)
# Set data as panel data (plm.data() is deprecated; pdata.frame() replaces it)
pdata <- pdata.frame(mydata, index = c("country_code", "year"))
# Descriptive statistics
summary(Y)
summary(X)
# Pooled OLS estimator
pooling <- plm(Y ~ X, data=pdata, model= "pooling")
summary(pooling)
# Between estimator
between <- plm(Y ~ X, data=pdata, model= "between")
summary(between)
# First differences estimator
firstdiff <- plm(Y ~ X, data=pdata, model= "fd")
summary(firstdiff)
# Fixed effects or within estimator
fixed <- plm(Y ~ X, data=pdata, model= "within")
summary(fixed)
# Random effects estimator
random <- plm(Y ~ X, data=pdata, model= "random")
summary(random)
# LM test for random effects versus OLS
plmtest(pooling)
# F test for fixed effects versus OLS
pFtest(fixed, pooling)
# Hausman test for fixed versus random effects model
phtest(random, fixed)

There seem to be two popular ways of calculating VIFs (Variance Inflation Factors, to detect collinearity among variables in regression) in R:
The vif() function in the car package, whose input is a fitted model. This requires you to fit a model before you can check the VIFs of the variables in it.
The corvif() function, whose input is the set of candidate explanatory variables themselves (i.e. a list of variables, before the model is even fitted). This function is part of the AED package (Zuur et al. 2009), which has been discontinued; it works only on a list of variables, not on a fitted regression model.
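For the first approach applied to the question's data, here is a minimal sketch, assuming the lower-case column names from the script above. VIFs depend only on the explanatory variables, so a common workaround for panel models is to compute them from a pooled OLS fit with lm():
library(car)
# Pooled OLS fit used only to extract the VIFs
pooled_lm <- lm(new_biz_density ~ cost_to_start + capital_requirements,
                data = mydata)
vif(pooled_lm)
With only two predictors, both VIFs equal 1/(1 - r^2), where r is the correlation between them, so cor(mydata$cost_to_start, mydata$capital_requirements) conveys the same information.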

To compute VIFs, another option is the non_collinear_vars() function from the metan package.
To compute the VIF between the two variables Cost_to_start and Capital_requirements, use:
library(metan)
non_collinear_vars(X)
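A possible call with the question's data, assuming X is built as a data frame rather than via cbind(), so that metan can report the variables by name (column names taken from the script above):
library(metan)
X <- mydata[, c("cost_to_start", "capital_requirements")]
non_collinear_vars(X)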

Related

How to find the variance of fixed effects of feols regression?

I am wondering how to easily find the variation within the fixed effects used in a feols regression. In the code below, the fixed effect is Type, fit on the built-in CO2 data frame. How do I obtain the variance within each of the fixed effects?
library(fixest)
library(car)
library(pander)
##Using the built-in CO2 data frame, run regression
i<- feols(conc ~ uptake + Treatment | Type, CO2, vcov = "hetero")
summary(i)
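No answer is quoted here, but a minimal sketch of one way to get this, assuming the model above: fixest's fixef() extracts the estimated fixed-effect coefficients as a named list with one vector per fixed-effect dimension, and var() then summarises their spread across levels.
library(fixest)
i <- feols(conc ~ uptake + Treatment | Type, CO2, vcov = "hetero")
fe <- fixef(i)    # one vector of estimated effects per fixed-effect dimension
sapply(fe, var)   # variance of those estimates within each dimension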

Longitudinal analysis using sampling weights in R

I have longitudinal data from two surveys and I want to do a pre-post analysis. Normally, I would use survey::svyglm() or svyVGAM::svy_vglm() (for the multinomial family) to include sampling weights, but these functions don't account for random effects. On the other hand, lme4::lmer() accounts for the repeated measures, but not for the sampling weights.
For continuous outcomes, I understand that I can do
w_data_wide <- svydesign(ids = ~1, data = data_wide, weights = data_wide$weight)
svyglm((post-pre) ~ group, w_data_wide)
and get the same estimates that I would get if I could use lmer(outcome ~ group*time + (1|id), data_long) with weights [please correct me if I'm wrong].
However, for categorical variables, I don't know how to do the analyses. WeMix::mix() has a weights parameter, but I'm not sure it treats them as sampling weights, and in any case it doesn't support the multinomial family.
So, to sum up: can you enlighten me on how to do a pre-post analysis of categorical outcomes with 2 or more levels? Any tips about packages/functions in R and how to use/write them would be appreciated.
I give below some data sets with binomial and multinomial outcomes:
library(data.table)
set.seed(1)
data_long <- data.table(
  id = rep(1:5, 2),
  time = c(rep("Pre", 5), rep("Post", 5)),
  outcome1 = sample(c("Yes", "No"), 10, replace = TRUE),
  outcome2 = sample(c("Low", "Medium", "High"), 10, replace = TRUE),
  outcome3 = rnorm(10),
  group = rep(sample(c("Man", "Woman"), 5, replace = TRUE), 2),
  weight = rep(c(1, 0.5, 1.5, 0.75, 1.25), 2)
)
data_wide <- dcast(data_long, id~time, value.var = c('outcome1','outcome2','outcome3','group','weight'))[, `:=` (weight_Post = NULL, group_Post = NULL)]
EDIT
As I said below in the comments, I've been using lmer and glmer with the variables used to calculate the weights as predictors. It happens that glmer returns a lot of problems (convergence, high eigenvalues...), so I took another look at @ThomasLumley's answer in this post and others (https://stat.ethz.ch/pipermail/r-help/2012-June/315529.html | https://stats.stackexchange.com/questions/89204/fitting-multilevel-models-to-complex-survey-data-in-r).
So, my question now is whether I can use participant id as clusters in svydesign:
library(survey)
w_data_long_cluster <- svydesign(ids = ~id, data = data_long, weights = data_long$weight)
summary(svyglm(factor(outcome1) ~ group*time, w_data_long_cluster, family="quasibinomial"))
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.875e+01 1.000e+00 18.746 0.0339 *
groupWoman -1.903e+01 1.536e+00 -12.394 0.0513 .
timePre 5.443e-09 5.443e-09 1.000 0.5000
groupWoman:timePre 2.877e-01 1.143e+00 0.252 0.8431
and still interpret groupWoman:timePre as differences in the average rate of change/improvement in the outcome over time between sex groups, as if I was using mixed models with participants as random effects.
Thank you once again!
A linear model with svyglm does not give the same parameter estimates as lme4::lmer. It does estimate the same parameters as lme4::lmer if the model is correctly specified, though.
Generalised linear models with svyglm or svy_vglm don't estimate the same parameters as lme4::glmer, as you note. However, they do estimate perfectly good regression parameters, and if you aren't specifically interested in the variance components or in estimating the realised random effects (BLUPs) I would recommend just using svyglm or svy_vglm.
Another option if you have non-survey software for random effects versions of the models is to use that. If you scale the weights to sum to the sample size and if all the clustering in the design is modelled by random effects in the model, you will get at least a reasonable approximation to valid inference. That's what I've seen recommended for Bayesian survey modelling, for example.
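A minimal sketch of that last suggestion, assuming the data_long object from the question and its continuous outcome3, with lme4's lmer standing in for whatever random-effects software is available:
library(lme4)
# Scale the sampling weights so they sum to the sample size
data_long$w_scaled <- data_long$weight * nrow(data_long) / sum(data_long$weight)
# Random-intercept model for participants, using the scaled weights
fit <- lmer(outcome3 ~ group * time + (1 | id),
            data = data_long, weights = w_scaled)
summary(fit)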

glm model with normal distribution and log link

In a university assignment I have a negative intercept in my lm regression, which makes my model hard to interpret. My supervisor advised me to try a glm with a normal distribution and a log link in R, to force my model to have a non-negative intercept.
I have tried the following two ways with the glm and gamlss functions:
model1 <- glm(y ~ x, family = gaussian(link = "log"))
model2 <- gamlss(y ~ x, family=NO, data=df, mu.link= "log")
Both of them keep giving negative intercepts. Does anyone have an idea what I do wrong?
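No answer is quoted here, but one likely explanation, offered as a hedged aside: with a log link the intercept is on the log scale, so a negative intercept is expected whenever the baseline mean is below 1, and the implied fitted baseline, exp(intercept), is still positive. A minimal illustration with simulated data:
set.seed(1)
x <- runif(100)
y <- exp(-0.5 + 0.3 * x) + rnorm(100, sd = 0.05)  # strictly positive response
model1 <- glm(y ~ x, family = gaussian(link = "log"))
coef(model1)[1]       # intercept on the log scale; negative here
exp(coef(model1)[1])  # implied baseline for y, always positive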

Does the metafor package report adjusted R^2 or just normal R^2?

I am conducting a meta-analysis using the metafor package in R. When comparing models with the anova() function, I can see an R^2 value.
fit1 <- rma(data = dat, yi, vi, mods = ~ fix1)
fit2 <- rma(data = dat, yi, vi, mods = ~ fix1 + fix2)
anova(fit1, fit2)
Is this adjusted R-square or just a normal R-square?
Thank you.
Please read the documentation provided in the metafor package.
https://cran.r-project.org/web/packages/metafor/metafor.pdf
For the function anova.rma, the R2 component is documented as: "amount of (residual) heterogeneity in the reduced model that is accounted for in the full model (in percent). NA for fixed-effects models, if the amount of heterogeneity in the reduced model is equal to zero, or for 'rma.mv' objects. This can be regarded as a pseudo R^2 statistic (Raudenbush, 2009). Note that the value may not be very accurate unless k is large (Lopez-Lopez et al., 2014)."
So it is a pseudo R^2, the proportional reduction in residual heterogeneity, rather than an adjusted R^2.
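A minimal sketch of what that pseudo R^2 measures, using metafor's bundled dat.bcg data (an assumption, since the question's dat, fix1, and fix2 are not shown in full):
library(metafor)
dat <- escalc(measure = "RR", ai = tpos, bi = tneg, ci = cpos, di = cneg,
              data = dat.bcg)
# Fit with ML (not the default REML) so the full and reduced models are
# comparable with anova()
fit0 <- rma(yi, vi, data = dat, method = "ML")
fit1 <- rma(yi, vi, mods = ~ ablat, data = dat, method = "ML")
anova(fit0, fit1)                    # reports R^2 among other statistics
(fit0$tau2 - fit1$tau2) / fit0$tau2  # the same pseudo R^2, computed by hand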

Stata's xtlogit (fe, re) equivalent in R?

Stata allows for fixed-effects and random-effects specifications of the logistic regression through the xtlogit, fe and xtlogit, re commands respectively. I was wondering what the equivalent commands for these specifications are in R.
The only similar specification I am aware of is the mixed effects logistic regression
mymixedlogit <- glmer(y ~ x1 + x2 + x3 + (1 | x4), data = d, family = binomial)
but I am not sure whether this maps to any of the aforementioned commands.
The glmer() function from the lme4 package is used to quickly fit logistic regression models with varying intercepts and varying slopes (or, equivalently, a mixed model with fixed and random effects).
To fit a varying intercept multilevel logistic regression model in R (that is, a random effects logistic regression model), you can run the following using the in-built "mtcars" data set:
library(lme4)
data(mtcars)
head(mtcars)
m <- glmer(mtcars$am ~ 1 + mtcars$wt + (1|mtcars$gear), family="binomial")
summary(m)
# and you can examine the fixed and random effects
fixef(m); ranef(m)
To fit the analogous varying-intercept model in Stata, you of course use the xtlogit command (using the similar, but not identical, in-built "auto" data set in Stata):
sysuse auto
xtset gear_ratio
xtlogit foreign weight, re
I'll add that I find the entire reference to "fixed" versus "random" effects ambiguous, and I prefer to refer to the structure of the model itself (e.g., are the intercepts varying? which slopes are varying, if any? is the model nested in 2 levels or more? are the levels cross-classified or not?). For a similar view, see Andrew Gelman's thoughts on "fixed" versus "random" effects.
Update: Ben Bolker's excellent comment below points out that in R it's more informative when using predict commands to use the data=mtcars option instead of, say, the dollar notation:
library(lme4)
data(mtcars)
m1 <- glmer(mtcars$am ~ 1 + mtcars$wt + (1|mtcars$gear), family="binomial")
m2 <- glmer(am ~ 1 + wt + (1|gear), family="binomial", data=mtcars)
p1 <- predict(m1); p2 <- predict(m2)
names(p1) # not that informative...
names(p2) # very informative!
