Multicollinearity in a two-way fixed effects regression - r

I have a data frame based on a national survey conducted every two years; the time period is 2010-14, and I filtered the data frame to keep only people who appear at least twice. This gives me an unbalanced panel.
I ran a regression to study which variables influence participation in a complementary pension scheme (it is voluntary in my country). I ran a one-way fixed effects regression and now I want to run a two-way fixed effects regression (both individual and time).
The individual variable is uid and the time variable is year. I used the plm package in R:
df.p <- plm.data(df, c("uid", "year"))
and run the regression:
reg1 <- plm(pens ~ woman + age + I(age^2/100) + high + medium + nord + centre, model="within", effect="twoways", data=df.p)
where high and medium are dummies regarding the education level and nord and centre regard geographic location. For the sake of simplicity I omitted other variables that are present in the original model (20 variables).
After at least an hour of computation I ran the summary command:
summary(reg1)
After another hour of computation I got the error:
Error in crossprod(t(X), beta) : non-conformable arguments
so I supposed there was a multicollinearity problem. I checked for multicollinearity with the correlation matrix:
p1 <- with(df, data.frame(woman=woman, age=age, high=high, medium=medium, nord=nord, centre=centre))
round(cor(p1),3)
Note that I created the matrix using all the variables (omitted here for simplicity, as noted above). I didn't find any large correlations. I also checked the variance inflation factor:
vif(p1)
and I got:
No variable from the 20 input variables has collinearity problem.
At this point I suppose the collinearity problem could be caused by the fact that I am running a two-way regression, but I don't know how to handle the problem.
Thanks in advance.
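One possible source of the problem, offered as a sketch rather than a diagnosis of the actual data: in a two-way within model, age is an exact linear combination of the individual and year effects (age = year - birth year). That kind of collinearity involves the fixed effects themselves, so a correlation matrix or VIF computed on the regressors alone would not reveal it. The toy panel below uses made-up values and base R only; it builds the dummy-variable form of a two-way model and shows that the design matrix loses one rank once age is included.

```r
# Hypothetical toy panel: 4 people observed in 2010, 2012 and 2014
toy <- expand.grid(uid = 1:4, year = c(2010, 2012, 2014))
birth <- c(1950, 1962, 1975, 1980)          # made-up birth years
toy$age <- toy$year - birth[toy$uid]

# Design matrix of the dummy-variable (LSDV) form of a two-way model
X <- model.matrix(~ age + factor(uid) + factor(year), data = toy)

rank_X <- qr(X)$rank
n_col  <- ncol(X)
rank_X < n_col   # age lies in the span of the uid and year dummies
```

Regressors that never change within a person (woman is the likely candidate here) are absorbed by the individual effects in the same way in any within model.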

Related

Mixed effect model or multiple regressions comparison in nested setup

I have a response Y that is a proportion between 0 and 1. My data is nested by taxonomy or evolutionary relationship, say phylum/genus/family/species, and I have one continuous covariate temp and one categorical covariate fac with levels fac1 & fac2.
I am interested in estimating:
is there a difference in Y between fac1 and fac2 (intercept), and how much variance is explained by that
does each level of fac respond differently to temp (linearly, so slope)
is there a difference in Y for each level of my taxonomy, and how much variance is explained by those (see varcomp)
does each level of my taxonomy respond differently to temp (linearly, so slope)
A brute-force idea would be to split my data at the lowest taxonomic level, here species, and fit a linear beta regression for each species i as betareg(Y(i)~temp). Then I would extract the slopes and intercepts for each species, group them to a higher taxonomic level per fac, and compare the distribution of slopes (or intercepts), say via Kullback-Leibler divergence, to the distribution I get when bootstrapping my Y values. Or I could compare the distribution of slopes (or intercepts) just between taxonomic levels or between the levels of my factor fac. Or I could just compare mean slopes and intercepts between taxonomic levels or factor levels.
I am not sure if this is a good idea. I am also not sure how to answer the question of how much variance is explained by each taxonomic level, as in nested random mixed effect models.
Another option may be mixed models, but how can I include all the aspects I want to test in one model?
For example, I could use the "gamlss" package:
library(gamlss)
model<-gamlss(Y~temp*fac+re(random=~1|phylum/genus/family/species),family=BE)
But here I see no way to incorporate a random slope. Or can I do:
model<-gamlss(Y~re(random=~temp*fac|phylum/genus/family/species),family=BE)
but the internal call to lme has some trouble with that, and I guess this is not the right notation anyway.
Is there any way to achieve what I want to test, not necessarily with gamlss but with any other package that supports nested structures and beta regressions?
Thanks!
In glmmTMB, if you have no exact 0 or 1 values in your response, something like this should work:
library(glmmTMB)
glmmTMB(Y ~ temp*fac + (1 + temp | phylum/genus/family/species),
data = ...,
family = beta_family)
If you have zero (or one) values, you will need to do something about them. For example, you can add a zero-inflation term in glmmTMB; brms can handle zero-one-inflated Beta responses; or you can "squeeze" the 0/1 values in a little bit (see the appendix of Smithson and Verkuilen's paper on Beta regression). If you have only a few 0/1 values it won't matter very much what you do. If you have a lot, you'll need to spend some serious time thinking about what they mean, which will influence how you handle them. Do they represent censoring (i.e. values that aren't exactly 0/1 but are too close to the borders to measure the difference)? Are they a qualitatively different response? etc.
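As a minimal sketch of the "squeeze" option, the transformation usually attributed to Smithson and Verkuilen is y' = (y(n - 1) + 0.5)/n, with n the sample size. The helper name squeeze01 and the example values below are made up for illustration:

```r
# Squeeze proportions from [0, 1] into the open interval (0, 1)
squeeze01 <- function(y) {
  n <- length(y)                 # sample size
  (y * (n - 1) + 0.5) / n
}

y_raw <- c(0, 0.12, 0.5, 0.97, 1)   # hypothetical responses with exact 0 and 1
y_sq  <- squeeze01(y_raw)
range(y_sq)                          # strictly inside (0, 1)
```

The amount of shrinkage depends on n, so with few observations the boundary values move noticeably; this is one more reason to think about what the 0/1 values mean before choosing this route.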
As I said in my comment, computing variance components for GLMMs is pretty tricky - there's not necessarily an easy decomposition, e.g. see here. However, you can compute the variances of intercept and slope at each taxonomic level and compare them (and you can use the standard deviations to compare with the magnitudes of the fixed effects ...)
The model given here might be pretty demanding, depending on the size of your phylogeny - for example, you might not have enough replication at the phylum level (in which case you could fit the model ~ temp*(fac + phylum) + (1 + temp | phylum:(genus/family/species)), i.e. pull out the phylum effects as fixed effects).
This is assuming that you're willing to assume that the effects of fac, and its interaction with temp, do not vary across the phylogeny ...

How to run a generalized linear mixed model (GLMM) with multiple random factors?

I would like to run a GLMM with multiple random factors using the function glmer in package lme4.
I have a dataset on marine debris like this:
count density: numeric
year: categorical, two levels
round: categorical (each year has its own six rounds, so round is nested in year)
monitoring site: categorical (data is measured on each monitoring site 6 times a year, so round is crossed with monitoring site)
waters: categorical (each waters has several different sites, so monitoring site is nested in waters)
material: categorical
I would like to know if the count densities of marine debris are significantly different between/among years, rounds, waters and materials. So I entered this:
glmm <- glmer(count density~material*(1|year/round)*(1|waters/monitoring sites),
family=Poisson)
Could you please let me know if my formula is right?
And I can get nothing from the model, as I typed in:
glmm
It said:
Error: object 'glmm' not found
So what's the right way to use glmer?
At the very least (if your variable names really have spaces in them, which is generally a bad idea, see e.g. this question) you should try:
glmm <- glmer(`count density` ~ material+(1|year/round)+
(1|waters/`monitoring sites`),
family=poisson)
Also note that year won't work well as a random effect because it only has two levels (it's hard to estimate a variance from only two observations: see e.g. these simulations), so maybe
glmm <- glmer(`count density` ~ material+year+(1|year:round)+
(1|waters/`monitoring sites`),
family=poisson)
would be better.
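The second formula relies on the fact that lme4 expands (1|year/round) to (1|year) + (1|year:round); moving year into the fixed part leaves only the year:round grouping as random. A base-R sketch with made-up labels shows what that grouping factor looks like when each year has its own six rounds but the rounds carry the same labels in both years:

```r
# Hypothetical design: 2 years, 6 rounds per year, rounds labelled 1..6 in both years
design <- expand.grid(round = factor(1:6), year = factor(c("Y1", "Y2")))

# year:round treats round 1 of Y1 and round 1 of Y2 as different groups
yr_round <- interaction(design$year, design$round, drop = TRUE)

nlevels(design$round)   # number of round labels
nlevels(yr_round)       # number of distinct year-by-round groups
```

If the rounds were already labelled uniquely across years, (1|round) and (1|year:round) would define the same groups.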

How to convert Afex or car ANOVA models to lmer? Observed variables

In the afex package we can find this example of ANOVA analysis:
data(obk.long, package = "afex")
# estimate mixed ANOVA on the full design:
# can be written in any of these ways:
aov_car(value ~ treatment * gender + Error(id/(phase*hour)), data = obk.long,
observed = "gender")
aov_4(value ~ treatment * gender + (phase*hour|id), data = obk.long,
observed = "gender")
aov_ez("id", "value", obk.long, between = c("treatment", "gender"),
within = c("phase", "hour"), observed = "gender")
My question is, How can I write the same model in lme4?
In particular, I don't know how to include the "observed" term?
If I just write
lmer(value ~ treatment * gender + (phase*hour|id), data = obk.long,
observed = "gender")
I get an error telling that observed is not a valid option.
Furthermore, if I just remove the observed option lmer produces the error:
Error: number of observations (=240) <= number of random effects (=240) for term (phase * hour | id); the random-effects parameters and the residual variance (or scale parameter) are probably unidentifiable.
Where in the lmer syntax do I specify the "between" or "within" variables? As far as I know you just write the dependent variable on the left side and all other variables on the right side, and the error term as (1|id).
The package "car" uses the idata for the intra-subject variable.
I might not know enough about classical ANOVA theory to answer this question completely, but I'll take a crack. First, a couple of points:
the observed argument appears only to be relevant for the computation of effect size.
observed: ‘character’ vector indicating which of the variables are
observed (i.e, measured) as compared to experimentally
manipulated. The default effect size reported (generalized
eta-squared) requires correct specification of the obsered [sic]
(in contrast to manipulated) variables.
... so I think you'd be safe leaving it out.
if you want to override the error you can use
control=lmerControl(check.nobs.vs.nRE="ignore")
... but this probably isn't the right way forward.
I think but am not sure that this is the right way:
m1 <- lmer(value ~ treatment * gender + (1|id/phase:hour), data = obk.long,
control=lmerControl(check.nobs.vs.nRE="ignore",
check.nobs.vs.nlev="ignore"),
contrasts=list(treatment=contr.sum,gender=contr.sum))
This specifies that the interaction of phase and hour varies within id. The residual variance and (phase by hour within id) variance are confounded (which is why we need the overriding lmerControl() specification), so don't trust those particular variance estimates. However, the main effects of treatment and gender should be handled just the same. If you load lmerTest instead of lme4 and run summary(m1) or anova(m1) it gives you the same degrees of freedom (10) for the fixed (gender and treatment) effects that are computed by afex.
lme gives comparable answers, but needs to have the phase-by-hour interaction constructed beforehand:
library(nlme)
obk.long$ph <- with(obk.long,interaction(phase,hour))
m2 <- lme(value ~ treatment * gender,
random=~1|id/ph, data = obk.long,
contrasts=list(treatment=contr.sum,gender=contr.sum))
anova(m2,type="marginal")
I don't know how to reconstruct afex's tests of the random effects.
As Ben Bolker correctly says, simply leave observed out.
Furthermore, I would not recommend doing what you want to do. Using a mixed model for a data set without replications within each cell of the design per participant is somewhat questionable, as it is not really clear how to specify the random effects structure. Importantly, the Barr et al. maxim of "keep it maximal" does not work here, as you realized. The problem is that the model is overparametrized (hence the error from lmer).
I recommend using the ANOVA. More discussion on exactly this question can be found in a Cross Validated thread where Ben and I discussed this more thoroughly.

How to get individual coefficients and residuals in panel data using fixed effects

I have panel data including income for individuals over years, and I am interested in the income trends of individuals, i.e. individual coefficients for income over years, and residuals for each individual for each year (the unexpected changes in income according to my model). However, I have a lot of observations with missing income data for one or more years, so with a linear regression I lose the majority of my observations. The data structure is like this:
caseid<-c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4)
years<-c(1998,2000,2002,2004,2006,2008,1998,2000,2002,2004,2006,2008,
1998,2000,2002,2004,2006,2008,1998,2000,2002,2004,2006,2008)
income<-c(1100,NA,NA,NA,NA,1300,1500,1900,2000,NA,2200,NA,
NA,NA,NA,NA,NA,NA, 2300,2500,2000,1800,NA, 1900)
df<-data.frame(caseid, years, income)
I decided to use a random effects model, which I think will still predict income for missing years by using a maximum likelihood approach. However, since the Hausman test gives a significant result, I decided to use a fixed effects model instead. I ran the code below, using the plm package:
inc.fe<-plm(income~years, data=df, model="within", effect="individual")
However, I get coefficients only for years and not for individuals; and I cannot get residuals.
To maybe give an idea, the code in Stata should be
xtset caseid
xtreg income year, fe
predict resid, e
Then I tried to run the pvcm function from the same library, which is a function for variable coefficients.
inc.wi<-pvcm(Income~Year, data=ldf, model="within", effect="individual")
However, I get the following error message:
"Error in FUN(X[[i]], ...) : insufficient number of observations".
How can I get individual coefficients and residuals with pvcm by resolving this error or by using some other function?
My original long form data has 202976 observations and 15 years.
Does the fixef function from package plm give you what you are looking for?
Continuing your example:
fixef(inc.fe)
Residuals are extracted by:
residuals(inc.fe)
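For intuition about what these two calls return, the within estimator can be reproduced in base R by including one dummy per individual (the LSDV form). This is a sketch on the example data from the question, not plm's internal code; the names d, id, lsdv and alpha are introduced here for illustration. The dummy coefficients play the role of the individual intercepts, and residuals exist only for rows with non-missing income.

```r
caseid <- rep(1:4, each = 6)
years  <- rep(seq(1998, 2008, by = 2), times = 4)
income <- c(1100, NA, NA, NA, NA, 1300, 1500, 1900, 2000, NA, 2200, NA,
            NA, NA, NA, NA, NA, NA, 2300, 2500, 2000, 1800, NA, 1900)
df <- data.frame(caseid, years, income)

d <- df[!is.na(df$income), ]          # individual 3 drops out entirely
d$id <- factor(d$caseid)

# LSDV: one intercept per individual plus a common slope on years
lsdv <- lm(income ~ 0 + id + years, data = d)
slope_lsdv <- unname(coef(lsdv)["years"])

# Same slope from the within (demeaning) transformation
y_dm <- d$income - ave(d$income, d$id)
x_dm <- d$years  - ave(d$years,  d$id)
slope_within <- sum(x_dm * y_dm) / sum(x_dm^2)

# Individual intercepts and residuals
alpha <- tapply(d$income, d$id, mean) - slope_within * tapply(d$years, d$id, mean)
res   <- resid(lsdv)
```

Note that this still gives one common slope for everyone; individual-specific slopes would need a separate regression per individual (or a varying-coefficients model), which is only feasible for individuals with enough non-missing years.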
You have a random effects model with random slopes and intercepts. This is also known as a random coefficients regression model. The missingness is the tricky part, which (I'm guessing) you'll have to write custom code to solve after you choose how you wish to do so.
But you haven't clearly/properly specified your model (at least in your question) as far as I can tell. Let's define some terms:
Let Y_it = income for individual i (i = 1,..., N) in year t (t = 1,...,T). As I read your question, you have not specified which of the two models below you wish to fit:
M1: random intercepts, global slope, random slopes
Y_it ~ N(\mu_i + B T + \gamma_i I T, \sigma^2)
\mu_i ~ N(\phi_0, \tau_0^2)
\gamma_i ~ N(\phi_1, \tau_1^2)
M2: random intercepts, random slopes
Y_it ~ N(\mu_i + \gamma_i I T, \sigma^2)
\mu_i ~ N(\phi_0, \tau_0^2)
\gamma_i ~ N(\phi_1, \tau_1^2)
Also, your example data is nonsensical (see below). As you can see, you don't have enough observations to estimate all parameters. I'm not familiar with library(plm) but the above models (without missingness) can be estimated in lme4 easily. Without a realistic example dataset, I won't bother providing code.
R> table(df$caseid, is.na(df$income))
    FALSE TRUE
  1     2    4
  2     4    2
  3     0    6
  4     5    1
Given that you do have missingness, you should be able to produce estimates for either hierarchical model via the typical methods, such as EM. But I do think you'll have to write the code to do the estimation yourself.

Calculating marginal effects using mlogit in R

I have asked this question on Cross Validated, but I think I might not get help there, as this is more of a programming question than one about theory or interpretation of the statistics.
I am trying to use the mlogit package in R and have been following the vignette trying to figure out how to get the marginal effects for my data. The example provided uses continuous variables, but I am wondering how to do this with categorical explanatory variables.
I have a value of risk which is continuous as a covariate, but I also have age, class, and gender as covariates. I want to see the marginal effects of "females" only or of "Young - females" in regard to risk. How would I do this?
The help documents say:
z <- with(Fish, data.frame(price = tapply(price, index(m)$alt, mean),
catch = tapply(catch, index(m)$alt, mean),
income = mean(income)))
# compute the marginal effects (the second one is an elasticity)
effects(m, covariate = "income", data = z)
effects(m, covariate = "price", type = "rr", data = z)
effects(m, covariate = "catch", type = "ar", data = z)
I'm not sure how to manipulate the z data frame to get the mean risk for females or young females to then be able to calculate the marginal effects. Would I do them all separately? Do I somehow divide the data frame by age class (say I have just 2 age classes: young and old) so that I have 1 data frame for the young, and a separate new data frame for old, then calculate mean risk?
What I am hoping to get from my own data is to be able to interpret the magnitude of the likelihood in producing my categories of offspring. As an example, what I want to say is that if there is a 1 unit increase in risk, it is 10% more likely for older females to produce 2 offspring. As there is a 1 unit increase in risk, younger females are 15% more likely to produce 2 offspring.
I am not sure how to calculate the marginal effects by hand, and therefore am confused as to how to get a package to do it for me. I've also been trying the nnet and VGAM libraries, but neither of these seems to give a great deal of help either.
I sort of got an answer. I am not sure if it's the best, but it worked. The covariate I was interested in happened to have just 2 classes, which means I could turn it into a binary 0/1 numeric variable. Therefore, when I reran the code, I could calculate the mean for this "categorical" variable.
However, I think that I am missing the point with this marginal effect and am likely using it or trying to interpret it incorrectly.
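For reference, in a multinomial logit with an individual-specific covariate x, the marginal effect of x on the probability of alternative j is P_j (beta_j - sum_k P_k beta_k), evaluated at a chosen covariate profile. The sketch below computes this by hand in base R and checks it against a finite difference. The coefficient values and the profile are made up for illustration and do not come from any fitted model or from mlogit itself.

```r
# Hypothetical coefficients for 3 offspring categories (category 1 = reference)
b0     <- c(0,  0.5, -0.2)   # intercepts
b_risk <- c(0,  0.8,  0.3)   # effect of risk on each category's utility

# Choice probabilities at a given value of risk
probs <- function(risk) {
  eta <- b0 + b_risk * risk
  exp(eta) / sum(exp(eta))
}

risk0 <- 1.5                            # profile at which the effect is evaluated
p  <- probs(risk0)
me <- p * (b_risk - sum(p * b_risk))    # dP_j / d risk for each category

# Finite-difference check of the analytic formula
h <- 1e-6
me_fd <- (probs(risk0 + h) - probs(risk0 - h)) / (2 * h)
```

To get an effect for a subgroup such as young females, the same calculation would be evaluated at a profile with the corresponding dummy values switched on. The result is a change in probability per unit of risk (percentage points), which is not the same thing as "10% more likely".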
