PLM in R with time invariant variable - r

I am trying to analyze a panel data which includes observations for each US state collected across 45 years.
I have two predictor variables that vary across time (A,B) and one that does not vary (C). I am especially interested in knowing the effect of C on the dependent variable Y, while controlling for A and B, and for the differences across states and time.
This is the model that I have, using plm package in R.
random <- plm(Y~log1p(A)+B+C, index=c("state","year"),model="random",data=data)
My reasoning is that with a time invariant variable I should be using random rather than fixed effect model.
My question is: Is my model and thinking correct?
Thank you for your help in advance.

You base your answer about the decision between fixed and random effect soley on computational grounds. Please see the specific assumptions associated with the different models. The Hausman test is often used to discriminate between the fixed and the random effects model, but should not be taken as the definite answer (any good textbook will have further details).
Also pooled OLS could yield a good model, if it applies. Computationally, pooled OLS will also give you estimates for time-invariant variables.

Related

Coefficient value of covariate in Cox PH model is too big

I am trying to develop Cox PH model with time-varying covariates in R. I use coxph function from survival package. There was not any trouble during estimation process, though coefficient value of one covariates is too large, in particular, 2.5e+32.
I can't guess what is reason of this problem and how to tackle it. This variable is nonstationary and proportional assumption is violated. Does either of this facts may cause such a big value of coefficient?
More information could help framing your problem.
Anyway, I doubt non-proportionality is to blame. It would imply that you have some outliers heavily biasing your coefficient beyond reasonable expectations. You could give this a quick look by plotting the output of cox.zph.
Another possible explanation is that this rather depends on the unit of measure you used to define your covariate. Can the magnitude of the coefficient be meaningfully interpreted? If so, you could simply re-scale/standardise/log-transform that covariate to obtain a 'more manageable' coefficient (if this is theoretically appropriate).
This could also be due to the so called 'complete separation', which has been discussed here and here.

How to interpret a VAR model without sigificant coefficients?

I am trying to investigate the relationship between some Google Trends Data and Stock Prices.
I performed the augmented ADF Test and KPSS test to make sure that both time series are integrated of the same order (I(1)).
However, after I took the first differences, the ACF plot was completely insigificant (except for 1 of course), which told me that the differenced series are behaving like white noise.
Nevertheless I tried to estimate a VAR model which you can see attached.
As you can see, only one constant is significant. I have already read that because Stocks.ts.l1 is not significant in the equation for GoogleTrends and GoogleTrends.ts.l1 is not significant in the equation for Stocks, there is no dynamic between the two time series and both can also be models independently from each other with a AR(p) model.
I checked the residuals of the model. They fulfill the assumptions (normally distributed residuals are not totally given but ok, there is homoscedasticity, its stable and there is no autocorrelation).
But what does it mean if no coefficient is significant as in the case of the Stocks.ts equation? Is the model just inappropriate to fit the data, because the data doesn't follow an AR process. Or is the model just so bad, that a constant would describe the data better than the model? Or a combination of the previous questions? Any suggestions how I could proceed my analysis?
Thanks in advance

Can I trust a full glmer model that converges ONLY with bobyqa and with contrast sum coding?

I am using R 3.2.0 with lme4 version 1.1.8. to run a mixed effects logistic regression model on some binomial data (coded as 0 and 1) from a psycholinguistic experiment. There are 2 categorical predictors (one with 2 levels and one with 3 levels) and two random terms (participants and items). I am using sum coding for the predictors (i.e. contr.sum..) which gives me the effects and interactions that I am interested in.
I find that the full model (with fixed effects and interactions, plus random intercepts AND slopes for the two random terms) converges ONLY when I specify (optimizer="bobyqa"). If I do not specify the optimizer, the model converges only after simplifying the model drastically. The same thing happens when I use the default treatment coding, even when I specify optimizer="bobyqa".
My first question is why is this happening and can I trust the output of the full model?
My second question is whether this might be due to the fact that my data is not fully balanced, in the sense that my conditions do not have exactly the same number of observations. Are there special precautions one must take when the data is not full balanced? Can one suggest any reading on this particular case?
Many thanks
You should take a look at the ?convergence help page of more recent versions of lme4 (or you can read it here). If the two fits using different optimizers give similar estimated parameters (despite one giving convergence warnings and the other not), and the fits with different contrasts give the same log-likelihood, then you probably have a reasonable fit.
In general lack of balance lowers statistical power and makes fitting more difficult, but mildly to moderate unbalanced data should present no particular problems.

R: Which variables to include in model?

I'm fairly new to R and am currently trying to find the best model to predict my dependent variable from a number of predictor variables. I have 20 precictor variables and I want to see which ones I should include in my model and which ones I should exclude.
I am currently just running models with different predictor variables in each and comparing them to see which one has the lowest AIC, but this is taking a really long time. Is there an easier way to do this?
Thank you in advance.
This is more of a theoretical question actually...
In principle, if all of the predictors are actually exogenous to the model, they can all be included together and assuming you have enough data (N >> 20) and they are not too similar (which could give rise to multi-collinearity), that should help prediction. In practice, you need to think about whether each of (or any of) your predictors are actually exogenous to the model (that is, independent of the error term in the model). If they are not, then they will impart a bias on the estimates. (Also, omitting explanatory variables that are actually necessary imparts a bias.)
If predictive accuracy (even spurious in-sample accuracy) is the goal, then techniques like LASSO (as mentioned in the comments) could also help.

Mixed Logit fitted probabilities in RSGHB

My question has to do with using the RSGHB package for predicting choice probabilities per alternative by applying mixed logit models (variation across respondents) with correlated coefficients.
I understand that the choice probabilities are simulated on an individual level and in order to get preference share an average of the individual shares would do. All the sources I have found treat each prediction as a separate simulation which makes the whole process cumbersome if many predictions are needed.
Since one can save the respondent specific coefficient draws wouldn't it be faster to simply apply the logit transform to each each (vector of) coefficient draw? Once this is done new or existing alternatives could be calculated faster than rerunning a whole simulation process for each required alternative. For the time being using a fitted() approach will not help me understand how prediction actually works.

Resources