Panel VAR estimation using GMM on panelvar - r

I've been trying to estimate a panel VAR (pVAR) using GMM in R with the package panelvar. I'm estimating a dynamic panel VAR with two-step GMM using first differences.
I have a balanced panel of 378 observations with a group variable (id) and a time variable (year): 14 observations per group (the same for every group) and 27 groups in total. In total, I have 120 instruments. I'm a bit concerned about the results of the Hansen J-test and I'm looking for an explanation: I have a Hansen J-test statistic of 0 with a p-value of 1. To my understanding, this would mean that the model is correctly specified, but the fact that the p-value is exactly 1.000 makes me suspect that something deeper is going on.
In my estimation, I have 7 dependent variables, 2 exogenous variables, and I'm using 4 lagged instruments per dependent variable. Why is the p-value of the Hansen test so high?
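For reference, a sketch of the kind of call I'm running, with placeholder names (mydata, y1–y7, x1, x2 are stand-ins, not my actual data):
library(panelvar)
fit <- pvargmm(
  dependent_vars = c("y1", "y2", "y3", "y4", "y5", "y6", "y7"),
  lags = 1,                           # lag order of the pVAR
  exog_vars = c("x1", "x2"),
  transformation = "fd",              # first differences
  data = mydata,
  panel_identifier = c("id", "year"),
  steps = "twostep",
  max_instr_dependent_vars = 4,       # 4 lagged instruments per dependent variable
  collapse = FALSE
)
hansen_j_test(fit)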
Thanks in advance!

Related

Predict with only one observation with randomForest in R

I am studying loan default prediction, currently using R's randomForest package. My first model had an accuracy of 98%, with a sensitivity of 0.98 and a specificity of 0.97 on the test data using the predict command.
The training and testing data had n = 2865 and n = 319 observations, respectively.
In a real situation, where I would like to predict the probability of loan default for just one company, i.e., only 1 observation in the test data, would I have a problem?
The dataset I used contains only 8 predictor variables and 1 outcome variable. According to the literature, there are many more variables to be considered. Why did I get good results with such a small set of predictors? It seems "weird" to me.
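Concretely, this is what I mean by predicting a single observation (a minimal sketch with hypothetical names; default, train_data, and test_data are stand-ins for my actual objects):
library(randomForest)
rf <- randomForest(default ~ ., data = train_data)   # assuming 'default' is a factor
one_company <- test_data[1, ]                        # a single-row data frame
predict(rf, newdata = one_company, type = "prob")    # predicted class probabilities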

What post-hoc test should be used for a glmer model with a continuous and a categorical predictor variable?

I'm a bit of a newbie with stats and R, so I need some direction to find a suitable post-hoc test for my glmer model.
The model has a binary dependent variable (absent/present) and the predictor variables are interaction terms between a continuous variable (e.g., temp) and a categorical variable (species, n = 3). Only the interaction terms, rather than the continuous variables in isolation, produce significant results when an anova is run on the model. Species by itself has a large effect because one species is much rarer than the others. I'm trying to tease apart how the presence of these species varies across pH and between species.
I've tried lsmeans with a Tukey adjustment, emmeans, and Firth's bias-reduced logistic regression (logistf). I ran the effects function on the interaction terms, so I had a rough expectation of what a post hoc could show, but the results logistf produced were not what I was expecting. lsmeans and emmeans both gave the same results and ignored the continuous variable, I assume because it's not a factor.
When I run Firth's regression it produces chi-squared values that are infinite or p-values that are astronomically small, even though what I saw through effects suggested no significant difference. With the interaction term I can't tell whether there truly is an effect of the environmental variable or whether the significant effect is driven by the difference between species. Based on what I have seen of the logistf function, I didn't think it would produce a chi-square statistic at all. Is this a coding issue or is it my data?
If I wasn't clear enough about something, please let me know; any suggestions or advice would be massively appreciated. Thanks!
The model and test code I used are below:
### glmer model
Large <- glmer(Abs.Pres ~ Species:Q.Depth + Species:Conductivity + Species:Temp +
                 Species:pH + Species:DO.P + (1 | QID),
               nAGQ = 0,
               family = binomial,
               data = Stacked_Pref)
anova(Large)
Output: Analysis of Variance Table
                     npar  Sum Sq Mean Sq F value
Species:Q.Depth         3 234.904  78.301 78.3014
Species:Conductivity    3  32.991  10.997 10.9970
Species:Temp            3  39.001  13.000 13.0004
Species:pH              3  25.369   8.456  8.4562
Species:DO.P            3  34.930  11.643 11.6434
### Firth's
Lp <- logistf(Abs.Pres ~ Species:pH, data = Stacked_Pref,
              contrasts.arg = list(pH = "contr.treatment", Species = "contr.sum"))
> Lp
logistf(formula = Abs.Pres ~ Species:pH, data = Stacked_Pref,
contrasts.arg = list(pH = "contr.treatment", Species = "contr.sum"))
Model fitted by Penalized ML
Confidence intervals and p-values by Profile Likelihood
coef se(coef) lower 0.95 upper 0.95 Chisq p
(Intercept) 1.9711411 0.57309880 0.8552342 3.1015114 12.09107 5.066380e-04
SpeciesGoby:pH -0.3393185 0.07146049 -0.4804047 -0.2003108 23.31954 1.371993e-06
SpeciesMosquito:pH -0.3001385 0.07127771 -0.4408186 -0.1614419 18.24981 1.937453e-05
SpeciesRFBE:pH -0.4771393 0.07232469 -0.6200179 -0.3365343 45.73750 1.352096e-11
Likelihood ratio test=267.0212 on 3 df, p=0, n=3945
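One hedged option for a continuous-by-categorical interaction is emtrends() from the emmeans package, which compares the species-specific pH slopes rather than dropping the covariate; a minimal sketch, assuming the fitted Large model above:
library(emmeans)
ph_trends <- emtrends(Large, ~ Species, var = "pH")  # estimated pH slope per species
pairs(ph_trends, adjust = "tukey")                   # pairwise comparisons of those slopes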

Summary measure lost when adding mods to rma.glmm

At the moment I'm trying to calculate the (adjusted) IRLN with the rma.glmm function of the metafor package.
My data is a dataframe that looks like the following:
head(data)
  patient-years events   age
1      180.0000      4    NA
2      116.2500     13 51.83
3       66.2500      6 48.00
4      423.6333     21 58.00
5      142.1783      7 53.20
6     1117.3167     72 59.90
The function to calculate the IRLN works fine:
y <- rma.glmm(xi = events, ti = `patient-years`, data = data, measure = "IRLN", method = "ML")
And gives me the following forest plot:
metafor::forest.rma(y)
[forest plot]
However, when I want to adjust my model:
nh <- rma.glmm(xi = events, ti = `patient-years`, data = datanh,
               measure = "IRLN", mods = ~ age, method = "ML")
(where age is a numeric vector), the summary measure is lost:
[adjusted forest plot]
I've tried all I can think of, but really don't know how to fix this. Do you have any suggestions?
When you add a moderator to the model, there is no longer a single (average) effect: the size of the average effect then depends on the value of the moderator. The gray-shaded polygons in the forest plot reflect the estimated average effects corresponding to the values of 'age' for the included studies.
You could compute the predicted average effect for a particular value of age with the predict() function, i.e.:
predict(nh, newmods = <age value>, transf = exp)
(transf = exp to obtain the estimated average IR for the specified age value).
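For instance, with a hypothetical age value of 55:
predict(nh, newmods = 55, transf = exp)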
Some might plug the average of the age values observed in the studies into this and interpret the result as an adjusted estimate. One can debate whether this terminology ('adjusted effect') is correct.

Estimation residuals from a VAR (vars package)

I'm currently estimating a VAR model, followed by the estimation of generalized impulse response functions. To obtain SEs for those, I'm supposed to do some bootstrapping first.
This procedure starts with "estimating the parameters of the VAR model and extracting the estimation residuals, denoted Ût."
Now, I'm estimating my VAR model with the vars package as follows:
varendo <- data.frame(value_ts, value2_ts, price_ts, price2_ts)
library(vars)
fitvar <- VAR(varendo, type = "both",
              exogen = christmas,  # the Christmas dummy belongs in exogen (season= expects an integer frequency)
              lag.max = 12, ic = "AIC")
summary(fitvar)
The model contains 5 variables with 104 observations, a trend, a constant, and a dummy for the Christmas period, and the selected model has 5 lags.
Now when I extract its residuals with residuals(fitvar), I get a list of 99 numbers per variable.
I'm supposed to use these residuals to generate bootstrap residuals (randomly drawing with replacement from the obtained ones) and use them with the estimated equations to generate new, bootstrapped time series, re-estimate the VAR and IRFs, and in the end obtain SEs for my estimations.
Since I'm supposed to recursively compute the new time series as

y*(t) = v̂(t) + Â_1 y*(t-1) + ... + Â_p y*(t-p) + û*(t),

where v̂(t) collects the deterministic terms, shouldn't I get a list of 104 residuals per variable instead of 99? I'm a bit confused by this whole generating process.
Any help is more than appreciated.
In an autoregressive (AR) model, variables are forecast using linear combinations of their own past values. Since you have set lag.max = 12, you are allowing VAR to select a model that uses, at most, 12 lagged values as predictors.
Since your model uses 5 lags, VAR cannot produce fitted values for the first 5 observations: those observations are only used as predictors, starting with the fitted value for the 6th observation. Therefore, the number of residuals equals the number of observations minus the model order: 104 − 5 = 99.
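As a minimal sketch of the recursion, assuming the fitvar object from the question and leaving out the deterministic terms (constant, trend, Christmas dummy) for brevity, so only the lag recursion is shown:
library(vars)
p <- fitvar$p                    # selected VAR order (5 here)
u <- residuals(fitvar)           # 99 x k matrix of residuals
A <- Acoef(fitvar)               # list of estimated lag coefficient matrices
y <- as.matrix(fitvar$y)         # original series, 104 x k
n <- nrow(y); k <- ncol(y)

ustar <- u[sample(nrow(u), n - p, replace = TRUE), ]  # resample residual rows
ystar <- y                       # the first p observations serve as starting values
for (t in (p + 1):n) {
  yhat <- rep(0, k)
  for (j in 1:p) yhat <- yhat + A[[j]] %*% ystar[t - j, ]
  ystar[t, ] <- yhat + ustar[t - p, ]
}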

multinomial logistic multilevel models in R

Problem: I need to estimate a set of multinomial logistic multilevel models and can’t find an appropriate R package. What is the best R package to estimate such models? STATA 13 recently added this feature to their multilevel mixed-effects models – so the technology to estimate such models seems to be available.
Details: A number of research questions require the estimation of multinomial logistic regression models in which the outcome variable is categorical. For example, biologists might be interested in investigating which types of trees (e.g., pine trees, maple trees, oak trees) are most impacted by acid rain. Market researchers might be interested in whether there is a relationship between the age of customers and the frequency of shopping at Target, Safeway, or Walmart. These cases have in common that the outcome variable is categorical (unordered), so multinomial logistic regression is the preferred method of estimation. In my case, I am investigating differences in types of human migration, with the outcome variable (mig) coded 0=not migrated, 1=internal migration, 2=international migration. Here is a simplified version of my data set:
migDat <- data.frame(hhID = 1:21, mig = rep(0:2, times = 7),
                     age = ceiling(runif(21, 15, 90)),
                     stateID = rep(letters[1:3], each = 7),
                     pollution = rep(c("high", "low", "moderate"), each = 7),
                     stringsAsFactors = FALSE)
hhID mig age stateID pollution
1 1 0 47 a high
2 2 1 53 a high
3 3 2 17 a high
4 4 0 73 a high
5 5 1 24 a high
6 6 2 80 a high
7 7 0 18 a high
8 8 1 33 b low
9 9 2 90 b low
10 10 0 49 b low
11 11 1 42 b low
12 12 2 44 b low
13 13 0 82 b low
14 14 1 70 b low
15 15 2 71 c moderate
16 16 0 18 c moderate
17 17 1 18 c moderate
18 18 2 39 c moderate
19 19 0 35 c moderate
20 20 1 74 c moderate
21 21 2 86 c moderate
My goal is to estimate the impact of age (independent variable) on the odds of (1) migrating internally vs. not migrating, (2) migrating internationally vs. not migrating, (3) migrating internally vs. migrating internationally. An additional complication is that my data operate at different aggregation levels (e.g., pollution operates at the state-level) and I am also interested in predicting the impact of air pollution (pollution) on the odds of embarking on a particular type of movement.
Clunky solutions: One could estimate a set of separate logistic regression models by reducing the data set for each model to only two migration types (e.g., Model 1: only cases coded mig=0 and mig=1; Model 2: only cases coded mig=0 and mig=2; Model 3: only cases coded mig=1 and mig=2). Such a simple multilevel logistic regression model could be estimated with lme4, but this approach is less than ideal because it does not appropriately account for the impact of the omitted cases. A second solution would be to run multinomial logistic multilevel models in MLwiN through R using the R2MLwiN package. But since MLwiN is not open source and the generated objects are difficult to use, I would prefer to avoid this option. Based on a comprehensive internet search, there seems to be some demand for such models, but I am not aware of a good R package. So it would be great if some experts who have run such models could provide a recommendation, and if there is more than one package, maybe indicate some advantages/disadvantages. I am sure that such information would be a very helpful resource for many R users. Thanks!!
Best,
Raphael
There are generally two ways of fitting a multinomial model of a categorical variable with J groups: (1) simultaneously estimating J−1 contrasts; (2) estimating a separate logit model for each contrast.
Do these two methods produce the same results? No, but the results are often similar.
Which method is better? Simultaneous fitting is more precise (see below for an explanation why).
Why would someone use separate logit models then? (1) The lme4 package has no routine for simultaneously fitting multinomial models, and there is no other multilevel R package that can do this. So separate logit models are presently the only practical solution if someone wants to estimate multilevel multinomial models in R (a sketch follows after the next paragraph). (2) As some powerful statisticians have argued (Begg and Gray, 1984; Allison, 1984, p. 46-47), separate logit models are much more flexible, as they permit independent specification of the model equation for each contrast.
Is it legitimate to use separate logit models? Yes, with some disclaimers. This method is called the “Begg and Gray Approximation”. Begg and Gray (1984, p. 16) showed that this “individualized method is highly efficient”. However, there is some efficiency loss and the Begg and Gray Approximation produces larger standard errors (Agresti 2002, p. 274). As such, it is more difficult to obtain significant results with this method and the results can be considered conservative. This efficiency loss is smallest when the reference category is large (Begg and Gray, 1984; Agresti 2002). R packages that employ the Begg and Gray Approximation (not multilevel) include mlogitBMA (Sevcikova and Raftery, 2012).
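As a hedged sketch of this separate-logits approach with lme4, using the toy migDat example from the question (far too small for a meaningful fit, but it shows the structure):
library(lme4)
## contrast 1: internal migration vs. no migration
m10 <- glmer(I(mig == 1) ~ age + pollution + (1 | stateID),
             family = binomial, data = subset(migDat, mig %in% c(0, 1)))
## contrast 2: international migration vs. no migration
m20 <- glmer(I(mig == 2) ~ age + pollution + (1 | stateID),
             family = binomial, data = subset(migDat, mig %in% c(0, 2)))
## contrast 3: international vs. internal migration
m21 <- glmer(I(mig == 2) ~ age + pollution + (1 | stateID),
             family = binomial, data = subset(migDat, mig %in% c(1, 2)))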
Why is a series of individual logit models imprecise?
In my initial example we have a variable (migration) that can have three values A (no migration), B (internal migration), C (international migration). With only one predictor variable x (age), multinomial models are parameterized as a series of binomial contrasts as follows (Long and Cheng, 2004 p. 277):
Eq. 1: Ln(Pr(B|x)/Pr(A|x)) = b0,B|A + b1,B|A (x)
Eq. 2: Ln(Pr(C|x)/Pr(A|x)) = b0,C|A + b1,C|A (x)
Eq. 3: Ln(Pr(B|x)/Pr(C|x)) = b0,B|C + b1,B|C (x)
For these contrasts the following equations must hold:
Eq. 4: Ln(Pr(B|x)/Pr(A|x)) − Ln(Pr(C|x)/Pr(A|x)) = Ln(Pr(B|x)/Pr(C|x))
Eq. 5: b0,B|A − b0,C|A = b0,B|C
Eq. 6: b1,B|A − b1,C|A = b1,B|C
The problem is that in practice these equations (Eq. 4-6) will not hold exactly, because the coefficients are estimated on slightly different samples: only cases from the two contrasting groups are used and cases from the third group are omitted. Programs that simultaneously estimate the multinomial contrasts make sure that Eq. 4-6 hold (Long and Cheng, 2004 p. 277). I don’t know exactly how this “simultaneous” model solving works – maybe someone can provide an explanation? Software that does simultaneous fitting of multilevel multinomial models includes MLwiN (Steele 2013, p. 4) and STATA (xlmlogit command, Pope, 2014).
References:
Agresti, A. (2002). Categorical data analysis (2nd ed.). Hoboken, NJ: John Wiley & Sons.
Allison, P. D. (1984). Event history analysis. Thousand Oaks, CA: Sage Publications.
Begg, C. B., & Gray, R. (1984). Calculation of polychotomous logistic regression parameters using individualized regressions. Biometrika, 71(1), 11-18.
Long, S. J., & Cheng, S. (2004). Regression models for categorical outcomes. In M. Hardy & A. Bryman (Eds.), Handbook of data analysis (pp. 258-285). London: SAGE Publications, Ltd.
Pope, R. (2014). In the spotlight: Meet Stata's new xlmlogit command. Stata News, 29(2), 2-3.
Sevcikova, H., & Raftery, A. (2012). Estimation of multinomial logit model using the Begg & Gray approximation.
Steele, F. (2013). Module 10: Single-level and multilevel models for nominal responses concepts. Bristol, U.K.: Centre for Multilevel Modelling.
An older question, but I think a viable option has recently emerged: brms, which uses the Bayesian Stan program to actually run the model. For example, if you want to run a multinomial logistic regression on the iris data:
library(brms)
b1 <- brm(Species ~ Petal.Length + Petal.Width + Sepal.Length + Sepal.Width,
          data = iris, family = "categorical",
          prior = c(set_prior("normal(0, 8)")))
And to get an ordinal regression -- not appropriate for iris, of course -- you'd switch the family="categorical" to family="acat" (or cratio or sratio, depending on the type of ordinal regression you want) and make sure that the dependent variable is ordered.
Clarification per Raphael's comment: this brm call compiles your formula and arguments into Stan code. Stan translates it into C++ and uses your system's C++ compiler, which is required. On a Mac, for example, you may need to install the free Developer Tools to get C++. I'm not sure about Windows; Linux should have C++ installed by default.
Clarification per Qaswed's comment: brms easily handles multilevel models as well, using the R formula syntax (1 | groupvar) to add a group (random) intercept, (1 + foo | groupvar) to add a random intercept and slope, etc.
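For instance, a hypothetical (untested) multilevel multinomial fit for the toy migDat example from the question:
migDat$mig <- factor(migDat$mig)   # categorical outcomes must be factors
b2 <- brm(mig ~ age + pollution + (1 | stateID),
          data = migDat, family = "categorical")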
I'm puzzled that this technique is described as "standard" and "equivalent", though it might well be a good practical solution. (I guess I'd better check out the Allison and Dobson & Barnett references.)
For the simple multinomial case (no clusters, repeated measures, etc.), Begg and Gray (1984) propose using k−1 binomial logits against a reference category as an approximation (though a good one in many cases) to the full-blown multinomial logit. They demonstrate some loss of efficiency when using a single reference category, though it is small when a single high-frequency baseline category is used as the reference.
Agresti (2002: p. 274) provides an example in which there is a small increase in standard errors even when the baseline category constitutes over 70% of the 219 cases in a five-category example.
Maybe it's no big deal, but I don't see how the approximation would get any better by adding a second layer of randomness.
References
Agresti, A. (2002). Categorical data analysis. Hoboken NJ: Wiley.
Begg, C. B., & Gray, R. (1984). Calculation of polychotomous logistic regression parameters using individualized regressions. Biometrika, 71(1), 11–18.
I would recommend the package "mlogit".
I am dealing with the same issue, and one possible solution I found resorts to the Poisson (log-linear/count) equivalent of the multinomial logistic model, as described in this mailing list, these nice slides, or in Agresti (2013: 353-356). Thus, it should be possible to use the glmer(..., family = poisson) function from the package lme4 with some aggregation of the data; a sketch follows the reference below.
Reference:
Agresti, A. (2013) Categorical data analysis. Hoboken, NJ: Wiley.
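A minimal sketch of this idea with hypothetical aggregated data: the exact equivalence uses one nuisance intercept per covariate cell plus category-specific predictor terms, and replacing those fixed intercepts with a random intercept gives the glmer shortcut alluded to above.
library(lme4)
set.seed(1)
## hypothetical aggregated data: counts per cell and outcome category
d <- expand.grid(cell = factor(1:30), cat = factor(c("A", "B", "C")))
d$x <- rep(rnorm(30), times = 3)                 # one cell-level predictor
d$n <- rpois(nrow(d), exp(1 + 0.5 * (d$cat == "B") * d$x))
## exact equivalence: fixed nuisance intercept per cell; the coefficients
## on the category-specific x terms are the multinomial log-odds slopes
fit_glm <- glm(n ~ factor(cell) + cat + I((cat == "B") * x) + I((cat == "C") * x),
               family = poisson, data = d)
## mixed-model shortcut: random cell intercepts instead of fixed ones
fit_glmer <- glmer(n ~ cat + I((cat == "B") * x) + I((cat == "C") * x) + (1 | cell),
                   family = poisson, data = d)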
Since I had the same problem, I recently came across this question. I found a package called ordinal with a cumulative link mixed model function (clmm2) that seems similar to the proposed brm function, but using a frequentist approach.
Basically, you need to set the link function (for instance, logit), you can declare nominal variables (i.e., variables that do not fulfill the proportional odds assumption), set the threshold to "flexible" if you want to allow unstructured cut-points, and finally add the argument random to specify any variable you would like to have a random effect.
I also found the book Multilevel Modeling Using R by W. Holmes Finch, Jocelyn E. Bolin, and Ken Kelley, which illustrates how to use the function from page 151 on, with nice examples.
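A minimal sketch of such a call, using the soup data shipped with ordinal (rather than my own data):
library(ordinal)
data(soup)
fit <- clmm2(SURENESS ~ PROD,          # location (fixed-effects) part
             nominal = ~ GENDER,       # relax proportional odds for GENDER
             random = RESP,            # random intercept per respondent
             link = "logistic",
             threshold = "flexible",   # unstructured cut-points
             data = soup, Hess = TRUE)
summary(fit)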
Here's an implementation (not my own). I'd just work off this code. Plus, this way you'll really know what's going on under the hood.
http://www.nhsilbert.net/docs/rcode/multilevel_multinomial_logistic_regression.R
