I would like to write an R algorithm which would perform the Maximum Likelihood estimation of a binary choice model (probit/logit, it does not really matter) with the following structure for the latent variable:
I understand the logic given in this answer; however, I do not understand how to account for the lagged value of the latent variable.
I'm considering several models for count data: GLM, GLMM, zero-inflated, and zero-inflated mixed models.
All my work was done in R.
Prior studies indicate that excess zeros and over-dispersion need to be considered in count data analysis.
So I tried the following tests.
1. zero excess
A Vuong test was performed to compare the zero-inflated model with the GLM.
The vuong function of the pscl package was used.
ZIP vs. GLM Poisson
ZINB vs. GLM NB
Significant results were obtained from the above two tests (p<0.05).
2. over-dispersion
A dispersion test was performed on the Poisson model to check whether over-dispersion needs to be considered in the real data.
The dispersiontest function of the AER package was used (Cameron and Trivedi, 1990).
The test rejects the null hypothesis (p<0.05).
In addition, the dispersion parameter (1/theta) was confirmed to be about 0.39.
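As a rough cross-check on the over-dispersion step, the Pearson dispersion statistic of a Poisson GLM can be computed in base R. This is only a sketch on simulated data (the variables `x`, `y` and the negative binomial settings are assumptions, not the traffic data), and it is a simpler diagnostic than the regression-based test in AER::dispersiontest. Values well above 1 suggest that the Poisson variance assumption is too restrictive.

```r
# Simulated over-dispersed counts (hypothetical data, not the road data)
set.seed(42)
n  <- 500
x  <- rnorm(n)
mu <- exp(0.5 + 0.3 * x)
y  <- rnbinom(n, size = 2, mu = mu)   # NB2: Var(y) = mu + mu^2 / size

# Poisson GLM, which assumes Var(y) = mu
fit <- glm(y ~ x, family = poisson)

# Pearson dispersion statistic: sum of squared Pearson residuals / residual df
disp <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
disp
```

A value close to 1 would be consistent with the Poisson model; here the data were generated with extra variance, so the statistic should come out clearly larger.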
However, I have not yet found a method to verify whether random effects should be considered.
My data are yearly traffic accident counts for each road, i.e. longitudinal count data.
I was told by a professor of statistics that a mixed model should be used considering road heterogeneity.
Therefore, I constructed GLMM poisson/NB and zero-inflated mixed poisson/NB using random effects by road and confirmed the results.
The GLMMs were fitted with glmer from lme4, and the zero-inflated mixed models with glmmTMB from the glmmTMB package.
I tried the Hausman test at first. However, this test compares a fixed-effects model with a random-effects model, and I considered it inappropriate for count data (the model is not linear).
Crucially, I found no previous study that used the Hausman test to assess the random effects of a mixed model for count data.
Therefore, my question is as follows:
1. I would like to know if there is a previous study that identifies the reason for considering random effects when modeling longitudinal data.
2. Is there a validation method to verify the significant effects of random effects in the mixed model?
The AIC and BIC comparison has already been carried out.
3. If there is such a method, which R package implements it, and how is it used?
Using summary(fit.measures = TRUE), I am able to access extensive information about the fit of models stored in lavaan model objects. In this output (exemplified in the accompanying image), several lines compare the user's specified model to two alternative models:
"Baseline Model"
"Unrestricted Model"
I am looking for a reasonably precise explanation of the models implied by each of these terms, since they can mean different things within the structural equation modeling community. Ideally, I would be able to extract the model that is implied by each term after e.g. using lavaan::cfa().
Currently, the tutorial does not provide any explanation, while the package documentation states the baseline model is "the independence model, assuming all variables are uncorrelated." However, it is not clear what is meant by "all variables", and the example it provides of an independence model on p.79 assumes exogenous variables are correlated due to the default settings in lavaan.
Similarly, p.34 of the documentation does not explain what is meant by a "variable" when it notes:
"...the model is defined as the unrestricted model. The following free
parameters are included: all covariances/correlations among the
variables, variances for continuous variables, means for continuous
variables, thresholds for ordered variables, and if exogenous
variables are included (ov.names.x is not empty) while some variables
are ordered, also the regression slopes enter the model"
Not sure this is an appropriate post because it is not about programming. The answers can be found in introductory SEM textbooks.
"the independence model, assuming all variables are uncorrelated." However, it is not clear what is meant by "all variables"
All endogenous (or modeled, explained) variables are uncorrelated in the independence model. Exogenous variables are not explained by the model, but instead are taken as given (like in OLS regression). The independence model is used to calculate incremental fit indices (e.g., CFI and TLI; see Bentler & Bonett, 1980, regarding their basic structure). The default baseline.model is the independence model, but you can fit whatever custom model you want to use (Widaman & Thompson, 2003), which should be nested within your target model(s), and pass it to fitMeasures().
The unrestricted model is on the opposite end of the continuum. Your target model(s) are nested within it because it is saturated, which reproduces your observed means and covariance matrix better than any restricted model could. Thus, it serves as the reference model for the likelihood ratio test of exact model fit (the chi-squared test statistic under Model Test User Model:).
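To make the two reference models concrete, here is a minimal base-R sketch for the simplest case: continuous indicators only, no exogenous covariates, complete data. The simulated data and all object names are hypothetical, and scaling the statistic by N (rather than N - 1) is an assumption about lavaan's default normal-likelihood setting. The unrestricted model reproduces the sample covariance matrix exactly, while the independence baseline keeps only the variances.

```r
# Toy data: four continuous indicators sharing one factor (simulated)
set.seed(123)
n <- 300
f <- rnorm(n)
X <- cbind(y1 = f + rnorm(n), y2 = f + rnorm(n),
           y3 = f + rnorm(n), y4 = f + rnorm(n))
p <- ncol(X)

S <- cov(X) * (n - 1) / n          # ML (divide-by-N) sample covariance
Sigma_unrestricted <- S            # saturated: all variances and covariances free
Sigma_baseline     <- diag(diag(S))  # independence: variances only, covariances 0

# ML discrepancy between S and a model-implied covariance matrix
ml_discrepancy <- function(S, Sigma) {
  as.numeric(determinant(Sigma)$modulus - determinant(S)$modulus) +
    sum(diag(S %*% solve(Sigma))) - nrow(S)
}

chisq_unrestricted <- n * ml_discrepancy(S, Sigma_unrestricted)  # essentially 0
chisq_baseline     <- n * ml_discrepancy(S, Sigma_baseline)
df_baseline        <- p * (p - 1) / 2   # one df per covariance fixed to 0
```

With a fitted lavaan object, the corresponding reported quantities should be available through fitMeasures(fit, c("baseline.chisq", "baseline.df")); incremental indices such as CFI compare the target model's chi-square against this baseline value.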
References
Bentler, P. M., & Bonett, D. G. (1980). Significance tests and goodness of fit in the analysis of covariance structures. Psychological Bulletin, 88(3), 588–606. https://doi.org/10.1037/0033-2909.88.3.588
Widaman, K. F., & Thompson, J. S. (2003). On specifying the null model for incremental fit indices in structural equation modeling. Psychological Methods, 8(1), 16–37. https://doi.org/10.1037/1082-989X.8.1.16
I have a huge dataset and I'm trying to build a good predictive linear model using the relaimpo package.
Using the calc.relimp function with type = "lmg", I get the relative importance of each variable. Although the proportion of variance explained by the model is only 52%, I want to go ahead and build a linear model using these variables.
Is there a way to build a lm model using these variables and somehow take into account the relative importance values into the model?
I'm not too familiar with this and was thinking maybe something along the lines of weighting each variable based on its relative importance value...?
I'm not a statistician, so I won't give you any Greek symbols, but I think you are confusing a few things.
As you correctly say, the relative importances based on the LMG method are more or less some sort of variance decomposition in case of correlated predictor variables, i.e. it tells you how much of your variance in the model is explained by which predictor.
However, this doesn't have to do anything with the lm function and its estimation itself. In fact, the R² of your lm model is exactly the same as you'll get by summing up the relative importances from calc.relimp.
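To illustrate why the LMG shares add up to the model's R², here is a small base-R sketch that computes them by brute force for three predictors. The data and helper names (`r2`, `perms`) are made up for illustration, and this is not how relaimpo implements it internally; it only mirrors the definition (the average sequential gain in R² over all orderings of the predictors), which corresponds to the non-normalized output (rela = FALSE).

```r
# Simulated data with correlated predictors (hypothetical example)
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)
x3 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + 0.5 * x3 + rnorm(n)
d  <- data.frame(y, x1, x2, x3)
preds <- c("x1", "x2", "x3")

# R^2 of the model containing the given predictors (0 for the empty model)
r2 <- function(vars) {
  if (length(vars) == 0) return(0)
  summary(lm(reformulate(vars, response = "y"), data = d))$r.squared
}

# All orderings of a character vector
perms <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    for (p in perms(v[-i])) out[[length(out) + 1]] <- c(v[i], p)
  }
  out
}

# LMG: average over orderings of the R^2 gained when a predictor enters
lmg <- setNames(numeric(length(preds)), preds)
orderings <- perms(preds)
for (ord in orderings) {
  for (k in seq_along(ord)) {
    gain <- r2(ord[seq_len(k)]) - r2(ord[seq_len(k - 1)])
    lmg[ord[k]] <- lmg[ord[k]] + gain
  }
}
lmg <- lmg / length(orderings)

lmg
all.equal(sum(lmg), r2(preds))   # the shares sum to the full-model R^2
```

The shares are a decomposition of the fit that lm() already achieves, which is why they cannot be fed back into lm() as weights.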
There is no way to tell the lm function to pay more attention to a certain predictor during prediction/estimation.
What you probably want to do is an elastic net (which is a combination of LASSO and RIDGE regression), which basically does what you want, i.e. it shrinks the impact of "unimportant"/small predictors and emphasizes the impact of important/large predictors: https://en.wikipedia.org/wiki/Elastic_net_regularization (Lasso and Ridge regression are linked in the Wikipedia article).
I think this one here is the original package from Jerome Friedman, Trevor Hastie, Rob Tibshirani, et al.: https://cran.r-project.org/web/packages/glmnet/index.html
I'm trying to optimize a multivariate linear regression model lmMod=lm(depend_var~var1+var2+var3+var4....,data=df) and I'm presently checking the assumptions of the model: constant variance of the residuals and absence of auto-correlation. For this I'm using:
Breusch-Pagan test for homo/heteroscedasticity: lmtest::bptest(lmMod)
Durbin Watson test for auto-correlation: durbinWatsonTest(lmMod)
I found examples which are testing either one independent variable at a time:
example for Breusch-Pagan test – one independent variable:
https://datascienceplus.com/how-to-detect-heteroscedasticity-and-rectify-it/
example for Durbin Watson test - one independent variable:
http://math.furman.edu/~dcs/courses/math47/R/library/lmtest/html/dwtest.html
or the whole model with several independent variables at a time:
example for Durbin Watson test – multiple independent variable:
https://www.rdocumentation.org/packages/car/versions/2.1-6/topics/durbinWatsonTest
Here are the questions:
1. Can durbinWatsonTest() and bptest() be fed with a whole multivariate model?
2. If the answer to 1 is yes, how is it then possible to determine which variable is causing heteroscedasticity or auto-correlation in the model in order to fix it, given that each of those tests gives only one p-value for the entire multivariate model?
3. If the answer to 1 is no, the test should then be performed with one independent variable at a time. But homoscedasticity can only be tested AFTER a particular regression has been modelled. Hence the pattern of homo/heteroscedasticity in a univariate regression model lmMod_1=lm(depend_var~var1, data=df) will be different from the pattern in a multivariate regression model lmMod_2=lm(depend_var~var1+var2+var3+var4....,data=df)
Thank you very much in advance for your help!
Let me try to offer some initial help.
The answer to the first question: Yes, you can use the Breusch-Pagan test and the Durbin-Watson test for multivariate models. (However, I have always used dwtest() instead of durbinWatsonTest().)
Also note that dwtest() checks only first-order autocorrelation. Unfortunately, I do not know how to find out which variable is causing the heteroscedasticity or auto-correlation. However, if you encounter these problems, one possible solution is to use robust standard errors: Newey-West for autocorrelation, via coeftest(<regression model>, vcov = NeweyWest), or heteroscedasticity-consistent ones, via coeftest(<regression model>, vcov = vcovHC). Both are available after loading the AER package (coeftest() comes from lmtest, and NeweyWest() and vcovHC() from sandwich, which AER attaches).
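To show that both tests operate on the whole model, here is a base-R sketch that computes the studentized Breusch-Pagan statistic (n times the R² of an auxiliary regression of the squared residuals on the regressors, which is what bptest() reports by default) and the Durbin-Watson statistic by hand. The data are simulated, with the error variance deliberately tied to var1. Looking at the auxiliary regression's coefficients is only an informal hint at which regressor the residual variance tracks, not a formal procedure.

```r
# Simulated data: error spread grows with var1 (hypothetical example)
set.seed(7)
n  <- 300
df <- data.frame(var1 = rnorm(n), var2 = rnorm(n))
df$depend_var <- 1 + 2 * df$var1 + df$var2 + rnorm(n, sd = exp(0.5 * df$var1))

lmMod <- lm(depend_var ~ var1 + var2, data = df)

# Breusch-Pagan (studentized form): regress squared residuals on all regressors
df$u2   <- resid(lmMod)^2
aux     <- lm(u2 ~ var1 + var2, data = df)
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 2, lower.tail = FALSE)  # df = number of regressors

# Informal look at which regressor the residual variance is related to
aux_coefs <- summary(aux)$coefficients

# Durbin-Watson statistic for first-order autocorrelation of the residuals
e  <- resid(lmMod)
dw <- sum(diff(e)^2) / sum(e^2)   # near 2 when there is no autocorrelation

c(BP = bp_stat, p = bp_p, DW = dw)
aux_coefs
```

The Durbin-Watson statistic depends on the order of the rows, so it is only meaningful when the observations have a natural (for example, time) ordering.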
Background
The reference manual for the gbm package states that the interact.gbm function computes Friedman's H-statistic to assess the strength of variable interactions. The H-statistic is on the scale of [0-1].
The reference manual for the dismo package does not reference any literature for how the gbm.interactions function detects and models interactions. Instead it gives a list of general procedures used to detect and model interactions. The dismo vignette "Boosted Regression Trees for ecological modeling" states that the dismo package extends functions in the gbm package.
Question
How does dismo::gbm.interactions actually detect and model interactions?
Why
I am asking this question because gbm.interactions in the dismo package yields results >1, which the gbm package reference manual says is not possible.
I checked the tar.gz for each of the packages to see if the source code was similar. It is different enough that I cannot determine if these two packages are using the same method to detect and model interactions.
To summarize, the difference between the two approaches boils down to how the "partial dependence function" of the two predictors is estimated.
The dismo package is based on code originally given in Elith et al., 2008 and you can find the original source in the supplementary material. The paper very briefly describes the procedure. Basically the model predictions are obtained over a grid of two predictors, setting all other predictors at their means. The model predictions are then regressed onto the grid. The mean squared errors of this model are then multiplied by 1000. This statistic indicates departures of the model predictions from a linear combination of the predictors, indicating a possible interaction.
From the dismo package, we can also obtain the relevant source code for gbm.interactions. The interaction test boils down to the following commands (copied directly from source):
interaction.test.model <- lm(prediction ~ as.factor(pred.frame[,1]) + as.factor(pred.frame[,2]))
interaction.flag <- round(mean(resid(interaction.test.model)^2) * 1000,2)
pred.frame contains a grid of the two predictors in question, and prediction is the prediction from the original gbm fitted model where all but two predictors under consideration are set at their means.
This is different from Friedman's H statistic (Friedman & Popescu, 2005), which is estimated via formula (44) for any pair of predictors. This is essentially the departure from additivity for any two predictors, averaging over the values of the other variables, NOT setting the other variables at their means. It is expressed as a proportion of the total variance of the partial dependence function of the two variables (or model-implied predictions), so it will always be between 0 and 1.
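The dismo-style statistic described above can be reproduced on a toy example to see why it is not bounded by 1. In this sketch, the grid, the two made-up "prediction" surfaces, and the wrapper name `interaction_flag` are assumptions for illustration; only the two lines inside the wrapper follow the commands quoted from the dismo source. No gbm model is fitted here, the predictions are simply generated from known functions.

```r
# Grid over two hypothetical predictors
pred.frame <- expand.grid(v1 = seq(0, 1, length.out = 20),
                          v2 = seq(0, 1, length.out = 20))

# Wrapper around the two commands quoted from gbm.interactions
interaction_flag <- function(prediction, pred.frame) {
  interaction.test.model <- lm(prediction ~ as.factor(pred.frame[, 1]) +
                                 as.factor(pred.frame[, 2]))
  round(mean(resid(interaction.test.model)^2) * 1000, 2)
}

# Purely additive surface: main effects reproduce it exactly
additive <- 2 * pred.frame$v1 + sin(pred.frame$v2)

# Same surface plus a product term, i.e. a genuine interaction
interacting <- additive + 3 * pred.frame$v1 * pred.frame$v2

flag_additive    <- interaction_flag(additive, pred.frame)
flag_interacting <- interaction_flag(interacting, pred.frame)
c(additive = flag_additive, interacting = flag_interacting)
```

Because the statistic is a residual mean square multiplied by 1000, it is in (scaled) squared units of the response and grows with the size of the interaction, so values above 1 are expected and are not comparable to the H-statistic from interact.gbm.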