R: What is the difference between the "baseline model" and "unrestricted model" in lavaan model summaries?

Using summary(fit.measures = TRUE) I am able to access extensive information about the fit of models stored in lavaan model objects. In this output (exemplified in the accompanying image), several lines compare the user's specified model to two alternative models:
"Baseline Model"
"Unrestricted Model"
I am looking for a somewhat precise explanation of the models implied by each of these terms, since they can mean different things within the structural equation modeling community. Ideally, I would be able to extract the model itself that is implied by each term after, e.g., using lavaan::cfa().
Currently, the tutorial does not provide any explanation, while the package documentation states the baseline model is "the independence model, assuming all variables are uncorrelated." However, it is not clear what is meant by "all variables", and the example of an independence model it provides on p. 79 assumes exogenous variables are correlated due to the default settings in lavaan.
Similarly, p.34 of the documentation does not explain what is meant by a "variable" when it notes:
"...the model is defined as the unrestricted model. The following free
parameters are included: all covariances/correlations among the
variables, variances for continuous variables, means for continuous
variables, thresholds for ordered variables, and if exogenous
variables are included (ov.names.x is not empty) while some variables
are ordered, also the regression slopes enter the model"

Not sure this is an appropriate post because it is not about programming. The answers can be found in introductory SEM textbooks.
"...the independence model, assuming all variables are uncorrelated." However, it is not clear what is meant by "all variables"
All endogenous (or modeled, explained) variables are uncorrelated in the independence model. Exogenous variables are not explained by the model, but instead are taken as given (as in OLS regression). The independence model is used to calculate incremental fit indices (e.g., CFI and TLI; see Bentler & Bonett, 1980, regarding their basic structure). The default baseline.model is the independence model, but you can fit whatever custom model you want to use instead (Widaman & Thompson, 2003), which should be nested within your target model(s), and pass it to fitMeasures().
The unrestricted model is at the opposite end of the continuum. Your target model(s) are nested within it because it is saturated, so it reproduces your observed means and covariance matrix exactly, better than any restricted model could. Thus, it serves as the reference model for the likelihood ratio test of exact model fit (the chi-squared test statistic under Model Test User Model:).
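For concreteness, here is a minimal sketch using the HolzingerSwineford1939 example data shipped with lavaan. It writes both reference models out by hand, and assumes a recent lavaan version in which fitMeasures() accepts a baseline.model argument.

library(lavaan)

# Target model (the CFA example from the lavaan tutorial)
HS.model <- ' visual  =~ x1 + x2 + x3
              textual =~ x4 + x5 + x6
              speed   =~ x7 + x8 + x9 '
fit <- cfa(HS.model, data = HolzingerSwineford1939)

# Baseline (independence) model written out explicitly: only the variances
# of the nine endogenous indicators are free; all covariances are fixed to 0.
# Edit this syntax to define a custom null model instead.
vars <- paste0("x", 1:9)
base.syntax <- paste(paste(vars, "~~", vars), collapse = "\n")
fit.base <- lavaan(base.syntax, data = HolzingerSwineford1939)

# Incremental fit indices computed against the (custom) baseline model
fitMeasures(fit, c("cfi", "tli"), baseline.model = fit.base)

# Unrestricted (saturated) model: free variances plus all pairwise
# covariances among the indicators; it has zero df and a chi-squared of 0.
sat.syntax <- paste(c(paste(vars, "~~", vars),
                      combn(vars, 2, FUN = paste, collapse = " ~~ ")),
                    collapse = "\n")
fit.sat <- lavaan(sat.syntax, data = HolzingerSwineford1939)

# The likelihood ratio test against the saturated model reproduces the
# chi-squared reported under 'Model Test User Model:'
lavTestLRT(fit, fit.sat)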
References
Bentler, P. M., & Bonett, D. G. (1980). Significance tests and goodness of fit in the analysis of covariance structures. Psychological Bulletin, 88(3), 588–606. https://doi.org/10.1037/0033-2909.88.3.588
Widaman, K. F., & Thompson, J. S. (2003). On specifying the null model for incremental fit indices in structural equation modeling. Psychological Methods, 8(1), 16–37. https://doi.org/10.1037/1082-989X.8.1.16

Related

Manually estimate probit model with autoregressive structure in R

I would like to write an R algorithm which would perform the Maximum Likelihood estimation of a binary choice model (probit/logit, it does not really matter) with the following structure for the latent variable:
I understand the logic provided in this answer; however, I do not understand how to account for the presence of the lagged value of the latent variable.

R: Using relative importance (relaimpo package) to build a linear model for prediction?

I have a huge dataset and I'm trying to build a good predictive linear model using the relaimpo package.
Using the calc.relimp function with type = "lmg", I get an output of the relative importance of each variable. Although the proportion of variance explained by the model is only 52%, I want to go ahead and build a linear model using these variables.
Is there a way to build an lm model using these variables and somehow incorporate the relative importance values into the model?
I'm not too familiar with this and was thinking maybe something along the lines of weighting each variable based on its relative importance value...?
I'm not a statistician, so I won't give you any Greek symbols, but I think you are confusing a few things.
As you correctly say, the relative importances based on the LMG method are more or less a variance decomposition in the case of correlated predictor variables, i.e. they tell you how much of the variance explained by the model is attributable to each predictor.
However, this has nothing to do with the lm function and its estimation itself. In fact, the R² of your lm model is exactly the same as what you get by summing up the relative importances from calc.relimp.
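As a quick illustration (a sketch on the built-in mtcars data, not your dataset), the LMG shares returned by calc.relimp() add up to the R² reported by lm():

library(relaimpo)

# Toy fit on the built-in mtcars data
mod <- lm(mpg ~ wt + hp + disp + drat, data = mtcars)
rel <- calc.relimp(mod, type = "lmg")

sum(rel@lmg)            # sum of the LMG shares ...
summary(mod)$r.squared  # ... equals the model's R-squared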
There is no way to tell the lm function to pay more attention to a certain predictor during prediction/estimation.
What you probably want to do is an elastic net (which is a combination of LASSO and RIDGE regression), which basically does what you want, i.e. it shrinks the impact of "unimportant"/small predictors and emphasizes the impact of important/large predictors: https://en.wikipedia.org/wiki/Elastic_net_regularization (Lasso and Ridge regression are linked in the Wikipedia article).
I think this one here is the original package from Jerome Friedman, Trevor Hastie, Rob Tibshirani, et al.: https://cran.r-project.org/web/packages/glmnet/index.html
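For example, a minimal elastic-net sketch with cv.glmnet() on the built-in mtcars data (alpha mixes the LASSO and ridge penalties, and cross-validation picks the penalty strength lambda):

library(glmnet)

x <- as.matrix(mtcars[, c("wt", "hp", "disp", "drat")])
y <- mtcars$mpg

cv.fit <- cv.glmnet(x, y, alpha = 0.5)        # alpha = 1 is LASSO, 0 is ridge
coef(cv.fit, s = "lambda.min")                # shrunken coefficients
predict(cv.fit, newx = x, s = "lambda.min")   # predictions from the chosen fit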

R - Testing for homo/heteroscedasticity and collinearity in a multivariate regression model

I'm trying to optimize a multivariate linear regression model lmMod=lm(depend_var~var1+var2+var3+var4....,data=df) and I'm presently working on the assumptions of the model: the constant variance of residuals and the absence of auto-correlation. For this I'm using:
Breusch-Pagan test for homo/heteroscedasticity: lmtest::bptest(lmMod) 
Durbin Watson test for auto-correlation: durbinWatsonTest(lmMod)
I found examples which test either one independent variable at a time:
example for Breusch-Pagan test – one independent variable:
https://datascienceplus.com/how-to-detect-heteroscedasticity-and-rectify-it/
example for Durbin Watson test - one independent variable:
http://math.furman.edu/~dcs/courses/math47/R/library/lmtest/html/dwtest.html
or the whole model with several independent variables at a time:
example for Durbin Watson test – multiple independent variable:
https://www.rdocumentation.org/packages/car/versions/2.1-6/topics/durbinWatsonTest
Here are the questions:
Can durbinWatsonTest() and bptest() be fed with a whole multivariate model?
If the answer to 1 is yes, how is it then possible to determine which variable is causing heteroscedasticity or auto-correlation in the model in order to fix it, given that each of those tests gives only one p-value for the entire multivariate model?
If the answer to 1 is no, the test should then be performed with one independent variable at a time. But in the case of homoscedasticity, it can only be tested AFTER a particular regression has been modelled. Hence the pattern of homo/heteroscedasticity in a univariate regression model lmMod_1=lm(depend_var~var1, data=df) will be different from the pattern in a multivariate regression model lmMod_2=lm(depend_var~var1+var2+var3+var4....,data=df).
Thanks very much in advance for your help!
I would like to try to offer some initial help.
The answer to the first question: Yes, you can use the Breusch-Pagan test and the Durbin-Watson test for multivariate models. (However, I have always used dwtest() instead of durbinWatsonTest().)
Also note that dwtest() checks only first-order autocorrelation. Unfortunately, I do not know how to find out which variable is causing heteroscedasticity or auto-correlation. However, if you encounter these problems, one possible solution is to use a robust estimation method, e.g. Newey-West standard errors (coeftest(regression_model, vcov = NeweyWest)) for autocorrelation, or coeftest(regression_model, vcov = vcovHC) for heteroscedasticity; coeftest() comes from lmtest and the covariance estimators from sandwich, both of which are loaded by the AER package.
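Putting that together, a sketch of the workflow, assuming the column names from the question (depend_var, var1, ..., var4) in a data frame df:

library(lmtest)     # bptest(), dwtest(), coeftest()
library(sandwich)   # vcovHC(), NeweyWest() -- loaded automatically by AER
library(car)        # durbinWatsonTest()

lmMod <- lm(depend_var ~ var1 + var2 + var3 + var4, data = df)

bptest(lmMod)              # Breusch-Pagan: H0 = homoscedastic residuals
dwtest(lmMod)              # Durbin-Watson: first-order autocorrelation only
durbinWatsonTest(lmMod)    # car's version, with a bootstrapped p-value

# Robust inference if either assumption is violated
coeftest(lmMod, vcov = vcovHC(lmMod, type = "HC3"))   # heteroscedasticity-robust
coeftest(lmMod, vcov = NeweyWest(lmMod))              # HAC / autocorrelation-robust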

Quantifying importance of variables in Canonical Correspondence Analysis using R? (x-post from researchgate)

I currently have species abundance data for multiple lakes along with measurements of some environmental variables of those lakes. I decided to do Canonical Correspondence Analysis of the data in R, as demonstrated by ter Braak and Verdonschot (1995), see link: http://link.springer.com/article/10.1007%2FBF00877430 (section: "Ranking environmental variables in importance")
I am not very good with R yet, and I do not have access to the software specified in the article (CANOCO). My problem is that, in order to do stepwise ranking of the importance of environmental variables, I have to obtain the Lambda (is this the same as Wilks' lambda?) and perform a Monte Carlo permutation test on each CCA constrained axis.
Does anybody know how I can do this in R? I would love to be able to use this analysis.
If you want to test effects in a current model, you want the anova() method that vegan provides for cca(), the function that does CCA in that package. See ?anova.cca for details, and perhaps the by = "margin" option to test marginal terms.
To do stepwise selection you have two options
Use the standard step() function that works with an AIC-like statistic for CCA, or
For the sort of selection done in that paper and implemented in CANOCO you want ordistep(). This does forward selection & backward elimination testing changes to models via permutation tests.
Lambda is often used to denote eigenvalues and is not Wilks' lambda. The pseudo-F statistic will be mentioned in the paper; it is this that is computed in the test, and permutations give its sampling distribution under the null hypothesis, which ultimately determines the significance of terms in the model or whether a term enters or leaves the model.
See ?ordistep for more details.
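For example, a sketch with vegan's built-in varespec (species abundances) and varechem (environmental variables) data; substitute your own lake data and environmental variables:

library(vegan)
data(varespec)   # species abundances
data(varechem)   # environmental variables

# Full CCA model and permutation tests of the marginal effects
mod <- cca(varespec ~ N + P + K + Al + S, data = varechem)
anova(mod, by = "margin", permutations = 999)

# Forward selection / backward elimination via permutation tests,
# starting from the intercept-only model (cf. the CANOCO procedure)
mod0 <- cca(varespec ~ 1, data = varechem)
sel  <- ordistep(mod0, scope = formula(mod), direction = "both",
                 permutations = 199)
sel$anova   # trace of terms added/dropped, with pseudo-F and p-values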

gbm::interact.gbm vs. dismo::gbm.interactions

Background
The reference manual for the gbm package states that the interact.gbm function computes Friedman's H-statistic to assess the strength of variable interactions. The H-statistic is on the scale of [0, 1].
The reference manual for the dismo package does not reference any literature for how the gbm.interactions function detects and models interactions. Instead it gives a list of general procedures used to detect and model interactions. The dismo vignette "Boosted Regression Trees for ecological modeling" states that the dismo package extends functions in the gbm package.
Question
How does dismo::gbm.interactions actually detect and model interactions?
Why
I am asking this question because gbm.interactions in the dismo package yields results >1, which the gbm package reference manual says is not possible.
I checked the tar.gz for each of the packages to see if the source code was similar. It is different enough that I cannot determine if these two packages are using the same method to detect and model interactions.
To summarize, the difference between the two approaches boils down to how the "partial dependence function" of the two predictors is estimated.
The dismo package is based on code originally given in Elith et al. (2008); you can find the original source in the supplementary material. The paper very briefly describes the procedure. Basically, model predictions are obtained over a grid of the two predictors, with all other predictors set at their means. The model predictions are then regressed onto the grid, and the mean squared residual of this linear model is multiplied by 1000. This statistic indicates departures of the model predictions from a linear combination of the two predictors, indicating a possible interaction.
From the dismo package, we can also obtain the relevant source code for gbm.interactions. The interaction test boils down to the following commands (copied directly from source):
interaction.test.model <- lm(prediction ~ as.factor(pred.frame[,1]) + as.factor(pred.frame[,2]))
interaction.flag <- round(mean(resid(interaction.test.model)^2) * 1000,2)
pred.frame contains a grid of the two predictors in question, and prediction is the prediction from the original gbm fitted model where all but two predictors under consideration are set at their means.
This is different from Friedman's H statistic (Friedman & Popescu, 2005), which is estimated via formula (44) for any pair of predictors. It is essentially the departure from additivity for the two predictors, averaging over the values of the other variables, NOT setting the other variables at their means. It is expressed as a proportion of the total variance of the partial dependence function of the two variables (or model-implied predictions), so it will always be between 0 and 1.
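To see the two statistics side by side, here is a sketch using the Anguilla_train data shipped with dismo (the gbm.x/gbm.y indices follow the dismo BRT vignette); gbm.step() involves cross-validation with random folds, so exact values will vary:

library(gbm)
library(dismo)

data(Anguilla_train)

# Boosted regression tree fitted with dismo's cross-validated wrapper
brt <- gbm.step(data = Anguilla_train, gbm.x = 3:13, gbm.y = 2,
                family = "bernoulli", tree.complexity = 5,
                learning.rate = 0.01, bag.fraction = 0.5)

# dismo: 1000 * mean squared residual of the linear-in-the-grid model,
# so values are not bounded by 1
gbm.interactions(brt)$interactions

# gbm: Friedman's H for one pair of predictors, bounded in [0, 1]
interact.gbm(brt, data = Anguilla_train, i.var = c("SegSumT", "USNative"),
             n.trees = brt$gbm.call$best.trees)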
