How to model a formative latent variable in R using lavaan, and add control variables

I want to estimate a structural equation model for my analysis.
All three concepts are formative: the measured variables construct the latent structure rather than reflect it.
I also want to control for demographic variables.
The analysis will be done in R. I learned lavaan in a statistics class, and I'm curious whether anyone can give me more information on how to do what I'm describing.
Thanks in advance.

In order to use formative constructs you need partial least squares SEM (PLS-SEM), not covariance-based SEM (Cov-SEM).
The underlying assumptions of the two techniques differ: PLS assumes that, because the constructs are formative, there is no measurement error, while Cov-SEM has the default assumption of reflective measurement and therefore accounts for the error terms when analyzing the model.
The lavaan package is a Cov-SEM tool; you can try plspm for formative items.
Here is a guide on how to use the plspm package:
http://gastonsanchez.com/PLS_Path_Modeling_with_R.pdf
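A minimal sketch of what that can look like in plspm (all data and variable names below are hypothetical; mode "B" marks a block as formative, and a demographic control enters as its own single-indicator block):

library(plspm)

# Inner model: lower-triangular matrix; a 1 in row i, column j means
# construct j predicts construct i. Here X1, X2, and the control AGE
# all point at Y.
X1  <- c(0, 0, 0, 0)
X2  <- c(0, 0, 0, 0)
AGE <- c(0, 0, 0, 0)
Y   <- c(1, 1, 1, 0)
path_matrix <- rbind(X1, X2, AGE, Y)
colnames(path_matrix) <- rownames(path_matrix)

# Outer model: which columns of the data frame belong to each construct
blocks <- list(c("x1a", "x1b", "x1c"),
               c("x2a", "x2b"),
               "age",
               c("y1", "y2", "y3"))

# "B" = formative measurement; the single-indicator control can stay "A"
modes <- c("B", "B", "A", "B")

fit <- plspm(mydata, path_matrix, blocks, modes = modes)
summary(fit)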

Related

How can I get My.stepwise.glm to return the model outside the console?

I asked this question on RCommunity but haven't had anyone bite... so I'm here!
My current project involves predicting whether some trees will survive under future climate change scenarios. Against better judgement (like using Maxent) I've decided to pursue this with a GLM, which requires presence and absence data. Every time I generate my absence data (I was only given presence data) using randomPoints from dismo, the resulting GLM has different significant variables. I found a package called My.stepwise with a My.stepwise.glm function, which goes through a forward/backward selection process to find the best variables and produces a final model for you.
My problem is that I don't want to run My.stepwise.glm just once and use the model it prints out. I'd like to run it roughly 100 times with different pseudo-absence data, see which variables it returns, and then take the most frequent variables and build my model from those. The issue is that the My.stepwise.glm function ends with print(summary(initial.model)), and I would like to access the output the way step() returns a list, where you can say step$coefficients and get the coefficients back as numerics. Can anyone help me with this?
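One workaround sketch (untested; it assumes the print(summary(initial.model)) call quoted above is the final statement of the function body): copy the function and swap its last expression for one that returns the fitted model.

library(My.stepwise)

# Copy of My.stepwise.glm whose last expression returns the fitted
# model instead of only printing its summary. This assumes the print
# call is the body's final statement.
my_stepwise_glm <- My.stepwise.glm
body(my_stepwise_glm)[[length(body(my_stepwise_glm))]] <-
  quote(return(initial.model))

# Then loop over pseudo-absence draws and tally the selected terms,
# e.g. collect names(coef(my_stepwise_glm(...))) inside replicate().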

R - How to properly account for structural breaks in Hierarchical Bayesian VAR (BVAR)?

I am interested in using the new BVAR package in R to predict a set of endogenous time series. However, because of the COVID pandemic, my time series have been through a structural break. What is the best way to account for this in the model? Some hypotheses:
1. Add an exogenous dummy variable (the package doesn't seem to have this feature)
2. Add an endogenous dummy variable with strong priors that zero out the coefficients of impact from other variables on it (i.e., an "artificial" exogenous variable)
3. Create two separate models (before vs. after the structural break)
I have tried a mix of 2 + 3. I tested (i) a model with only recent data (after the structural break) and no dummies vs. (ii) a model with the full history and an additional endogenous dummy variable, but without the strong dummy prior (I couldn't understand how to configure it properly). Model (ii) performed much better on the test set.
I wrote an e-mail to the package's author, Nikolas Kuschnig (I couldn't find his user on SO), to which he replied:
Structural breaks are always a pain to model. In general it's probably preferable to estimate two separate models, but given the short timespan and you getting usable results, your idea with adding a dummy variable should also work. You can adjust priors from other variables by manually setting psi in bv_mn() (see the docs and the vignette for an explanation).
Depending on the variables you might also be fine not doing any of that, since COVID could just be seen as another shock (which is almost always quite the stretch, given the extent of it).
Note that if there is an actual structural break, the dummies won't suffice, since the coefficients would change (hence my preference for your option 3). To an extent you could model this with a Markov-switching VAR, but unfortunately I don't know of an accessible implementation for R.
Thank you, Nikolas
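For reference, the psi adjustment he mentions might look roughly like this (a sketch, not from the e-mail; the data objects, lag length, and the small prior mode for the dummy column are all assumptions):

library(BVAR)

# Append the dummy as an extra endogenous variable and tighten its
# prior via psi in bv_mn(). Names and values are illustrative only.
y <- cbind(my_series, covid = covid_dummy)

# bv_psi() takes one mode per variable; the crude sd()-based modes and
# the small value for the dummy column are placeholder choices
psi_modes <- c(apply(my_series, 2, sd), 0.01)

priors <- bv_priors(mn = bv_mn(psi = bv_psi(mode = psi_modes)))
fit <- bvar(y, lags = 4, priors = priors)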

How to train a multiple linear regression model to find the best combination of variables?

I want to run a linear regression model with a large number of variables, and I want an R function that iterates over good combinations of these variables and gives me the best combination.
The glmulti package will do this fairly effectively:
Automated model selection and model-averaging. Provides a wrapper for glm and other functions, automatically generating all possible models (under constraints set by the user) with the specified response and explanatory variables, and finding the best models in terms of some Information Criterion (AIC, AICc or BIC). Can handle very large numbers of candidate models. Features a Genetic Algorithm to find the best models when an exhaustive screening of the candidates is not feasible.
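A minimal sketch with the built-in swiss data (the argument choices here are illustrative):

library(glmulti)

# Exhaustive search over main-effects models, ranked by AICc;
# method = "g" switches to the genetic algorithm when an exhaustive
# screen is infeasible
res <- glmulti(Fertility ~ ., data = swiss,
               level = 1,      # main effects only, no interactions
               method = "h",   # exhaustive search
               crit = "aicc",
               plotty = FALSE, report = FALSE)

summary(res)$bestmodel   # formula of the top-ranked model
weightable(res)          # candidate models with their AICc values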
Unsolicited advice follows:
HOWEVER. Please be aware that while this approach can find the model that minimizes within-sample error (the goodness of fit on your actual data), it has two major problems that should make you think twice about using it.
1. This type of data-driven model selection will almost always destroy your ability to make reliable inferences (compute p-values, confidence intervals, etc.). See this CrossValidated question.
2. It may overfit your data (although using the information criteria listed in the package description will help with this).
There are a number of different ways to characterize a "best" model, but AIC is a common one, and base R offers step(), and package MASS offers stepAIC().
# fit the full model with all predictors of Fertility in the swiss data
summary(lm1 <- lm(Fertility ~ ., data = swiss))
# backward stepwise selection by AIC
slm1 <- step(lm1)
summary(slm1)
# record of which terms were dropped at each step
slm1$anova

How to use the `vcconv` command in lme4 for serial correlation?

I'm working with a large longitudinal dataset of firm-year observations. For some time now I have been using lme4 to implement crossed (non-nested) effects for year and firm-ID groups.
My goal is now to correct for the serial correlation in the firm-group dimension. Based on chl's and fabians' answers to this question (as well as Ben Bolker's comment on the latter), I've assumed this is impossible with lmer(), but is feasible with nlme::lme().
I have been able to implement crossed effects in nlme based on the discussion in Pinheiro & Bates (2000, sec. 4.2.2, pp. 163-6). In principle, then, I believe I can use the correlation = corAR1() specification in lme() to control for autocorrelation.
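(For reference, a minimal single-grouping sketch of that specification, with placeholder names and without the crossed-effects machinery:)

library(nlme)

# AR(1) residual correlation within firm, ordered by year;
# y, x, firm, year, and df are placeholders
fit <- lme(y ~ x,
           random = ~ 1 | firm,
           correlation = corAR1(form = ~ year | firm),
           data = df)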
My strong preference, however, would be to implement such a correlation specification in lmer() because:
lme4 is much, much (much) faster
nlme requires crossed effects to be nested in some higher group -- without such a higher level grouping I'm forced to create an arbitrary dummy for groupedData to which all observations belong (e.g., here). This creates issues interpreting the relative levels of variation between the two crossed groups and the residual variance because some of the variation appears to be captured by the higher-level dummy group.
I got excited when I found the feature request #224 on GitHub, but alas it doesn't seem like there's much movement on the flexLambda front (please let me know if I'm wrong!).
lme4 v1.1-10
I've just noticed that the latest (Oct. 2015) version of lme4 contains a vcconv command that can
Convert between representation of (co-)variance structures (EXPERIMENTAL.)
Based on the source code, it seems that maybe the sdcor2cov option could allow one to specify a correlation structure such as AR(1).
So my questions are:
Is this interpretation of the vcconv function correct?
If so, does the user supply the correlation (e.g., AR(1)) parameters or are they determined internally in lmer()?
How does one implement this function properly?

PLM in R with time invariant variable

I am trying to analyze panel data that includes observations for each US state collected across 45 years.
I have two predictor variables that vary across time (A,B) and one that does not vary (C). I am especially interested in knowing the effect of C on the dependent variable Y, while controlling for A and B, and for the differences across states and time.
This is the model that I have, using plm package in R.
random <- plm(Y ~ log1p(A) + B + C, index = c("state", "year"), model = "random", data = data)
My reasoning is that with a time-invariant variable I should be using a random rather than a fixed effects model, since the within (fixed effects) transformation would sweep out C entirely.
My question is: Is my model and thinking correct?
Thank you for your help in advance.
You base your decision between fixed and random effects solely on computational grounds. Please look into the specific assumptions associated with the different models. The Hausman test is often used to discriminate between the fixed and the random effects model, but it should not be taken as the definitive answer (any good textbook will have further details).
Pooled OLS could also yield a good model, if its assumptions apply. Computationally, pooled OLS will likewise give you estimates for time-invariant variables.
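A quick sketch of that comparison (following the formula from the question; the data and columns are assumed to exist):

library(plm)

# fixed (within) vs. random effects; note the within model drops the
# time-invariant C
fixed  <- plm(Y ~ log1p(A) + B + C, index = c("state", "year"),
              model = "within", data = data)
random <- plm(Y ~ log1p(A) + B + C, index = c("state", "year"),
              model = "random", data = data)
phtest(fixed, random)   # Hausman test on the common coefficients

# pooled OLS for comparison; it, too, estimates the time-invariant C
pooled <- plm(Y ~ log1p(A) + B + C, index = c("state", "year"),
              model = "pooling", data = data)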
