Trouble converging a bifactor model using lavaan in R

The title basically explains it: I'm trying to build a bifactor model with psychopathy as the general factor and its subtypes as the group factors. I believe I have everything constrained properly, but that might be the issue.
Current code:
BifactorModel <- 'psychopathyBi =~ YPIS_1 + YPIS_2 + YPIS_3 + YPIS_4 + YPIS_5 + YPIS_6 + YPIS_7 + YPIS_8 + YPIS_9 + YPIS_10 + YPIS_11 + YPIS_12 + YPIS_13 + YPIS_14 + YPIS_15 + YPIS_16 + YPIS_17 + YPIS_18
GMbi =~ YPIS_4 + YPIS_5 + YPIS_8 + YPIS_9 + YPIS_14 + YPIS_16
CUbi =~ YPIS_3 + YPIS_6 + YPIS_10 + YPIS_15 + YPIS_17 + YPIS_18
DIbi =~ YPIS_1 + YPIS_2 + YPIS_7 + YPIS_11 + YPIS_12 + YPIS_13
psychopathyBi ~~ 0*GMbi
psychopathyBi ~~ 0*CUbi
psychopathyBi ~~ 0*DIbi
GMbi ~~ 0*CUbi
GMbi ~~ 0*DIbi
CUbi ~~ 0*DIbi
'
#fit bifactor model
bifactorFit <- cfa(BifactorModel, data = YPIS_Data)
#get summary of bifactor model
summary(bifactorFit, fit.measures = TRUE, standardized = TRUE)
This produces the following:
lavaan 0.6-9 did NOT end normally after 862 iterations
This is what the model should ultimately look like once converged (diagram not reproduced here).
Any suggestions or comments would be greatly appreciated. Thanks in advance.

The variances of several of your latent variables are very small. For example, the variance of DIbi appears to be effectively zero. That's the source of the issue here.
There are a few things you can try to remedy this.
First, it may work better to identify the model by fixing the latent variable variances to 1, rather than fixing the first indicator factor loadings to 1. Do this by specifying std.lv = TRUE.
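A minimal sketch of that change, reusing the model string and data from the question:

#fit bifactor model, identifying it by fixing latent variances to 1
bifactorFit <- cfa(BifactorModel, data = YPIS_Data, std.lv = TRUE)
summary(bifactorFit, fit.measures = TRUE, standardized = TRUE)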
Even then, it will likely be the case that the loadings on one or more of the group factors are very small. This indicates that there really isn't a group factor in your data for those items that is distinct from the general factor. You should consider estimating a model that drops that group factor (as well as comparing with models dropping the other group factors one at a time). We discuss this issue some here: https://psyarxiv.com/q356f/
Additionally, you should constrain item loadings so that they are in the theoretically expected direction (e.g., all positive with a lower bound of 0). It is common for bifactor models to overextract variance in items and produce uninterpretable group factors that have a mix of positive and negative loadings. This can also cause convergence issues.
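In lavaan, one way to impose such a bound is to label the loadings and add inequality constraints inside the model string; a sketch for one group factor (the labels gm1-gm6 are hypothetical, not from the original model):

GMbi =~ gm1*YPIS_4 + gm2*YPIS_5 + gm3*YPIS_8 + gm4*YPIS_9 + gm5*YPIS_14 + gm6*YPIS_16
gm1 > 0; gm2 > 0; gm3 > 0; gm4 > 0; gm5 > 0; gm6 > 0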
In general, this sort of unconstrained bifactor model tends to be overly flexible and tends to overfit to a similar degree as exploratory factor analysis. You should be sure to evaluate the bifactor model based not only on global model fit statistics, but also on whether the factor loadings actually resemble a true bifactor model--do the items each show substantial loadings on both the general factor and their group factor in the expected directions, or do items tend to load on only one or the other? See some examples in the paper linked above about this issue.
Another option would be to switch to exploratory bifactor modeling. This is implemented in R in the fungible package as the fungible::BiFAD() function. This approach is discussed here:
https://www.sciencedirect.com/science/article/pii/S0001879120300555
Exploratory bifactor models are useful because they rely on targeted EFA rotation to estimate loadings. This makes convergence much more likely and can help to diagnose when a group factor is too weak to identify in the data.
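A minimal sketch of what a BiFAD() call might look like for these data (the exact arguments and names of the returned components should be checked against ?BiFAD in your version of the package):

library(fungible)
#correlation matrix of the 18 YPIS items
R <- cor(YPIS_Data[, paste0("YPIS_", 1:18)], use = "pairwise.complete.obs")
out <- BiFAD(R = R, numFactors = 3) #3 group factors, as in the CFA above
out$BstarFR #general + group factor loadings (component name per the fungible docs)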

Linear mixed models with missing cells

I am helping another researcher with their coding in R. I did not work with them during the planning of the experimental design, and now I could really use some help with this tricky design. I have four fixed factors: FactorA, FactorB, FactorC, and FactorD. The experiment is not a fully factorial design: there are missing cells (combinations of factors that are not available) in addition to unbalanced numbers of samples. For the combinations FactorA:FactorB, FactorA:FactorC, and FactorB:FactorC, I have the proper number of cells (treatment combinations). I also have a random factor, Block, which is nested within FactorD. In my field it is common for people (even in high-impact journals) to just run separate ANOVAs for each factor to avoid dealing with this type of problem, but I wonder if I could write a model that comprises all those factors.
Please, could I use something like this?
lmerTest::lmer(Response ~ FactorA + FactorB + FactorC + FactorD +
                 FactorA:FactorB + FactorA:FactorC + FactorB:FactorC +
                 (1|FactorD/Block), data = indexes)
I appreciate any suggestions you may have!
Assuming that what you're missing from the design are some combinations of factor D with the other factors, this is close.
You can express this a little more compactly as
Response ~ (FactorA + FactorB + FactorC)^2 + FactorD + (1|FactorD:Block)
You shouldn't use (1|FactorD/Block), because that will expand to (1|FactorD) + (1|FactorD:Block) and give you a redundant term (FactorD would be specified as both a fixed and a random effect).
Unbalanced numbers of observations don't matter, as long as no factor combination is completely missing, i.e. every combination has at least one observation.
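Putting that together, a minimal sketch of the suggested call (using lmerTest as in the question, with indexes as the data frame):

library(lmerTest)
fit <- lmer(Response ~ (FactorA + FactorB + FactorC)^2 + FactorD +
              (1|FactorD:Block), data = indexes)
summary(fit)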

Latent class growth modelling in R/flexmix with multinomial outcome variable

How to run Latent Class Growth Modelling (LCGM) with a multinomial response variable in R (using the flexmix package)?
And how to stratify each class by a binary/categorical dependent variable?
The idea is to let gender shape the growth curve by cluster (cf. Mikolai and Lyons-Amos (2017, p. 194/3), where the stratification is done by education; they used Mplus).
I think I might have come close with the following syntax:
lcgm_formula <- as.formula(rel_stat ~ age + I(age^2) + gender + gender:age)
lcgm <- flexmix::stepFlexmix(. ~ . | id,
                             data = d,
                             k = nr_of_classes, # would be 1:12 in real analysis
                             nrep = 1, # would be 50 in real analysis to avoid local maxima
                             control = list(iter.max = 500, minprior = 0),
                             model = flexmix::FLXMRmultinom(lcgm_formula, varFix = TRUE, fixed = ~0))
which is close to what Wardenaar (2020, p. 10) suggests in his methodological paper for a continuous outcome:
stepFlexmix(. ~ . | ID, k = 1:4, nrep = 50,
            model = FLXMRglmfix(y ~ time, varFix = TRUE),
            data = mydata, control = list(iter.max = 500, minprior = 0))
The only difference is that FLXMRmultinom probably does not support the varFix and fixed parameters, although adding them does produce different results. The binomial equivalent of FLXMRmultinom in flexmix might be FLXMRglm (with family = "binomial") as opposed to FLXMRglmfix, so I suspect that the restrictions of the LCGM (e.g., fixed slope and intercept per class) are not specified the way they should be.
The results are otherwise sensible, but the model fails to put men and women with similar trajectories in the same classes (the plot of fitted probabilities for each relationship status in each class, by gender, is not reproduced here).
We should have the following matches by cluster and gender...
1<->1
2<->2
3<->3
...but instead we have
1<->3
2<->1
3<->2
That is, if, for example, the men in class one and the women in class three were forced into the same group, the resulting group would be more homogeneous than the current first row of the plot grid.
Here is the full MVE to reproduce the code.
I got similar results with another dataset with a different number of classes and up to 50 iterations per class. I have tried two alternative ways to predict the probabilities, with identical results. I conclude that the problem is most likely in the model specification (stepFlexmix(..., model = FLXMRmultinom(...))), or that this is some sort of label-switching issue.
If the model is specified correctly and the issue is that similar trajectories for men/women end up in different classes, is there a way to fix that, for example by restricting the parameters?
Any assistance will be highly appreciated.
This seems to be an identifiability issue that is apparently common in mixture modelling. In other words, the labels are switched: while there might not be a problem with the modelling as such, men and women end up in different groups, and that has to be dealt with one way or another.
In the newly linked code, I have swapped the order manually and calculated the predictions by hand.
I will be happy to hear if someone has an alternative approach to deal with the label-switching issue (like restricting parameters or switching labels algorithmically). I am also curious whether the model could or should be specified in some other way.
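One algorithmic option (a sketch under stated assumptions, not from the original post): if you extract the fitted class-specific trajectories per gender into two matrices with one row per class, you can pick the relabeling of one gender's classes that minimizes the distance to the other's. The names traj_men and traj_women are hypothetical placeholders.

library(combinat) # for permn()
relabel_classes <- function(traj_men, traj_women) {
  k <- nrow(traj_men)
  perms <- permn(seq_len(k)) # all k! permutations of the class labels
  # total squared distance between trajectories under each relabeling
  cost <- sapply(perms, function(p) sum((traj_men - traj_women[p, ])^2))
  perms[[which.min(cost)]] # best relabeling of the second gender's classes
}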
A few remarks:
I believe that this is indeed performing an LCGM, as we do not specify random effects for the slopes or intercepts. Therefore I assume that intercepts and slopes are fixed within classes for both sexes, which would mean that the model performs LCGM as intended. By the same token, it seems that running a GMM with a random intercept, a random slope, or both is not possible.
Since we are calculating the predictions by hand, we need to be able to separate the parameters between the sexes. Therefore I also added an interaction term gender:I(age^2). The calculations seem to slow down somewhat, but the estimates are similar to the original. It also makes conceptual sense to include the interaction for age^2 if we already have it for age.
varFix = TRUE and fixed = ~0 seem to be redundant: specifying them does not change anything. The subsampling procedure (of my real data) was unaffected by the set.seed() command, for some reason.
The new model specification becomes:
lcgm_formula <- as.formula(rel_stat ~ age + I(age^2) + gender + age:gender + I(age^2):gender)
lcgm <- flexmix::flexmix(. ~ . | id,
                         data = d,
                         k = nr_of_classes, # would be 1:12 in real analysis
                         # nrep = 1, # would be 50 in real analysis to avoid local maxima
                         #           # (and we would use stepFlexmix instead)
                         control = list(iter.max = 500, minprior = 0),
                         model = flexmix::FLXMRmultinom(lcgm_formula))
And the plots (not reproduced here).

Having issues in transforming my data for further analysis In R

I have a dataset (not reproduced here).
I want to perform linear and multiple regression. MoralRelationship and SkeletalP are both dependent variables, while the others are independent. I tried all the various methods of transformation I know, but none yielded any meaningful result in my diagnostic plots.
I did this:
lm1 <- lm(MoralRelationship ~ RThumb + RTindex + RTmid + RTFourth + RTFifth +
            Lthumb + Lindex + LTMid + LTFourth + LTfifth + BldGRP1 + BlDGR2,
          data = data)
I did the same for SkeletalP.
I made diagnostic plots for both, then tried to normalize the variables because there was neither correlation nor linearity. I tried square terms, log, and square-root transformations of all the independent variables, as well as 1/x, but got no better output.
I also did
`lm(SkeletalP ~ RThumb + I(RThumb^2), data=data)`
to see if I would get a better result with a single variable.
The independent variables are right skewed except for ANB which is normally distributed.
Is there a method I can use to transform my data (most importantly, to make it uniformly distributed) so that I can perform other statistical tests?
Your dataset is kind of small. You could try dimensionality reduction like PCA, but I don't think it's appropriate here, and it's also harder to interpret.
Have you tried other models? Regularization might help the fit of your regression models (e.g., Lasso/Ridge, i.e. L1/L2 penalties).
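A minimal sketch of what a penalized fit could look like with glmnet (assuming the data frame is called data, as in the question, and predicting MoralRelationship):

library(glmnet)
# design matrix of all predictors except the two dependent variables
X <- model.matrix(MoralRelationship ~ . - SkeletalP, data = data)[, -1]
y <- data$MoralRelationship
fit <- cv.glmnet(X, y, alpha = 1) # alpha = 1 is lasso (L1), alpha = 0 is ridge (L2)
coef(fit, s = "lambda.min") # coefficients at the CV-chosen penalty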

Understanding the output of my Ramsey RESET test

I am new to R and doing a replication study where I need to check whether their regression satisfies the classical OLS assumptions. For the specification assumption, I am running the Ramsey RESET test. Here is my code:
simple_model <- lm(deploy ~ loggdppc + natoyears + milspend + lagdeploy +
                     logland + logcoast + lag3terror + logmindist,
                   data = natopanel)
resettest(simple_model, power = 2, type = "regressor")
Here is my output:
RESET = 2.0719, df1 = 6, df2 = 355, p-value = 0.05586
Since the p-value is (albeit slightly) above 0.05, does this mean that the model 'passes' the Ramsey test? Or is there an issue of omitted variables? I still have not quite gotten the hang of these interpretations. This model does not include all of their variables, as they are testing a specific hypothesis.
Thank you for your help!
According to Wikipedia:
"[The intuition of the Ramsey RESET] test is that if non-linear combinations of the explanatory variables have any power in explaining the response variable, the model is misspecified in the sense that the data generating process might be better approximated by a polynomial or another non-linear functional form."
It tests whether including higher-degree polynomials of your explanatory variables (in your example 2nd degree, because of power = 2) has any additional explanatory power. In essence, you test whether the 2nd-degree terms of your regressors are jointly significantly different from zero.
Suppose you use 5% as your cut-off for significance. In that case, you (barely) fail to reject the null hypothesis that the 2nd-degree terms have no additional explanatory power. In other words, the test finds no significant evidence (at the 5% level) that your model is misspecified.
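If it helps intuition, here is a sketch of the joint test that the RESET statistic corresponds to, written out by hand (variable names follow the question; note that some squared terms may be dropped if they are collinear, which is presumably why df1 = 6 rather than 8 in your output):

base <- lm(deploy ~ loggdppc + natoyears + milspend + lagdeploy +
             logland + logcoast + lag3terror + logmindist, data = natopanel)
augmented <- update(base, . ~ . + I(loggdppc^2) + I(natoyears^2) +
                      I(milspend^2) + I(lagdeploy^2) + I(logland^2) +
                      I(logcoast^2) + I(lag3terror^2) + I(logmindist^2))
anova(base, augmented) # joint F-test on the squared terms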

Mixed model with large sample size

I am currently fitting a linear mixed model (using the lme function in R), and I have some problems.
My dataset is about damage caused by brown bears in Slovenia. Slovenia was divided into 1x1 km grid cells, and for each cell I have the number of damage events per year (for 12 consecutive years). This frequency of damage will be my Y variable in the model, and I will test different environmental variables to explain the occurrence of damage (e.g., distance to the forest edge, forest cover, etc.).
I put year as a random factor (verified with a likelihood ratio test).
My sample size is big (250,000 cell values) and mainly 0 (only 4,000 cases were positive, ranging from 1 to 17 damage events in one cell in a year).
Here is my problem. Following Zuur's (2009) methods, I am trying to find the optimal fixed structure for my model. My first model has all the variables, plus some interactions (see below). I'm using a logit link.
f1 <- formula(dam ~ masting + dens*pop_size_index + saturation + exposition +
                settlements + orchards + crops + meadows + mixed_for + dist_for_out +
                dist_for_out_a + dist_for_in + dist_for_in_a + for_edge + prop_broadleaves +
                prop_broadleaves_a + dist_road + dist_village + feed_stat + sup_food +
                masting*prop_broadleaves)
M1.lme <- lme(f1, random = ~1|year, method = "REML", data = d)
But, looking at the likelihood ratio tests, I cannot remove ANY variable: all are significant. However, the model is still very bad (too many variables in it, and the residuals do not look good), and I can definitely not stop there.
So how can I find a better model (i.e., get rid of non-significant variables)?
Is this due to my large sample size?
Could this zero inflation possibly be a problem?
I could not find another way of improving my model that would take this into account.
Any suggestions?
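(Not from the original thread, just a hedged sketch of one direction: with counts that are mostly zero, a zero-inflated count GLMM may be a better starting point than a Gaussian lme with a "logit link". glmmTMB is one package that fits such models; the covariates below are a subset of those in f1, trimmed purely for brevity.)

library(glmmTMB)
M1.zi <- glmmTMB(dam ~ masting*prop_broadleaves + dens*pop_size_index +
                   dist_for_in + for_edge + # ...other covariates from f1
                   (1 | year),
                 ziformula = ~1, # constant zero-inflation probability
                 family = poisson,
                 data = d)
summary(M1.zi)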
