lavaan - problems with WLSMV estimator - R

I am fitting a structural equation model (SEM) in lavaan that includes 5 latent variables. I also have 5 regression equations (Y ~ ...) in which the outcomes are manifest variables and the regressors are a mix of latent variables and indicators.
When I use maximum likelihood estimation the model runs without problems, but when I switch to WLSMV estimation (adding the argument estimator = "WLSMV") I run into two problems. The first is that execution becomes extremely slow, taking several hours to fit a single model. Any idea why this is happening and whether there is a way to fix it?
The second problem is that when I try to fit multigroup SEMs and start constraining the model I get the following warning:
lavaan WARNING: the optimizer (NLMINB) claimed the model converged,
but not all elements of the gradient are (near) zero;
the optimizer may not have found a local solution
use check.gradient = FALSE to skip this check.
Any idea what this means? What are the implications? Is this a problem? How do I fix it? Should I simply stay with maximum likelihood?
IMPORTANT: when I remove the regressions and keep only the measurement part (the five latent variables), the function executes quickly and I no longer get the warning message. Does this mean that WLSMV should not be used once the CFA becomes a full SEM?
Thanks in advance!

You have a big model for a small sample, I bet, and a particularly small one for the DWLS estimator with a mean- and variance-adjusted (MV) chi-squared test statistic, i.e., WLSMV. You can try to simplify your model, increase your sample, or use a different estimator, like "MLR", maximum likelihood estimation with robust (Huber–White) standard errors.
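If it helps, switching estimators in lavaan is a one-argument change. A minimal sketch, where the model syntax, data frame, and indicator names are placeholders for yours:

library(lavaan)
# 'my_model' and 'mydata' stand in for your model syntax and data
fit_wlsmv <- sem(model = my_model, data = mydata, estimator = "WLSMV",
                 ordered = c("item1", "item2"))   # hypothetical ordinal indicators
fit_mlr   <- sem(model = my_model, data = mydata, estimator = "MLR")
summary(fit_mlr, fit.measures = TRUE, standardized = TRUE)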
I suggest that you check the chapter by Finney, DiStefano, and Kopp (2016).
Finney, S. J., DiStefano, C., & Kopp, J. P. (2016). Overview of estimation methods and preconditions for their application with structural equation modeling. In K. Schweizer & C. DiStefano (Eds.), Principles and methods of test construction: Standards and recent advances (pp. 135-165). Boston, MA: Hogrefe Publishing. https://doi.org/10.1027/00449-000

How should you use scaled weights with the svydesign() function in the survey package in R?

I am using the survey package in R to analyse the "Understanding Society" social survey. The main user guide for the survey specifies (on page 45) that the weights have been scaled to have a mean of 1. When using the svydesign() function, I am passing the weight variable to the weights argument.
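For reference, I am setting the design up roughly along these lines (the variable names below are placeholders, not the survey's actual column names):

library(survey)
# hypothetical names for the PSU, strata, and scaled-weight variables
des <- svydesign(ids = ~psu, strata = ~strata, weights = ~scaled_weight,
                 data = usoc, nest = TRUE)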
In the survey package documentation, under the surveysummary() function, it states:
Note that the design effect will be incorrect if the weights have been rescaled so that they are not reciprocals of sampling probabilities.
Will I therefore get incorrect estimates and/or standard errors when using functions such as svyglm() etc?
This came to my attention because, when using the psrsq() function to get the Pseudo R-Squared of a model, I received the following warning:
Weights appear to be scaled: rsquared may be wrong
Any help would be greatly appreciated! Thanks!
No, you don't need to worry
The warning is only about design-effect estimation, and only about without-replacement design effects (DEFF rather than DEFT). Most people don't need design effects at all; they just need estimates and standard errors, and those are fine. There is no problem.
If you want to estimate the design effects, R needs to estimate the standard errors (which is fine) and also estimate what the standard errors would be under simple random sampling without replacement, with the same sample size. That second part is the problem: working out the variance under SRSWoR requires knowing the population size. If you have scaled the weights, R can no longer work out the population size.
If you do need design effects (e.g., to do a power calculation for another survey), you can still get the DEFT design effects that compare to simple random sampling with replacement. It's only if you want design effects compared to simple random sampling without replacement that you need to worry about the scaling of weights. Very few people are in that situation.
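For example, something along these lines should work (the design object des and the variable name are placeholders):

# with-replacement design effects, which do not require knowing the population size
svymean(~some_variable, des, deff = "replace")
# deff = TRUE would request the without-replacement version, which is the
# case affected by rescaled weights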
As a final note, surveysummary isn't a function; it's a help page.

My understanding of how cv.glmnet works to choose the optimal lambda

I wish to confirm my understanding of the cross-validation procedure in the glmnet package so that I can explain it to a reviewer of my paper. I would be grateful if someone could add information to clarify the answer further.
Specifically, I had a binary classification problem with 29 input variables and 106 rows. Instead of splitting into training/test sets (and further shrinking the training data), I went with the lasso, choosing lambda through cross-validation as a means to minimise overfitting. After training the model with cv.glmnet, I tested its classification accuracy on the same dataset (bootstrapped 10,000 times for error intervals). I acknowledge that overfitting cannot be eliminated in this setting, but the lasso, with its penalty term chosen by cross-validation, should lessen its effect.
My explanation to the reviewer (who is a doctor like me) of how cv.glmnet does this is:
In each step of 10-fold cross-validation, data were divided randomly into two groups containing 9/10th of the data for training and 1/10th for internal validation (i.e., measuring binomial deviance/error of the model developed with that lambda). Lambda vs. deviance was plotted. When the process was repeated 9 more times, 95% confidence intervals of lambda vs. deviance were derived. The final lambda value to go into the model was the one that gave the best compromise between high lambda and low deviance. High lambda is the factor that minimises overfitting because the regression model is not allowed to improve by assigning large coefficients to the variables. The model is then trained on the entire dataset using an approximation that minimises model error penalized by the lambda term. Because the lambda term is chosen through cross-validation (and not from the entire dataset), the choice of lambda is somewhat independent of the data.
I suspect my explanation can be much improved, or that the experts reading this will point out flaws in the methodology.
Thanks in advance.
A bit late I guess, but here goes.
By default, coef() and predict() on a cv.glmnet fit use lambda.1se: the largest λ at which the cross-validated error is within one standard error of the minimal error. Along the lines of overfitting, this usually reduces overfitting by selecting a simpler model (fewer non-zero terms) whose error is still close to that of the model with the least error. You can also check out this post. I am not sure whether this is what you mean by "The final lambda value to go into the model was the one that gave the best compromise between high lambda and low deviance."
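To make that concrete, here is a minimal sketch of how the two lambda values can be inspected (x and y stand in for your predictor matrix and binary outcome):

library(glmnet)
set.seed(123)
cvfit <- cv.glmnet(x, y, family = "binomial", nfolds = 10)  # binomial deviance by default
plot(cvfit)               # CV deviance vs log(lambda), with one-SE error bars
cvfit$lambda.min          # lambda with the lowest cross-validated deviance
cvfit$lambda.1se          # largest lambda within one SE of that minimum
coef(cvfit, s = "lambda.1se")   # the coefficients coef()/predict() use by default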
The main issue with your approach is that you calculate its accuracy on the same data used for training. This does not tell you how well the model will perform on unseen data, and bootstrapping does not fix the resulting bias in the accuracy estimate. For an estimate of the error, you should actually use the error from the cross-validation. If your model does not work well when trained on 90% of the data, I don't see how training on all of it will help.

Can I trust a full glmer model that converges ONLY with bobyqa and with contrast sum coding?

I am using R 3.2.0 with lme4 version 1.1-8 to run a mixed-effects logistic regression model on some binomial data (coded as 0 and 1) from a psycholinguistic experiment. There are 2 categorical predictors (one with 2 levels and one with 3 levels) and two random terms (participants and items). I am using sum coding for the predictors (i.e., contr.sum), which gives me the effects and interactions that I am interested in.
I find that the full model (with fixed effects and interactions, plus random intercepts AND slopes for the two random terms) converges ONLY when I specify (optimizer="bobyqa"). If I do not specify the optimizer, the model converges only after simplifying the model drastically. The same thing happens when I use the default treatment coding, even when I specify optimizer="bobyqa".
My first question is why is this happening and can I trust the output of the full model?
My second question is whether this might be due to the fact that my data is not fully balanced, in the sense that my conditions do not have exactly the same number of observations. Are there special precautions one must take when the data is not fully balanced? Can anyone suggest any reading on this particular case?
Many thanks
You should take a look at the ?convergence help page of more recent versions of lme4 (or you can read it here). If the two fits using different optimizers give similar estimated parameters (despite one giving convergence warnings and the other not), and the fits with different contrasts give the same log-likelihood, then you probably have a reasonable fit.
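A minimal sketch of that comparison, assuming a formula and data set along the lines you describe (all names here are placeholders for yours):

library(lme4)
m_bobyqa <- glmer(accuracy ~ A * B + (1 + A | participant) + (1 + A | item),
                  data = dat, family = binomial,
                  control = glmerControl(optimizer = "bobyqa"))
m_nelder <- update(m_bobyqa, control = glmerControl(optimizer = "Nelder_Mead"))
cbind(bobyqa = fixef(m_bobyqa), nelder_mead = fixef(m_nelder))  # compare fixed effects
c(logLik(m_bobyqa), logLik(m_nelder))                           # compare log-likelihoods
# newer lme4 versions also provide allFit(), which refits with every available optimizer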
In general, lack of balance lowers statistical power and makes fitting more difficult, but mildly to moderately unbalanced data should present no particular problems.

random forest gets worse as number of trees increases

I am running into difficulties when using randomForest (in R) for a classification problem. My R code, an image, and data are here:
http://www.psy.plymouth.ac.uk/research/Wsimpson/data.zip
The observer is presented with either a faint image (contrast=con) buried in noise or just noise on each trial. He rates his confidence (rating) that the face is present. I have categorised rating to be a yes/no judgement (y). The face is either inverted (invert=1) or not in each block of 100 trials (one file). I use the contrast (1st column of predictor matrix x) and the pixels (the rest of the columns) to predict y.
It is critical to my application that I have an "importance image" at the end which shows how much each pixel contributes to the decision y. I have 1000 trials (length of y) and 4248 pixels+contrast=4249 predictors (ncols of x). Using glmnet (logistic ridge regression) on this problem works fine
fit<-cv.glmnet(x,y,family="binomial",alpha=0)
However randomForest does not work at all,
fit <- randomForest(x=x, y=y, ntree=100)
and it gets worse as the number of trees increases. For invert=1, the classification error for randomForest is 34.3%, and for glmnet it is 8.9%.
Please let me know what I am doing wrong with randomForest, and how to fix it.
Ridge regression's only parameter, lambda, is chosen via internal cross-validation in cv.glmnet, as pointed out by Hong Ooi, and the error rate you get out of cv.glmnet relates to that. randomForest gives you the OOB error, which is akin to the error on a dedicated test set (which is what you are interested in).
randomForest requires you to calibrate it manually (i.e., have a dedicated validation set to see which parameters work best), and there are a few to consider: the depth of the trees (via fixing the number of examples in each node or the number of nodes), the number of randomly chosen attributes considered at each split (mtry), and the number of trees. You can use tuneRF to find the optimal value of mtry.
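For example, a minimal sketch of tuneRF, with x and y as in your code (y must be a factor for classification):

library(randomForest)
set.seed(1)
tuned <- tuneRF(x, factor(y), ntreeTry = 500, stepFactor = 1.5, improve = 0.01,
                doBest = TRUE)   # doBest = TRUE returns a forest fit at the best mtry
print(tuned)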
When evaluated on the training set, the more trees you add the better your predictions get. However, you will see that predictive ability on a test set stops improving, and may even diminish, after a certain number of trees are grown -- this is due to overfitting. randomForest reports OOB error estimates (and test-set error, if you provide a test set) as trees are added. If rf.mod is your fitted RF model, then plot(rf.mod) will let you see roughly at which point the error stops improving. Note that predict on a fitted randomForest uses all of the grown trees, so if you want a smaller forest you need to refit with a smaller ntree.
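Roughly, something like this (again with y as a factor) lets you see the error curve and extract per-predictor importances, which you could reshape into your importance image:

rf.mod <- randomForest(x = x, y = factor(y), ntree = 500, importance = TRUE)
plot(rf.mod)                         # OOB error as trees are added
imp <- importance(rf.mod, type = 1)  # mean decrease in accuracy per predictor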
In short, you are not comparing the two models' performances correctly (as pointed out by Hong Ooi), and your parameters might also be off and/or you might be overfitting (although the latter is unlikely with just 100 trees).

Why is it inadvisable to get statistical summary information for regression coefficients from glmnet model?

I have a regression model with a binary outcome. I fitted the model with glmnet and got the selected variables and their coefficients.
Since glmnet doesn't calculate variable importance, I would like to feed the exact output (selected variables and their coefficients) to glm to get the information (standard errors, etc.).
I searched the R documentation, and it seems I can use the "method" argument of glm to specify a user-defined function.
But I have failed to do so; could someone help me with this?
"It is a very natural question to ask for standard errors of regression
coefficients or other estimated quantities. In principle such standard
errors can easily be calculated, e.g. using the bootstrap.
Still, this package deliberately does not provide them. The reason for
this is that standard errors are not very meaningful for strongly
biased estimates such as arise from penalized estimation methods.
Penalized estimation is a procedure that reduces the variance of
estimators by introducing substantial bias. The bias of each estimator
is therefore a major component of its mean squared error, whereas its
variance may contribute only a small part.
Unfortunately, in most applications of penalized regression it is
impossible to obtain a sufficiently precise estimate of the bias. Any
bootstrap-based calculations can only give an assessment of the
variance of the estimates. Reliable estimates of the bias are only
available if reliable unbiased estimates are available, which is
typically not the case in situations in which penalized estimates are
used.
Reporting a standard error of a penalized estimate therefore tells
only part of the story. It can give a mistaken impression of great
precision, completely ignoring the inaccuracy caused by the bias. It
is certainly a mistake to make confidence statements that are only
based on an assessment of the variance of the estimates, such as
bootstrap-based confidence intervals do."
Jelle Goeman, Ph.D., Leiden University, author of the penalized package in R.
There are CRAN packages hdi and selectiveInference that provide inference for high-dimensional models; you might want to take a look at those...
I've also seen people just run a glm using the predictors selected by glmnet, but this doesn't take into account the uncertainty produced by the selection process of the best model itself...
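For illustration only, here is a minimal sketch of the kind of bootstrap the quote refers to; as it stresses, the resulting intervals reflect only the variance and ignore the bias introduced by the penalty (x and y are placeholders for your predictor matrix and binary outcome):

library(glmnet)
set.seed(1)
n <- nrow(x)
B <- 200   # number of bootstrap resamples
boot_coefs <- replicate(B, {
  idx <- sample(n, replace = TRUE)
  fit_b <- cv.glmnet(x[idx, ], y[idx], family = "binomial", alpha = 1)
  as.numeric(coef(fit_b, s = "lambda.1se"))
})
boot_se <- apply(boot_coefs, 1, sd)   # variance-only "standard errors", one per coefficient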
