Regression from error term to dependent variable (lavaan) - r

I want to test a structural equation model (SEM). There are 3 indicators, IV1 to IV3, that make up a latent construct LC (called LV in the code below). This construct should explain a dependent variable DV.
Now, assume that unique variance of the indicators will contribute additional explanation to the DV. Something like this:
IV1 ↖
IV2 ← LC → DV
IV3 ↙      ↑
 ↑         │
e3 ────────┘
In lavaan the error terms/residuals of IV3, e3, are usually not written:
model = '
# latent variables
LV =~ IV1 + IV2 + IV3
# regression
DV ~ LV
'
Further, the residual of IV3 must be split into a component that contributes to explaining DV and a residual of the residual.
I do not want to explain DV directly by IV3, because my goal is to show how much unique explanation IV3 can contribute to DV. I want to maximize the path IV3 → LC → DV, and then put the residual into IV3 → DV.
Question:
How do I put this down in a SEM?
Bonus question:
Does it make sense from a SEM perspective that each of the IVs has such a path to DV?
Side note:
What I already did was to compute this traditionally, in a series of steps. I:
Computed a counterpart to LV, the average of IV1 to IV3
Ran 3 regressions IVx → LC
Ran a multiple regression of the IVx residuals on DV.
Removing the common variance seems to make one of the residuals redundant, so the regression model cannot estimate a coefficient for each residual and drops the last one.
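The dropped residual can be reproduced with simulated data. The sketch below is only an illustration: the data-generating values and all object names are assumptions, not the asker's data. Because the three per-indicator regressions share the same predictor (the row mean), their residuals sum to zero in every row, so the third residual is an exact linear combination of the other two and lm() cannot estimate its coefficient.

```r
set.seed(1)
n  <- 200
lc <- rnorm(n)                          # simulated common factor (assumption)
iv <- data.frame(IV1 = lc + rnorm(n),
                 IV2 = lc + rnorm(n),
                 IV3 = lc + rnorm(n))
dv <- lc + 0.5 * iv$IV3 + rnorm(n)      # simulated outcome (assumption)

m   <- rowMeans(iv)                     # average of IV1 to IV3
res <- sapply(iv, function(v) resid(lm(v ~ m)))  # one residual column per indicator

max_row_sum <- max(abs(rowSums(res)))   # numerically zero: residuals are linearly dependent
fit_res     <- lm(dv ~ res)
coef(fit_res)                           # the coefficient of the last residual is NA
```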

For your question:
How do I put this down in a SEM model? Is it possible at all?
The answer, I think, is yes--at least if I understand you correctly.
If what you want to do is predict an outcome using a latent variable and the unique variance of one of its indicators, this can be easily accomplished in lavaan. See example code below: the first example involves predicting an outcome from a latent variable alone, whereas the second example predicts the same outcome from the same latent variable as well as the unique variance of one of the indicators of that latent variable:
#Call lavaan and use HolzingerSwineford1939 data set
library(lavaan)
dat = HolzingerSwineford1939
#Model 1: x4 predicted by lv (visual)
model1 = '
visual =~ x1 + x2 + x3
x4 ~ visual
'
#Fit model 1 and get fit measures and r-squared estimates
fit1 <- cfa(model1, data = dat, std.lv = TRUE)
summary(fit1, fit.measures = TRUE, rsquare = TRUE)
#Model 2: x4 predicted by lv (visual) and residual of x3
model2 = '
visual =~ x1 + x2 + x3
x4 ~ visual + x3
'
#Fit model 2 and get fit measures and r-squared estimates
fit2 <- cfa(model2, data = dat, std.lv = TRUE)
summary(fit2, fit.measures = TRUE, rsquare = TRUE)
Notice that the R-squared for x4 (the hypothetical outcome) is much larger when predicted by both the latent variable onto which x3 loads, and x3's unique variance.
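To put the two R-squared values side by side without reading through both summaries, something like the following sketch may help. It refits the same two models and pulls the R-squared of x4 with lavInspect(); the object names r2 and lrt are just illustrative. Since model 1 is model 2 with the x3 path fixed to zero, lavTestLRT() can also be used to test that single path.

```r
library(lavaan)
dat <- HolzingerSwineford1939

model1 <- '
  visual =~ x1 + x2 + x3
  x4 ~ visual
'
model2 <- '
  visual =~ x1 + x2 + x3
  x4 ~ visual + x3
'
fit1 <- cfa(model1, data = dat, std.lv = TRUE)
fit2 <- cfa(model2, data = dat, std.lv = TRUE)

# R-squared of the outcome x4 under each model
r2 <- c(model1 = as.numeric(lavInspect(fit1, "rsquare")["x4"]),
        model2 = as.numeric(lavInspect(fit2, "rsquare")["x4"]))
r2

# Likelihood-ratio test of the added path from x3 to x4
lrt <- lavTestLRT(fit1, fit2)
lrt
```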
As for your second question:
Bonus question: Does that make sense? And even more: Does it make sense from a SEM view (theoretically it does) that each of the independent variables has such a path to DV?
It can make sense, in some cases, to specify such paths, but I would not do so in the absence of strong theory. For example, perhaps you think a variable is a weak, but theoretically important indicator of a greater latent variable--such as the experience of "awe" is for "positive affect". But perhaps your investigation isn't interested in the latent variable, per se--you are interested in the unique effects of awe for predicting something above and beyond its manifestation as a form of positive affect. You might therefore specify a regression pathway from the unique variance of awe to the outcome, in addition to the pathway from positive affect to the outcome.
But could/should you do this for each of your variables? Well, no, you couldn't. As you can see, this particular case only has one remaining degree of freedom, so the model is on the verge of being under-identified (and would be, if you specified the remaining two possible paths from the unique variances of x1 and x2 to the outcome of x4).
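The degrees-of-freedom count can be checked by hand. As a rough sketch (variable names here are illustrative): with 4 observed variables there are 4 * 5 / 2 = 10 unique variances and covariances, and model 2 estimates 9 free parameters (3 loadings, 4 residual variances, 2 regression paths, with the latent variance fixed to 1 by std.lv = TRUE), which leaves 1 degree of freedom. Two more paths would need 11 parameters for 10 pieces of information.

```r
library(lavaan)

model2 <- '
  visual =~ x1 + x2 + x3
  x4 ~ visual + x3
'
fit2 <- cfa(model2, data = HolzingerSwineford1939, std.lv = TRUE)

n_obs_vars <- 4
n_moments  <- n_obs_vars * (n_obs_vars + 1) / 2    # unique variances and covariances
n_free     <- as.numeric(fitMeasures(fit2, "npar")) # free parameters in the model
df_model   <- as.numeric(fitMeasures(fit2, "df"))

c(moments = n_moments, free = n_free, df = df_model)
```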
Moreover, I think many would be skeptical of your motivation for attempting to specify all these pathways. Modelling the pathway from the latent variable to the outcome allows you to speak to a broader process; what would you learn by modelling each and every pathway from unique variance to outcome? Sure, you might be able to say, "Well the remaining "stuff" in this variable predicts x4!"...but what could you say about the nature of that "stuff"--it's just isolated manifest variance. Instead, I think you would be on stronger theoretical ground to consider additional common factors that might underlie the remaining variance of your variables (e.g., method factors); this would add more conceptual specificity to your analyses.

Related

Correlation Syntax for R

Very basic question, it's my first time writing syntax in R. Trying to write basic correlation syntax. Hypothesis is as follows: X1 (Predictor variable) and X2 (latent predictor variable) will be positively associated with Y (outcome variable), over and above X3 (latent predictor variable). How can I write this in R?
Not sure what your statistics chops are, but the fit as measured by the R-squared value will never decrease as you add variables to your model, so a larger R-squared alone does not show that X1 and X2 matter. If these variables are stored in a data frame df,
model_full <- lm(Y ~ X1 + X2 + X3, data = df)
fits the full model. Use summary(model_full) to view summary statistics of the model.
model_reduced <- lm(Y ~ X3, data = df)
fits the reduced model. Here's where the more complicated stuff comes in. To test the value of X1 and X2, you probably want an F-test to test whether the coefficients on X1 and X2 are jointly statistically significantly different from zero (this is how I interpret 'above and beyond X3'). To compute that test, use
lmtest::waldtest(model_full, model_reduced, test = "F")
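If lmtest is not installed, base R's anova() gives the same kind of nested-model F-test. Here is a minimal sketch on simulated data; the data frame, the effect sizes, and the treatment of X2 and X3 as observed scores rather than latent variables are all assumptions made for illustration.

```r
set.seed(42)
n  <- 200
df <- data.frame(X1 = rnorm(n), X2 = rnorm(n), X3 = rnorm(n))
df$Y <- 0.5 * df$X1 + 0.3 * df$X2 + 0.4 * df$X3 + rnorm(n)  # made-up effects

model_reduced <- lm(Y ~ X3, data = df)
model_full    <- lm(Y ~ X1 + X2 + X3, data = df)

# Joint F-test of the X1 and X2 coefficients (2 numerator degrees of freedom)
cmp <- anova(model_reduced, model_full)
cmp

# R-squared can only go up (or stay equal) when predictors are added
r2_gain <- summary(model_full)$r.squared - summary(model_reduced)$r.squared
r2_gain
```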
Hope this helps!

Is there a way to force the coefficient of the independent variable to be a positive coefficient in the linear regression model used in R?

In lm(y ~ x1 + x2 + x3 + ... + xn), not all independent variables have a positive effect.
For example, we know that x1 to x5 must have positive coefficients and x6 to x10 must have negative coefficients.
However, when lm(y ~ x1 + x2 + x3 + ... + x10) is fitted in R, some of x1 to x5 get negative coefficients and some of x6 to x10 get positive coefficients.
I want to control this within a linear regression method. Is there a good way?
The sign of a coefficient may change depending upon the predictor's correlation with the other predictors. As @TarJae noted, this seems like an example of (or counterpart to?) Simpson's Paradox, which describes cases where the sign of a correlation can reverse depending on whether we condition on another variable.
Here's a concrete example in which I've made two independent variables, x1 and x2, which are both highly correlated to y, but when they are combined the coefficient for x2 reverses sign:
# specially chosen seed; most seeds' result isn't as dramatic
set.seed(410)
df1 <- data.frame(y = 1:10,
x1 = rnorm(10, 1:10),
x2 = rnorm(10, 1:10))
lm(y ~ ., df1)
Call:
lm(formula = y ~ ., data = df1)
Coefficients:
(Intercept) x1 x2
-0.2634 1.3990 -0.4792
This result is not incorrect, but arises here (I think) because the prediction errors from x1 happen to be correlated with the prediction errors from x2, such that a better prediction is created by subtracting some of x2.
EDIT, additional analysis:
The more independent series you have, the more likely you are to see this phenomenon arise. For my example with just two series, only 2.4% of the integer seeds from 1 to 1000 produce this phenomenon, where one of the series produces a negative regression coefficient. This increases to 16% with three series, 64% of the time with five series, and 99.9% of the time with 10 series.
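The exact procedure behind these percentages is not shown, so the following is only a guess at how such a count could be set up: for each seed, generate the series as in the example above, fit the regression, and record whether any slope is negative. The function name share_with_negative is made up, and the result need not match the figures quoted above.

```r
share_with_negative <- function(n_series, seeds = 1:1000) {
  flipped <- vapply(seeds, function(s) {
    set.seed(s)
    x <- replicate(n_series, rnorm(10, 1:10))  # each column is 1:10 plus noise
    d <- data.frame(y = 1:10, x)
    # TRUE if at least one slope (intercept excluded) is negative
    any(coef(lm(y ~ ., data = d))[-1] < 0, na.rm = TRUE)
  }, logical(1))
  mean(flipped)
}

share2 <- share_with_negative(2)
share3 <- share_with_negative(3)
c(two_series = share2, three_series = share3)
```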
Constraints
Possibilities include using:
nls with algorithm = "port" in which case upper and lower bounds can be specified.
nnnpls in the nnls package which supports upper and lower 0 bounds or use nnls in the same package if all coefficients should be non-negative.
bvls (bounded value least squares) in the bvls package and specify the bounds.
there is an example of performing non-negative least squares in the vignette of the CVXR package.
reformulate it as a quadratic programming problem (see Wikipedia for the formulation) and use quadprog package.
nnls in the limSolve package. Negate the columns that should have negative coefficients to convert it to a non-negative least squares problem.
These packages mostly do not have a formula interface but instead require that a model matrix and dependent variable be passed as separate arguments. If df is a data frame containing the data and if the first column is the dependent variable then the model matrix can be calculated using:
A <- model.matrix(~., df[-1])
and the dependent variable is
df[[1]]
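As a minimal sketch of the first option, nls with algorithm = "port" accepts lower and upper bounds in the order of the start list. The simulated data, the parameter names b0, b1, b2 and the chosen bounds are assumptions for illustration: b1 is forced to be non-negative and b2 to be non-positive, while the intercept is left free.

```r
set.seed(1)
n  <- 100
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- 1 + 2 * df$x1 - 0.5 * df$x2 + rnorm(n)   # made-up true coefficients

fit <- nls(y ~ b0 + b1 * x1 + b2 * x2, data = df,
           start = list(b0 = 0, b1 = 1, b2 = -1),  # must lie inside the bounds
           algorithm = "port",
           lower = c(-Inf, 0, -Inf),               # b1 >= 0
           upper = c( Inf, Inf, 0))                # b2 <= 0
b <- coef(fit)
b
```

When a bound is active, the estimate sits exactly on it (for example b1 = 0), and the usual standard errors from summary() should be read with caution.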
Penalties
Another approach is to add a penalty to the least squares objective function, i.e. the objective function becomes the sum of the squares of the residuals plus one or more additional terms that are functions of the coefficients and tuning parameters. Although doing this does not impose any hard constraints to guarantee the desired signs it may result in the correct signs anyways. This is particularly useful if the problem is ill conditioned or if there are more predictors than observations.
linearRidge in the ridge package will minimize the sum of the square of the residuals plus a penalty equal to lambda times the sum of the squares of the coefficients. lambda is a scalar tuning parameter which the software can automatically determine. It reduces to least squares when lambda is 0. The software has a formula method which along with the automatic tuning makes it particularly easy to use.
glmnet adds penalty terms containing two tuning parameters. It includes least squares and ridge regression as special cases. It also supports bounds on the coefficients. There are facilities to automatically set the two tuning parameters, but it does not have a formula method and the procedure is not as straightforward as in the ridge package. Read the vignettes that come with it for more information.
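The coefficient bounds in glmnet are set through its lower.limits and upper.limits arguments. The sketch below uses simulated data and an arbitrary, untuned lambda, both of which are assumptions; alpha = 0 selects the ridge penalty.

```r
library(glmnet)

set.seed(1)
n <- 200
x <- cbind(x1 = rnorm(n), x2 = rnorm(n))          # glmnet needs a matrix, not a formula
y <- 1 + 2 * x[, "x1"] - 0.5 * x[, "x2"] + rnorm(n)

fit <- glmnet(x, y, alpha = 0, lambda = 0.1,      # lambda chosen arbitrarily here
              lower.limits = c(0, -Inf),          # x1 coefficient >= 0
              upper.limits = c(Inf, 0))           # x2 coefficient <= 0
b <- as.matrix(coef(fit))[, 1]                    # named vector including the intercept
b
```

In practice lambda would usually be chosen with cv.glmnet(), which passes the same limit arguments through to glmnet().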
1- One way is to define an optimization problem and minimize the mean squared error subject to constraints and bounds (nlminb, optim, etc.).
2- Another is to use the lavaan package, as described here:
https://stats.stackexchange.com/questions/96245/linear-regression-with-upper-and-or-lower-limits-in-r

How to find overall significance for main effects in a dummy interaction using anova()

I run a Cox Regression with two categorical variables (x1 and x2) and their interaction. I need to know the significance of the overall effect of x1, x2 and of the interaction.
The overall effect of the interaction:
I know how do find out the overall effect of the interaction using anova():
library(survival)
fit_x1_x2 <- coxph(Surv(time, death) ~ x1 + x2 , data= df)
fit_full <- coxph(Surv(time, death) ~ x1 + x2 + x1:x2, data= df)
anova(fit_x1_x2, fit_full)
But how are we supposed to use anova() to find out the overall effect of x1 or x2? What I tried is this:
The overall effect of x1
fit_x2_ia <- coxph(Surv(time, death) ~ x2 + x1:x2, data= df)
fit_full <- coxph(Surv(time, death) ~ x1 + x2 + x1:x2, data= df)
anova(fit_x2_ia, fit_full)
The overall effect of x2
fit_x1_ia <- coxph(Surv(time, death) ~ x1 + x1:x2, data= df)
fit_full <- coxph(Surv(time, death) ~ x1 + x2 + x1:x2, data= df)
anova(fit_x1_ia, fit_full)
I am not sure whether this is how we are supposed to use anova(). The fact that the output shows degree of freedom is zero makes me sceptical. I am even more puzzled that both times, for the overall effect of x1 and x2, the test is significant, although the log likelihood values of the models are the same and the Chi value is zero.
Here is the data I used
set.seed(1) # make it reproducible
df <- data.frame(x1= rnorm(1000), x2= rnorm(1000)) # generate data
df$death <- rbinom(1000,1, 1/(1+exp(-(1 + 2 * df$x1 + 3 * df$x2 + df$x1 * df$x2)))) # dead or not
library(tidyverse) # for cut_number() function
df$x1 <- cut_number(df$x1, 4); df$x2 <- cut_number(df$x2, 4) # make predictors to groups
df$time <- rnorm(1000); df$time[df$time<0] <- -df$time[df$time<0] # add survival times
The two models you constructed for the "overall effect" do not satisfy the requirement of being hierarchical, i.e. properly nested. Specifically, if you look at the models that get constructed with that code, you should see that they are actually the same model with different labels for the two-way crossed effects. In both cases you have 15 estimated coefficients (hence zero degrees of freedom difference), and you will note that the x1 parameter in the full model has the same coefficient as the x2[-3.2532,-0.6843):x1[-0.6973,-0.0347) parameter in the "reduced" model looking for an x1 effect, namely 0.19729. The crossing operator is basically filling in all the missing cells for the main effects with interaction results.
There really is little value in looking at interaction models without all of the main effects if you want to stay within the bounds of generally accepted statistical practice.
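The point that the "reduced" and full models are the same model can be checked directly by comparing the number of coefficients and the log-likelihoods. This sketch regenerates the question's data; as an assumption, it bins with cut() on sample quartiles (via a small helper named quartiles) instead of cut_number(), to avoid loading tidyverse.

```r
library(survival)

set.seed(1)
df <- data.frame(x1 = rnorm(1000), x2 = rnorm(1000))
df$death <- rbinom(1000, 1, 1 / (1 + exp(-(1 + 2 * df$x1 + 3 * df$x2 + df$x1 * df$x2))))
quartiles <- function(v) cut(v, quantile(v, 0:4 / 4), include.lowest = TRUE)
df$x1 <- quartiles(df$x1)
df$x2 <- quartiles(df$x2)
df$time <- abs(rnorm(1000))

fit_x2_ia <- coxph(Surv(time, death) ~ x2 + x1:x2, data = df)
fit_full  <- coxph(Surv(time, death) ~ x1 + x2 + x1:x2, data = df)

# Same number of columns in both design matrices, so 0 df difference
n_coef <- c(reduced = length(coef(fit_x2_ia)), full = length(coef(fit_full)))
n_coef

# The log-likelihoods should agree up to numerical tolerance
loglik <- c(reduced = as.numeric(logLik(fit_x2_ia)), full = as.numeric(logLik(fit_full)))
loglik
```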
If you type:
fit_full
... you should get a summary of the model that has p-values for x1 levels, x2 levels, and the interaction levels. Because you chose to categorize these by four arbitrary cutpoints each, you will end up with a total of 15 parameter estimates. If instead you made no cuts and modeled the linear effects and the linear-by-linear interaction, you could get three p-values directly. I'm guessing there was suspicion that the effects were not linear, and if so I would think a cubic spline model might be more parsimonious and distort the biological reality less than discretization into 4 disjoint levels. If you thought the effects might be non-linear but ordinal, there is an ordinal version of factor-classed variables, but the results are generally confusing to the uninitiated.
The answer from 42- is informative, but after reading it I still did not know how to determine the three p-values or whether this is possible at all. So I talked to a professor of biostatistics at my university. His answer was quite simple, and I share it in case others have similar questions.
In this design it is not possible to determine the three p-values for the overall effect of x1, x2 and their interaction. If we want to know the p-values of the three overall effects, we need to keep the continuous variables as they are. Breaking up the variables into groups answers a different question, hence we cannot test the hypothesis of the overall effects no matter which statistical model we use.

Interpretation of main effects when interaction is present in gam

Consider a GAM model with the following structure:
gam(y ~ s(x1, by = x2) + x2 + s(x3)), where x1 and x3 are continuous variables and x2 is categorical. If I want to know the effect of x1 (in terms of deviance explained), I remove x1 from the model and compare the deviance explained (following this thread), like this:
model1 <- gam(y ~ s(x1, by = x2) + x2 + s(x3))
model2 <- gam(y ~ x2 + s(x3))
## deviance explained by x1:
summary(model1)$dev.expl-summary(model2)$dev.expl
But what if I want to know the effect of x2? I am not interested in the effect of x2 on x1; I just want to know the effect of x2 by itself. Could I do this:
model3 <- gam(y ~ s(x1, by = x2) + s(x3))
## deviance explained by x2:
summary(model1)$dev.expl-summary(model3)$dev.expl
I know that for linear models, if a significant interaction is present, one cannot remove the main effects of the variables in that interaction, even if they are not significant. Does the same apply here, in that I cannot know the effect of x2 on y independently of its effect on x1?
Yes, the same applies here. Whenever a variable is involved in an interaction, you cannot make a statement about the effect of that variable on its own.
However, notice that the kind of effect you retrieve from explained deviance does not have the same interpretation as the usual one in linear models, where you state that a one-unit change in x2 corresponds to a change of beta2 in the mean of y. These are two different kinds of effect. Hence, by removing only the x2 term, you can still say that you get a difference in explained deviance that is interpretable. The only difference is that the interpretation is in terms of information loss, or uncertainty decrease, which is absolutely fine to do.
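For reference, a runnable version of the deviance comparison from the question might look like this with mgcv. The simulated data frame d and its effect sizes are assumptions made only so the code runs; the quantity of interest is the difference in the dev.expl element of the two summaries.

```r
library(mgcv)

set.seed(1)
n <- 300
d <- data.frame(x1 = runif(n), x3 = runif(n),
                x2 = factor(sample(c("a", "b", "c"), n, replace = TRUE)))
g <- as.integer(d$x2)                                  # group code 1, 2, 3
d$y <- g * sin(2 * pi * d$x1) + g + d$x3^2 + rnorm(n, sd = 0.3)  # made-up signal

model1 <- gam(y ~ s(x1, by = x2) + x2 + s(x3), data = d)
model2 <- gam(y ~ x2 + s(x3), data = d)

# Share of deviance explained that is lost when the x1 smooths are dropped
gain_x1 <- summary(model1)$dev.expl - summary(model2)$dev.expl
gain_x1
```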

GLMERTREE: Prevent clustered observations from being split among 2 terminal nodes

I have a dataset where observations are taken repeatedly at 72 different sites. I'm using an lmertree model from the glmertree package with a random intercept, a treatment variable, and many "partitioning" variables to identify whether there are clusters of sites that have different relationships between the treatment and response variables depending on the partitioning variables. In most cases, the model does not split observations from the same site among the different terminal nodes, but in a few cases it does.
Is there some way to constrain the model to ensure that non-independent observations are included in the same terminal node?
The easiest way to do so is to consider only partitioning variables at site level. If these site variables are not constant across the observations at the site, it might make sense to average them for each site. For example, to average a regressor x2 to x2m in data d:
d$x2m <- tapply(d$x2, d$site, mean)[d$site]
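The tapply() line relies on site being a factor or character vector; with numeric site IDs the final indexing would go by position rather than by name. A small alternative sketch uses base R's ave(), which returns the group mean aligned to each row regardless of how site is stored. The toy data frame is made up for illustration.

```r
d <- data.frame(site = c("a", "b", "a", "c", "b", "c"),
                x2   = c(1, 10, 3, 5, 20, 7))

# ave() applies mean() within each site by default and keeps row order
d$x2m <- ave(d$x2, d$site)
d$x2m
# → 2 15 2 6 15 6
```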
If you have additional variables at observation level rather than at site level, it might make sense to include them (a) in the regression part of the formula so that the corresponding coefficients are site-specific in the tree,
or (b) in the random effect part of the formula so that only a global coefficient is estimated. For example, if you have a single observation-level regressor z and two site-level regressors x1 and x2 that you want to use for partitioning, you might consider
## (a)
y ~ treatment + z | site | x1 + x2
## (b)
y ~ treatment | (1 | site) + z | x1 + x2
Finally, we discovered recently that in the case of having cluster-level (aka site-level) covariates with strong random effects it might make sense to initialize the estimation of the model with the random effect and not with the tree. The simple reason is that if we start estimation with the tree, this will then capture the random intercepts through splits in the cluster-level variables. We plan to adapt the package accordingly but haven't done so yet. However, it is easy to do so yourself. You simply start with estimating the null model for only the random intercept:
null_lmm <- lmer(y ~ (1 | site), data = d)
Then you extract the random intercept and include it in your data:
d$ranef <- ranef(null_lmm)$site[[1]][d$site]
And include it as the starting value for the random effect:
tree_lmm <- lmertree(y ~ treatment | site | x1 + x2 + ...,
data = d, ranefstart = d$ranef, ...)
You can try to additionally cluster the covariances at site level by setting cluster = site.
