Structural Equation Model with Linear Dependency (Lavaan) - r

I want to estimate a structural equation model using lavaan in R with a categorical mediator. A wrinkle is that three of the exogenous variables are linearly dependent. However, this shouldn't be a problem since I'm using the categorical mediator to achieve identification a la Judea Pearl's front-door criterion. That is, mathematically each particular equation is identified (see the R code below).
With lavaan in R I can obtain estimates when the mediator is numeric, but not when it is categorical. With a categorical mediator I obtain the following error:
Error in lav_samplestats_step1(Y = Data, ov.names = ov.names, ov.types = ov.types,
: lavaan ERROR: linear regression failed for y; X may not be of full rank in group 1
Any advice on how to obtain estimates with a categorical mediator using lavaan?
Code:
# simulating the dataset
set.seed(1234) # seed for replication
x1 <- rep(seq(1:4), 100) # variable 1
x2 <- rep(1:4, each=100) # variable 2
x3 <- x2 - x1 + 4 # linear dependence
m <- sample(0:1, size = 400, replace = TRUE) # mediator
df <- data.frame(cbind(x1,x2,x3,m)) # dataframe
df$y <- 6.5 + x1*(0.5) + x2*(0.2) + m*(-0.4) + x3*(-1) + rnorm(400, 0, 1) # outcome
# structural equation model using pearl's front-door criterion
sem.formula <- 'y ~ 1 + x1 + x2 + m
m ~ 1 + x3'
# continuous mediator: works!
fit <- lavaan::sem(sem.formula, data=df, estimator="WLSMV",
se="none", control=list(iter.max=500))
# categorical mediator: doesn't work
fit <- lavaan::sem(sem.formula, data=df, estimator="WLSMV",
se="none", control=list(iter.max=500),
ordered = "m")

Related

Producing effects plots within the GLMMadaptive package

I am getting an inscrutable error message while trying to run effects plots on objects created using the GLMMadaptive::mixed_model() and effects::predictorEffect() functions.
Here is an example problem created from toy binary longitudinal data, created using code supplied with one of the vignettes included with the GLMMadaptive package.
# Library Relevant Packages
library(GLMMadaptive)
library(effects)
# Now we constuct a data frame with the design:
# everyone has a baseline measurment, and then measurements at random follow-up times
DF <- data.frame(id = rep(seq_len(n), each = K),
time = c(replicate(n, c(0, sort(runif(K - 1, 0, t_max))))),
sex = rep(gl(2, n/2, labels = c("male", "female")), each = K))
# design matrices for the fixed and random effects
X <- model.matrix(~ sex * time, data = DF)
Z <- model.matrix(~ time, data = DF)
betas <- c(-2.13, -0.25, 0.24, -0.05) # fixed effects coefficients
D11 <- 0.48 # variance of random intercepts
D22 <- 0.1 # variance of random slopes
# we simulate random effects
b <- cbind(rnorm(n, sd = sqrt(D11)), rnorm(n, sd = sqrt(D22)))
# linear predictor
eta_y <- as.vector(X %*% betas + rowSums(Z * b[DF$id, ]))
# we simulate binary longitudinal data
DF$y <- rbinom(n * K, 1, plogis(eta_y))
Now when we fit a longitudinal logistic regression using the mixed_model function...
# fit the mixed effects logistic regression for y assuming random intercepts and random slopes for the random-effects part.
fm <- mixed_model(fixed = y ~ sex * time,
random = ~ time | id,
data = DF,
family = binomial())
...and try to create an effects plot using the effects::predictorEffect() function...
plot(predictorEffect("time", fm), type = "link")
...we get the following error
Error in mod.matrix %*% scoef : non-conformable arguments
Has anyone encountered this problem before and if so, found a way to solve it?

Implement null distribution for gbm interaction strength

I am trying to determine which interactions in a gbm model are significant using the method described in Friedman and Popescu 2008 https://projecteuclid.org/euclid.aoas/1223908046. My gbm is a classification model with 9 different classes.
I'm struggling with how to translate Section 8.3 into code to run in R.
I think the overall process is to:
Train a version of the model with max.depth = 1
Simulate response data from this model
Train a new model on this data with max.depth the same as the real model
Get interaction strength for this model
Repeat steps 1-4 to create a null distribution of interaction strengths
The part that I am finding most confusing is implementing equations 48 and 49. (You will have to look at the linked article since I can't reproduce them here)
This is what I think I understand but please correct me if I'm wrong:
y_i is a new vector of the response that we will use to train a new model which will provide the null distribution of interaction statistics.
F_A(x_i) is the prediction from a version of the gbm model trained with max.depth = 1
b_i is a probability between 0 and 1 based on the prediction from the additive model F_A(x_i)
Questions
What is subscript i? Is it the number of iterations in the bootstrap?
How is each artificial data set different from the others?
Are we subbing the Pr(b_i = 1) into equation 48?
How can this be done with multinomial classification?
How would one implement this in R? Preferably using the gbm package.
Any ideas or references are welcome!
Overall, the process is an elegant way to neutralise the interaction effects in y by permuting/re-distributing the extra contribution of modelling on the interactions. The extra contribution could be captured by the margins between a full and an additive model.
What is subscript i? Is it the number of iterations in the bootstrap?
It is the index of samples. There are N samples in each iteration.
How is each artificial data set different from the others?
The predictors X are the same across data sets. The response values Y~ are different due to random permutation of margins in equation 47 and random realisation (for categorical outcomes only) in equation 48.
Are we subbing the Pr(b_i = 1) into equation 48?
Yes, if the outcome Y is binary.
How can this be done with multinomial classification?
One way is to randomly permute margins in the log-odds of each category. Followed by random realisation according to the probability from the additive model.
How would one implement this in R? Preferably using the gbm package.
I tried to implement it in-line with your overall process.
Firstly, a simulated training data set {X1,X2,Y} of size N=200 where Y has three categories (Y1,Y2,Y3) realised by the probabilities determined by X1, X2. The interaction part X1*X2 is in Y1, while the additive parts are in Y2,Y3.
set.seed(1)
N <- 200
X1 <- rnorm(N) # 2 predictors
X2 <- rnorm(N)
#log-odds for 3 categories
Y1 <- 2*X1*X2 + rnorm(N, sd=1/10) # interaction
Y2 <- X1^2 + rnorm(N, sd=1/10) #additive
Y3 <- X2^2 + rnorm(N, sd=1/10) #additive
Y <- rep(NA, N) # Multinomial outcome with 3 categories
for (i in 1:N)
{
prob <- 1 / (1 + exp(-c(Y1[i],Y2[i],Y3[i]))) #logistic regression
Y[i] <- which.max(rmultinom(1, 10000, prob=prob)) #realisation from prob
}
Y <- factor(Y)
levels(Y) <- c('Y1','Y2','Y3')
table(Y)
#Y1 Y2 Y3
#38 75 87
dat = data.frame(Y, X1, X2)
head(dat)
# Y X1 X2
# 2 -0.6264538 0.4094018
# 3 0.1836433 1.6888733
# 3 -0.8356286 1.5865884
# 2 1.5952808 -0.3309078
# 3 0.3295078 -2.2852355
# 3 -0.8204684 2.4976616
Train a full and an additive models with max.depth = 2 and 1 respectively.
library(gbm)
n.trees <- 100
F_full <- gbm(Y ~ ., data=dat, distribution='multinomial', n.trees=n.trees, cv.folds=3,
interaction.depth=2) # consider interactions
F_additive <- gbm(Y ~ ., data=dat, distribution='multinomial', n.trees=n.trees, cv.folds=3,
interaction.depth=1) # ignore interactions
#use improved prediction as interaction strength
interaction_strength_original <- min(F_additive$cv.error) - min(F_full$cv.error)
> 0.1937891
Simulate response data from this model.
#randomly permute margins (residuals) of log-odds to remove any interaction effects
margin <- predict(F_full, n.trees=gbm.perf(F_full, plot.it=FALSE), type='link')[,,1] -
predict(F_additive, n.trees=gbm.perf(F_additive, plot.it=FALSE), type='link')[,,1]
margin <- apply(margin, 2, sample) #independent permutation for each category (Y1, Y2, Y3)
Y_art <- rep(NA, N) #response values of an artificial dataset
for (i in 1:N)
{
prob <- predict(F_additive, n.trees=gbm.perf(F_additive, plot.it=FALSE), type='link',
newdata=dat[i,])
prob <- prob + margin[i,] # equation (47)
prob <- 1 / (1 + exp(-prob))
Y_art[i] <- which.max(rmultinom(1, 1000, prob=prob)) #Similar to random realisation in equation (49)
}
Y_art <- factor(Y_art)
levels(Y_art) = c('Y1','Y2','Y3')
table(Y_art)
#Y1 Y2 Y3
#21 88 91
Train a new model on this artificial data with max.depth (2) the same as the real model
F_full_art = gbm(Y_art ~ ., distribution='multinomial', n.trees=n.trees, cv.folds=3,
data=data.frame(Y_art, X1, X2),
interaction.depth=2)
F_additive_art = gbm(Y_art ~ ., distribution='multinomial', n.trees=n.trees, cv.folds=3,
data=data.frame(Y_art, X1, X2),
interaction.depth=1)
Get interaction strength for this model
interaction_strength_art = min(F_additive_art$cv.error) - min(F_full_art$cv.error)
> 0.01323959 # much smaller than interaction_strength_original in step 1.
Repeat steps 2-4 to create a null distribution of interaction strengths. As expected, the interaction effects are much lower (-0.0527 to 0.0421) in the neutralised data sets than in the original training data set (0.1938).
interaction_strength_art <- NULL
for (j in 1:10)
{
print(j)
interaction_strength_art <- c(interaction_strength_art, step_2_to_4())
}
summary(interaction_strength_art)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
#-0.052648 -0.019415 0.001124 -0.004310 0.012759 0.042058
interaction_strength_original
> 0.1937891

bivariate Probit/logit R : how to find ALL coefficients and marginal effects with the "zeligverse" package

I am running a bivariate logit model in R with the zeligverse package.I want to calculate the impact of my independant variables on P(Y1=1), P(Y2=1), P(Y1=1,Y2=0), P(Y1=1,Y2=1), P(Y1=0,Y2=1), P(Y1=0,Y2=0), P(Y1=1|Y2=0) and all the other conditional probabilities (Y1 and Y2 are my dependant variables. They both equal 0 or 1). I also want all the marginal effects associated with these probabilities for each independant variable.
Do you know how to find those in this package (or in another package if it works better)?
Not sure this is what you are looking for (feel free to mark me down if not). Zelig packages do seem to be a right choice for your specific question.
library(Zelig)
## Let X_i be independent variable
## Assume you are working with a univariate target variable Y where Y \in {0, 1}
set.seed(123)
m <- 100
df <- data.frame(
Y = rbinom(m, 1, 0.5),
X1 = rbinom(m, 1, 0.95),
X2 = rbinom(m, 1, 0.95)
)
## Fit model once:
fit <- zelig(
Y ~ .,
model = "logit",
data = df,
cite = FALSE
)
summary(fit)
## Let's focus on the binomial predictor 2
x.out1 <- setx(fit, X2=1)
## Run estimation based on a posterior distribution:
postFit <- Zelig::sim(fit, x=x.out1)
summary(postFit)
# plot(postFit)

R : Plotting Prediction Results for a multiple regression

I want to observe the effect of a treatment variable on my outcome Y. I did a multiple regression: fit <- lm (Y ~ x1 + x2 + x3). x1 is the treatment variable and x2, x3 are the control variables. I used the predict function holding x2 and x3 to their means. I plotted this predict function.
Now I would like to add a line to my plot similar to a simple regression abline but I do not know how to do this.
I think I have to use line(x,y) where y = predict and x is a sequence of values for my variable x1. But R tells me the lengths of y and x differ.
I think you are looking for termplot:
## simulate some data
set.seed(0)
x1 <- runif(100)
x2 <- runif(100)
x3 <- runif(100)
y <- cbind(1,x1,x2,x3) %*% runif(4) + rnorm(100, sd = 0.1)
## fit a model
fit <- lm(y ~ x1 + x2 + x3)
termplot(fit, se = TRUE, terms = "x1")
termplot uses predict.lm(, type = "terms") for term-wise prediction. If a model has intercept (like above), predict.lm will centre each term (What does predict.glm(, type=“terms”) actually do?). In this way, each terms is predicted to be 0 at the mean of the covariate, and the standard error at the mean is 0 (hence the confidence interval intersects the line at the mean).

Removing additive effect in Multiple Regression in R

I have this data set that I will used for my model
set.seed(123)
x <- rnorm(100)
DF <- data.frame(x = x,
y = 4 + (1.5*x) + rnorm(100, sd = 2),
b = as.factor(round(abs(DF$x/3))),
c = as.factor(round(abs(DF$y/3)))
)
I was assigned to create a multiplicative model for them with a based 5 like this equation:
y=5*b(i)*c(i)
but the best that I can do is this one:
m1 <- lm(y ~ b*c, data = DF)
summary(m1)
This model is okay but I do want to remove the additive effect and just get the multiplicative model and I also replace the intercept with 5 and create difference coefficient for the first level of b and c.
Is there a way in R to do this task?
To fit the model without a constant use lm(y~b*c -1,...). Setting a fixed constant can be done by specifying the offset and not fitting the constant or by subtracting the known constant from the dependent variable and fitting a model with no constant.
set.seed(123)
x <- rnorm(100)
DF <- as.data.frame(cbind(x))
DF$y = 4 + (1.5*x) + rnorm(100, sd = 2)
DF$b = round(abs(DF$x/3))
DF$c = round(abs(DF$y/3))
DF$bc = DF$b*DF$c
m1 <- lm(y~ b*c, data=DF) # model w/ a constant
m2 <- lm(y~ b*c - 1, data=DF) # model w/o a constant
m3 <- lm(y~ b*c -1 + offset(rep(5,nrow(DF))), data=DF) # model w/ a constant of 5
m4 <- lm(y-5~ b*c -1, data=DF) # subtracting fixed constant from y's

Resources