I have this data set that I will use for my model:
set.seed(123)
x <- rnorm(100)
y <- 4 + (1.5*x) + rnorm(100, sd = 2)
# b and c are built from the x and y vectors, because DF does not exist yet inside data.frame()
DF <- data.frame(x = x,
y = y,
b = as.factor(round(abs(x/3))),
c = as.factor(round(abs(y/3)))
)
I was assigned to create a multiplicative model for them with a base of 5, like this equation:
y=5*b(i)*c(i)
but the best that I can do is this one:
m1 <- lm(y ~ b*c, data = DF)
summary(m1)
This model is okay, but I want to remove the additive effects and keep only the multiplicative part. I also want to replace the intercept with 5 and estimate distinct coefficients for the first levels of b and c.
Is there a way in R to do this task?
To fit the model without a constant, use lm(y ~ b*c - 1, ...). To fix the constant at a known value, either supply that value as an offset while suppressing the intercept, or subtract it from the dependent variable and fit a model with no constant.
set.seed(123)
x <- rnorm(100)
DF <- as.data.frame(cbind(x))
DF$y = 4 + (1.5*x) + rnorm(100, sd = 2)
DF$b = round(abs(DF$x/3))
DF$c = round(abs(DF$y/3))
DF$bc = DF$b*DF$c
m1 <- lm(y~ b*c, data=DF) # model w/ a constant
m2 <- lm(y~ b*c - 1, data=DF) # model w/o a constant
m3 <- lm(y~ b*c -1 + offset(rep(5,nrow(DF))), data=DF) # model w/ a constant of 5
m4 <- lm(y-5~ b*c -1, data=DF) # subtracting fixed constant from y's
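As a quick sanity check, the offset version and the subtract-from-y version should give the same coefficients. The sketch below rebuilds the same simulated data and compares the two fits; wrapping the response in I() is my own precaution rather than part of the code above.

```r
set.seed(123)
x <- rnorm(100)
DF <- data.frame(x = x)
DF$y <- 4 + 1.5 * x + rnorm(100, sd = 2)
DF$b <- round(abs(DF$x / 3))
DF$c <- round(abs(DF$y / 3))

# Constant fixed at 5 via an offset, with the intercept suppressed
m3 <- lm(y ~ b * c - 1 + offset(rep(5, nrow(DF))), data = DF)

# Same idea: subtract the known constant from y, then fit without an intercept
m4 <- lm(I(y - 5) ~ b * c - 1, data = DF)

# The two parameterisations should agree up to numerical tolerance
same_coefs <- isTRUE(all.equal(coef(m3), coef(m4)))
print(same_coefs)
print(cbind(offset = coef(m3), shifted_y = coef(m4)))
```

Note that fitted(m3) is on the scale of y (the offset is added back), whereas fitted(m4) is on the scale of y - 5.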
I want to test for a correlation between the random effects of a GLMM fitted with lme4. It has already been suggested that I run a likelihood ratio comparison of models with and without the random-effect correlation. That comparison is significant, but I would like to know whether there is a way to get confidence intervals or p-values for this correlation from the model itself.
(Specifically, I compared a model with the random-effects structure (1 + X1 + X2 || group) against (1 + X1 + X2 | group). The problem is that the second model also includes the correlations with the intercept, whereas I want to test specifically the correlation between X1 and X2. Unfortunately, a model with (1 + X1 | group) + (1 + X2 | group) does not converge.)
Any help would be appreciated
You can use confint() to get likelihood profile confidence intervals. P-values would be harder; you could do parametric bootstrapping but it would be slow.
set.seed(101)
dd <- data.frame(x = rnorm(1000), y = rnorm(1000),
g = factor(sample(1:20, size = 1000, replace = TRUE)))
library(lme4)
dd$z <- simulate(~ x + y + (1 + x + y | g),
newdata = dd,
newparams = list(beta = rep(1, 3),
sigma = 1,
theta = rep(1, 6)))[[1]]
m <- lmer(z ~ x + y + (1 + x + y | g),
data = dd)
In the confint() call below, parm = "theta_" means "all covariance parameters". You could use parm = c(2, 3, 5) to select only the correlation parameters, but you'd have to read ?profile.merMod and think carefully to figure out the correct indices ...
cc <- confint(m, parm = "theta_", oldNames = FALSE)
Results give you 95% (by default) CIs for all of the covariance parameters. In this example, the x/y slope correlation is significant but the correlations between (intercept and x) and (intercept and y) aren't. (Note that the correlations aren't necessarily invariant to reparameterizing the model, in particular centering or otherwise shifting the predictors will change the answers ...)
cc
                          2.5 %    97.5 %
sd_(Intercept)|g     0.38142602 0.7451417
cor_x.(Intercept)|g -0.15990774 0.6492967
cor_y.(Intercept)|g -0.01148283 0.7294138
sd_x|g               0.67205037 1.2800681
cor_y.x|g            0.53404483 0.9116571
sd_y|g               0.83378353 1.5742580
sigma                0.94201110 1.0311559
I am currently using the lavaan package in R for structural equation models. I would like to compute effect sizes (i.e., partial eta squared) for each of my path coefficients. Is there already a package that does this?
For instance, how can I compute the effect size of the c, a and b regression coefficients?
library(lavaan)
set.seed(1234)
X <- rnorm(100)
M <- 0.5*X + rnorm(100)
Y <- 0.7*M + rnorm(100)
Data <- data.frame(X = X, Y = Y, M = M)
model <- ' # direct effect
Y ~ c*X
# mediator
M ~ a*X
Y ~ b*M
# indirect effect (a*b)
ab := a*b
# total effect
total := c + (a*b)
'
fit <- sem(model, data = Data)
summary(fit)
Ideally the method should also work when building models based on latent variables.
How can I simulate data so that the coefficients recovered by lm exactly equal particular pre-determined values and the residuals are normally distributed? For example, could I generate data so that lm(y ~ 1 + x) yields (Intercept) = 1.500 and x = 4.000? I would like the solution to be versatile enough to work for multiple regression with continuous predictors (e.g., lm(y ~ 1 + x1 + x2)), with bonus points if it works for interactions as well (lm(y ~ 1 + x1 + x2 + x1*x2)). It should also work for small N (e.g., N < 200).
I know how to simulate random data which is generated by these parameters (see e.g. here), but that randomness carries over to variation in the estimated coefficients, e.g., Intercept = 1.488 and x = 4.067.
Related: It is possible to generate data that yields pre-determined correlation coefficients (see here and here). So I'm asking if this can be done for multiple regression?
One approach is to use perfectly symmetrical noise. The noise cancels itself out, so the estimated parameters are exactly the input parameters, yet the residuals appear normally distributed.
x <- 1:100
y <- cbind(1,x) %*% c(1.5, 4)
eps <- rnorm(100)
x <- c(x, x)
y <- c(y + eps, y - eps)
fit <- lm(y ~ x)
# (Intercept) x
# 1.5 4.0
plot(fit)
Residuals are normally distributed...
... but exhibit an abnormally perfect symmetry!
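The symmetry can also be checked numerically rather than by eye. This is a minimal sketch that rebuilds the mirrored data with its own seed (the seed and the names x0, y0 and r are mine) and confirms that each residual in the first half is cancelled by its mirror image in the second half.

```r
set.seed(1)
x0 <- 1:100
y0 <- drop(cbind(1, x0) %*% c(1.5, 4))  # noise-free response
eps <- rnorm(100)

# Stack the data with +eps and -eps so the noise cancels in the normal equations
x <- c(x0, x0)
y <- c(y0 + eps, y0 - eps)
fit <- lm(y ~ x)

r <- unname(resid(fit))
# Largest deviation from perfect mirror symmetry; should be numerically zero
max_asym <- max(abs(r[1:100] + r[101:200]))
print(max_asym)
print(coef(fit))
```

Because the fitted line coincides with the noise-free line, the residuals are simply eps and -eps, which is why any diagnostic that looks at pairs of points will reveal the trick.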
EDIT by OP: I wrote up general-purpose code that exploits the symmetrical-residuals trick. It scales well to more complex models. This example also shows that it works for categorical predictors and interaction effects.
library(dplyr)
# Data and residuals
df = tibble(
# Predictors
x1 = 1:100, # Continuous
x2 = rep(c(0, 1), each=50), # Dummy-coded categorical
# Generate y from model, including interaction term
y_model = 1.5 + 4 * x1 - 2.1 * x2 + 8.76543 * x1 * x2,
noise = rnorm(100) # Residuals
)
# Do the symmetrical-residuals trick
# This is copy-and-paste ready, no matter model complexity.
df = bind_rows(
df %>% mutate(y = y_model + noise),
df %>% mutate(y = y_model - noise) # Mirrored
)
# Check that it works
fit <- lm(y ~ x1 + x2 + x1*x2, df)
coef(fit)
# (Intercept) x1 x2 x1:x2
# 1.50000 4.00000 -2.10000 8.76543
You could do rejection sampling:
set.seed(42)
tol <- 1e-8
x <- 1:100
continue <- TRUE
while(continue) {
y <- cbind(1,x) %*% c(1.5, 4) + rnorm(length(x))
if (sum((coef(lm(y ~ x)) - c(1.5, 4))^2) < tol) continue <- FALSE
}
coef(lm(y ~ x))
#(Intercept) x
# 1.500013 4.000023
Obviously, this is a brute-force approach and the smaller the tolerance and the more complex the model, the longer this will take. A more efficient approach should be possible by providing residuals as input and then employing some matrix algebra to calculate y values. But that's more of a maths question ...
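One way to sketch that matrix-algebra idea, assuming ordinary least squares: draw normal noise, regress it on the same predictors, and keep only its residuals. Those residuals are orthogonal to every column of the design matrix, so adding them to the noise-free response leaves the least-squares estimates exactly at the chosen values. The names beta, eps and e_orth are illustrative.

```r
set.seed(42)
x <- 1:100
beta <- c(1.5, 4)          # target (Intercept) and slope
X <- cbind(1, x)

eps <- rnorm(length(x))    # raw normal noise
# Remove the part of the noise that the predictors could explain
e_orth <- resid(lm(eps ~ x))

y <- drop(X %*% beta) + e_orth
fit <- lm(y ~ x)
print(coef(fit))
```

The same pattern should carry over to multiple regression and interactions by putting the full right-hand side in both lm() calls. One caveat: the projected noise has slightly less variance than eps and its elements are weakly correlated, so it is only approximately an i.i.d. normal sample.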
I am applying multiple ML algorithms to this dataset. I tried logistic regression and plotted the predictions, but the result seems completely off: the plot only shows data points from one class. Here is the data and what I attempted:
set.seed(10)
x1 <- runif(500) - 0.5
x2 <- runif(500) - 0.5
y <- ifelse(x1 ^ 2 - x2 ^ 2 > 0, 1, 0)
dat <- data.frame(x1, x2, y)
#Logistic Regression
fit.glm <- glm(y ~ x1 + x2, data = dat, family = "binomial")
y.hat.3 <- predict(fit.glm,dat)
plot(x1,x2,col = c("red","blue")[y.hat.3 + 1])
predict returns log-odds for a logistic regression by default. To get predicted classes, use type = "response" to get predicted probabilities, then apply a decision rule such as p > 0.5 to turn them into classes:
y.hat.3 <- predict(fit.glm, dat, type = "response") > 0.5
plot(x1,x2,col = c("red","blue")[y.hat.3 + 1])
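To make the difference between the two scales concrete, here is a small sketch on the same simulated data. It checks that the default predictions are log-odds by mapping them through the inverse logit (plogis) and comparing against the probabilities; the names log_odds, prob and pred_class are mine.

```r
set.seed(10)
x1 <- runif(500) - 0.5
x2 <- runif(500) - 0.5
y <- ifelse(x1^2 - x2^2 > 0, 1, 0)
dat <- data.frame(x1, x2, y)

fit.glm <- glm(y ~ x1 + x2, data = dat, family = "binomial")

log_odds <- predict(fit.glm, dat)                 # default: link (log-odds) scale
prob <- predict(fit.glm, dat, type = "response")  # probability scale

# plogis() is the inverse logit, so the two scales should agree
scales_agree <- isTRUE(all.equal(plogis(log_odds), prob))
print(scales_agree)

# Decision rule: classify as 1 when the predicted probability exceeds 0.5
pred_class <- as.integer(prob > 0.5)
print(table(predicted = pred_class, observed = dat$y))
```

Even with correct classes, the confusion table is likely to look poor here, because the true boundary x1^2 - x2^2 > 0 is not linear in x1 and x2.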
I want to estimate a structural equation model using lavaan in R with a categorical mediator. A wrinkle is that three of the exogenous variables are linearly dependent. However, this shouldn't be a problem since I'm using the categorical mediator to achieve identification a la Judea Pearl's front-door criterion. That is, mathematically each particular equation is identified (see the R code below).
With lavaan in R I can obtain estimates when the mediator is numeric, but not when it is categorical. With a categorical mediator I obtain the following error:
Error in lav_samplestats_step1(Y = Data, ov.names = ov.names, ov.types = ov.types,
: lavaan ERROR: linear regression failed for y; X may not be of full rank in group 1
Any advice on how to obtain estimates with a categorical mediator using lavaan?
Code:
# simulating the dataset
set.seed(1234) # seed for replication
x1 <- rep(seq(1:4), 100) # variable 1
x2 <- rep(1:4, each=100) # variable 2
x3 <- x2 - x1 + 4 # linear dependence
m <- sample(0:1, size = 400, replace = TRUE) # mediator
df <- data.frame(cbind(x1,x2,x3,m)) # dataframe
df$y <- 6.5 + x1*(0.5) + x2*(0.2) + m*(-0.4) + x3*(-1) + rnorm(400, 0, 1) # outcome
# structural equation model using pearl's front-door criterion
sem.formula <- 'y ~ 1 + x1 + x2 + m
m ~ 1 + x3'
# continuous mediator: works!
fit <- lavaan::sem(sem.formula, data=df, estimator="WLSMV",
se="none", control=list(iter.max=500))
# categorical mediator: doesn't work
fit <- lavaan::sem(sem.formula, data=df, estimator="WLSMV",
se="none", control=list(iter.max=500),
ordered = "m")