Correlation Syntax for R - r

Very basic question, it's my first time writing syntax in R. Trying to write basic correlation syntax. Hypothesis is as follows: X1 (Predictor variable) and X2 (latent predictor variable) will be positively associated with Y (outcome variable), over and above X3 (latent predictor variable). How can I write this in R?

Not sure what your statistics chops are, but the pure correlation as measured by the R-squared value can never decrease when you add variables to your model, so a higher R-squared alone does not show that X1 and X2 matter. So, if these variables are stored in data frame df,
model_full <- lm(Y ~ X1 + X2 + X3, data = df)
fits the full model. Use summary(model_full) to view summary statistics of the model.
model_reduced <- lm(Y ~ X3, data = df)
fits the reduced model. Here's where the more complicated stuff comes in. To test the value of X1 and X2, you probably want an F-test to test whether the coefficients on X1 and X2 are jointly statistically significantly different from zero (this is how I interpret 'above and beyond X3'). To compute that test, use
lmtest::waldtest(model_full, model_reduced, test = "F")
Hope this helps!
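For anyone who wants to try this end to end without lmtest, here is a minimal sketch using only base R. The data are simulated stand-ins (the coefficients 0.4, 0.3 and 0.5 are made up for illustration), and the latent predictors are treated as ordinary observed columns, as in the lm() calls above. For nested lm fits, anova() reports the same kind of joint F-test.

```r
set.seed(123)

# Hypothetical data: Y depends on X1, X2 and X3 plus noise
n  <- 200
df <- data.frame(X1 = rnorm(n), X2 = rnorm(n), X3 = rnorm(n))
df$Y <- 0.4 * df$X1 + 0.3 * df$X2 + 0.5 * df$X3 + rnorm(n)

# Reduced model (X3 only) and full model (X1 and X2 added)
model_reduced <- lm(Y ~ X3, data = df)
model_full    <- lm(Y ~ X1 + X2 + X3, data = df)

# Joint F-test: do X1 and X2 explain anything beyond X3?
cmp <- anova(model_reduced, model_full)
print(cmp)

# R-squared gained by adding X1 and X2 (never negative)
r2_gain <- summary(model_full)$r.squared - summary(model_reduced)$r.squared
print(r2_gain)
```

The second row of the anova table has 2 degrees of freedom because two coefficients are being tested jointly. The signs of the X1 and X2 coefficients in summary(model_full) tell you whether the associations are positive.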

Related

Predict dependent/endogenous variables from new dataset using lavaan path model in R

I have constructed a path model with the R package lavaan (regressions only, no latent variables) using a complete dataset (data1) with values for all variables (x's and y's). The R code I used is as follows:
Specification:
model1 <- 'y1 ~ x1 + x2 + x3 + x4 + y2
y2 ~ x1 + x2 + x4 + x5 + x6'
Fit:
fit1 <- sem(model = model1, data = data1, estimator = "MLR")
I now want to predict/estimate values for y1 and y2 using a new dataset (data2) which includes values for all x variables, but not for the y variables. In other words, I want to predict unknown values for the y variables using the known x variable values and my fitted path model estimates, similar to prediction in regular regression.
However, from my research, the lavPredict() function for lavaan is NOT built to do this (the CRAN description for lavPredict() explicitly states: "the goal of this function is NOT to predict future values of dependent variables as in the regression framework!"; also, here). My understanding is that the regular predict() function will also call lavPredict() for lavaan objects. Though it seems like there were plans to update lavPredict() to be able to conduct regression-style predictions (see here), I cannot find any documentation of these suggested updates being implemented.
Is there any way to predict (in a regression sense) values for my dependent variables given values of my independent variables from a new dataset using my lavaan path model (as described above)?
Apologies in advance if I somehow missed the answer to this question elsewhere. Thank you so much!

How to find overall significance for main effects in a dummy interaction using anova()

I run a Cox Regression with two categorical variables (x1 and x2) and their interaction. I need to know the significance of the overall effect of x1, x2 and of the interaction.
The overall effect of the interaction:
I know how to find out the overall effect of the interaction using anova():
library(survival)
fit_x1_x2 <- coxph(Surv(time, death) ~ x1 + x2 , data= df)
fit_full <- coxph(Surv(time, death) ~ x1 + x2 + x1:x2, data= df)
anova(fit_x1_x2, fit_full)
But how are we supposed to use anova() to find out the overall effect of x1 or x2? What I tried is this:
The overall effect of x1
fit_x2_ia <- coxph(Surv(time, death) ~ x2 + x1:x2, data= df)
fit_full <- coxph(Surv(time, death) ~ x1 + x2 + x1:x2, data= df)
anova(fit_x2_ia, fit_full)
The overall effect of x2
fit_x1_ia <- coxph(Surv(time, death) ~ x1 + x1:x2, data= df)
fit_full <- coxph(Surv(time, death) ~ x1 + x2 + x1:x2, data= df)
anova(fit_x1_ia, fit_full)
I am not sure whether this is how we are supposed to use anova(). The fact that the output shows degree of freedom is zero makes me sceptical. I am even more puzzled that both times, for the overall effect of x1 and x2, the test is significant, although the log likelihood values of the models are the same and the Chi value is zero.
Here is the data I used
set.seed(1) # make it reproducible
df <- data.frame(x1= rnorm(1000), x2= rnorm(1000)) # generate data
df$death <- rbinom(1000,1, 1/(1+exp(-(1 + 2 * df$x1 + 3 * df$x2 + df$x1 * df$x2)))) # dead or not
library(tidyverse) # for cut_number() function
df$x1 <- cut_number(df$x1, 4); df$x2 <- cut_number(df$x2, 4) # make predictors to groups
df$time <- rnorm(1000); df$time[df$time<0] <- -df$time[df$time<0] # add survival times
The two models you have constructed for "overall effect" do not really appear to satisfy the statistical property of being hierarchical, i.e. properly nested. Specifically, if you look at the actual models that get constructed with that code you should see that they are actually the same model with different labels for the two-way crossed effects. In both cases you have 15 estimated coefficients (hence zero degrees of freedom difference) and you will note that the x1 parameter in the full model has the same coefficient as the x2[-3.2532,-0.6843):x1[-0.6973,-0.0347) parameter in the "reduced" model looking for an x1-effect, namely 0.19729. The crossing operator is basically filling in all the missing cells for the main effects with interaction results.
There really is little value in looking at interaction models without all of the main effects if you want to stay within the bounds of generally accepted statistical practice.
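The "same model, different labels" point can be checked on the design matrices alone, without fitting anything. The sketch below is an illustration under assumptions: it uses base R's cut() with quartile breaks instead of cut_number(), regenerates its own data, and the helper name quartile_groups is made up. model.matrix() keeps an intercept column that coxph() drops, so 16 columns here correspond to the 15 Cox coefficients.

```r
set.seed(1)
n <- 1000

# Hypothetical helper: split a numeric vector into four quartile groups
quartile_groups <- function(v) {
  cut(v, breaks = quantile(v, probs = 0:4 / 4), include.lowest = TRUE)
}

d <- data.frame(x1 = quartile_groups(rnorm(n)),
                x2 = quartile_groups(rnorm(n)))

# Design matrices of the "full" model and the supposedly "reduced" model
X_full <- model.matrix(~ x1 + x2 + x1:x2, data = d)
X_red  <- model.matrix(~ x2 + x1:x2, data = d)

# Same number of columns in both
print(c(full = ncol(X_full), reduced = ncol(X_red)))

# If stacking the two matrices side by side does not raise the rank,
# both span the same column space, i.e. they describe the same model
rank_full <- qr(X_full)$rank
rank_both <- qr(cbind(X_full, X_red))$rank
print(c(rank_full = rank_full, rank_both = rank_both))
```

Because neither model is a restriction of the other, a likelihood comparison between them has nothing to test, which is why anova() reports zero degrees of freedom.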
If you type:
fit_full
... you should get a summary of the model that has p-values for x1 levels, x2 levels, and the interaction levels. Because you chose to categorize these by four arbitrary cutpoints each you will end up with a total of 15 parameter estimates. If instead you made no cuts and modeled the linear effects and the linear-by-linear interaction, you could get three p-values directly. I'm guessing there was suspicion that the effects were not linear, and if so I would think a cubic spline model might be more parsimonious and distort the biological reality less than discretization into 4 disjoint levels. If you thought the effects might be non-linear but ordinal, there is an ordinal version of factor classed variables, but the results are generally confusing to the uninitiated.
The answer from 42- is informative, but after reading it I still did not know how to determine the three p-values or whether this is possible at all. So I talked to a professor of biostatistics at my university. His answer was quite simple, and I share it in case others have similar questions.
In this design it is not possible to determine the three p-values for the overall effects of x1, x2 and their interaction. If we want to know the p-values of the three overall effects, we need to keep the continuous variables as they are. Breaking up the variables into groups answers a different question, hence we cannot test the hypothesis of the overall effects no matter which statistical model we use.
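As a rough illustration of the "keep the variables continuous" route, the sketch below refits the Cox model on uncut predictors. It regenerates data along the lines of the question (the data frame name d and the object fit_cont are made up here) and assumes that linear effects and a linear-by-linear interaction are adequate, which should be checked on real data.

```r
library(survival)

set.seed(1)
n <- 1000

# Simulated data in the spirit of the question, predictors left continuous
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$death <- rbinom(n, 1, plogis(1 + 2 * d$x1 + 3 * d$x2 + d$x1 * d$x2))
d$time  <- abs(rnorm(n))

# x1 * x2 expands to x1 + x2 + x1:x2
fit_cont <- coxph(Surv(time, death) ~ x1 * x2, data = d)

# One Wald p-value per term: x1, x2 and the interaction
p_values <- summary(fit_cont)$coefficients[, "Pr(>|z|)"]
print(p_values)
```

With an interaction in the model, the x1 and x2 rows describe each effect when the other variable equals zero, so centering the predictors first may make them easier to interpret.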

Finding assigned importance to variable inside Prophet model?

I am building datasets and training unique models for combinations of x1, x2, x3. Think:
prophet1 <- fit.prophet(data.frame(ds, y, x1))
prophet2 <- fit.prophet(data.frame(ds, y, x2, x3))
prophet3 <- fit.prophet(data.frame(ds, y, x3))
I am then setting x1, x2, x3 to zero for each of the models, and evaluating its effect on y had that variable not been introduced. My question is: is there any way to tell from the model object whether x1 in prophet1 contributed more than x2+x3 in prophet2 without explicitly predicting the dataframe? I.e., can we tell whether setting x1 to zero changes y more than setting x2+x3 to zero does, just by looking at the generated model? Does x1 have a higher regression coefficient than x2+x3 and, as such, change y more?
I was digging around and found this:
model$param$k          # Base trend growth rate
model$param$m          # Trend offset
model$param$sigma_obs  # Observation noise
model$param$beta       # Regressor coefficients
Source: https://github.com/facebook/prophet/issues/501
If I were to place x1, x2, and x3 in the same dataframe and evaluate y, I could evaluate this coefficient by looking at the beta values. However, I don't know how to find this out if they are in separate dataframes across different models.
But plotting the sum(beta), k, m, or sigma_obs against difference between y and predictions had the variable set to zero did not yield me any relationship at all. Is it possible to extract out how important the variables used to model y from a prophet model are/ whether Prophet believes the effect is positive/negative? If so; how can I do so?

R - multiple linear regression, generating an extra variable with a specific regression coefficient

I want to run a simulated power analysis for a dataset I have. Let's assume the dataset has four variables (column names of the dataset):
Y - which is the dependent variable, is continuous and normally distributed.
X1 - an independent variable, is continuous and has a normal distribution.
X2 - an independent variable, is continuous and is NOT normally distributed.
X3 - an independent variable, is continuous and is NOT normally distributed.
Now, this data consists of 5000 rows, so there are 5000 entries.
I've run a linear regression using the following formula:
summary(lm( Y ~ X1 + X2 + X3)), and determined the regression coefficients of X1, X2, and X3 to be B1, B2, and B3 respectively.
I now have a fifth variable (X4) which I don't have access to but which I believe is normally distributed. The linear model can then be updated using the following formula:
lm(Y ~ X1 + X2 + X3 + X4), with the regression coefficient of B4.
I don't know what B4 is, but I have various scenarios where B4 is between 0.2 and 0.5.
I want to run Monte Carlo simulations to check what sample size is required to achieve 80% power at various values of B4. To do this, I need to generate a normally distributed variable that can simulate X4 and has a regression coefficient of B4. Is there any way to generate this in R?
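One possible way to set this up is sketched below, under strong assumptions that are mine rather than the question's: X4 is standard normal and independent of X1 to X3, and its effect is injected by adding B4 * X4 to the observed Y (which also inflates the variance of Y). The data frame dat is a simulated stand-in for the real 5000-row data set, and the function name power_for is made up.

```r
set.seed(42)

# Hypothetical stand-in for the real data (X2, X3 deliberately non-normal)
n_full <- 5000
dat <- data.frame(X1 = rnorm(n_full), X2 = rexp(n_full), X3 = runif(n_full))
dat$Y <- 1 + 0.5 * dat$X1 + 0.3 * dat$X2 - 0.2 * dat$X3 + rnorm(n_full)

# Estimated power to detect X4 at sample size n when its coefficient is b4
power_for <- function(dat, b4, n, n_sim = 200, alpha = 0.05) {
  hits <- replicate(n_sim, {
    s <- dat[sample(nrow(dat), n, replace = TRUE), ]  # resample n rows
    s$X4 <- rnorm(n)           # assumed: standard normal, independent of X1..X3
    s$Y4 <- s$Y + b4 * s$X4    # inject an X4 effect of size b4
    fit <- lm(Y4 ~ X1 + X2 + X3 + X4, data = s)
    summary(fit)$coefficients["X4", "Pr(>|t|)"] < alpha
  })
  mean(hits)  # share of simulations in which X4 was significant
}

# Power over a small grid of effect sizes and sample sizes
grid <- expand.grid(b4 = c(0.2, 0.5), n = c(50, 100, 200))
grid$power <- mapply(function(b, n) power_for(dat, b, n), grid$b4, grid$n)
print(grid)
```

The required sample size for a given B4 is then the smallest n in the grid whose estimated power reaches 0.8. A finer grid and a larger n_sim give a steadier estimate. If X4 is expected to correlate with the other predictors, it would have to be generated accordingly, which this sketch does not do.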

Regression from error term to dependent variable (lavaan)

I want to test a structural equation model (SEM). There are 3 indicators, IV1 to IV3, that make up a latent construct LC. This construct should explain a dependent variable DV.
Now, assume that unique variance of the indicators will contribute additional explanation to the DV. Something like this:
IV1 ↖
IV2 ← LC → DV
IV3 ↙      ↑
 ↑         │
e3 ────────┘
In lavaan the error terms/residuals of IV3, e3, are usually not written:
model = '
# latent variables
LC =~ IV1 + IV2 + IV3
# regression
DV ~ LC
'
Further, the residual of IV3 must be split into a component that contributes to explaining DV, and a residual of the residual.
I do not want to explain DV directly by IV3, because it's my goal to show how much unique explanation IV3 can contribute to DV. I want to maximize the path IV3 → LC → DV, and then put the residual into IV3 → DV.
Question:
How do I put this down in a SEM?
Bonus question:
Does it make sense from a SEM perspective that each of the IVs has such a path to DV?
Side note:
What I already did, was to compute this traditionally, using a series of computations. I:
Computed a counterpart to LC, the average of IV1 to IV3
Did 3 regressions IVx → LC
Did a multiple regression of the IVx residuals on DV.
Removing the common variance seems to make one of the residuals superfluous, so the regression model cannot estimate each of the residuals, but skips the last one.
For your question:
How do I put this down in a SEM model? Is it possible at all?
The answer, I think, is yes--at least if I understand you correctly.
If what you want to do is predict an outcome using a latent variable and the unique variance of one of its indicators, this can be easily accomplished in lavaan. See example code below: the first example involves predicting an outcome from a latent variable alone, whereas the second example predicts the same outcome from the same latent variable as well as the unique variance of one of the indicators of that latent variable:
# Load lavaan and use the HolzingerSwineford1939 data set
library(lavaan)
dat <- HolzingerSwineford1939
# Model 1: x4 predicted by the latent variable (visual)
model1 <- '
visual =~ x1 + x2 + x3
x4 ~ visual
'
# Fit model 1 and get fit measures and R-squared estimates
fit1 <- cfa(model1, data = dat, std.lv = TRUE)
summary(fit1, fit.measures = TRUE, rsquare = TRUE)
# Model 2: x4 predicted by the latent variable (visual) and the residual of x3
model2 <- '
visual =~ x1 + x2 + x3
x4 ~ visual + x3
'
# Fit model 2 and get fit measures and R-squared estimates
fit2 <- cfa(model2, data = dat, std.lv = TRUE)
summary(fit2, fit.measures = TRUE, rsquare = TRUE)
Notice that the R-squared for x4 (the hypothetical outcome) is much larger when predicted by both the latent variable onto which x3 loads, and x3's unique variance.
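One way to see why the coefficient on x3 in model 2 can be read as the effect of x3's unique variance is to substitute the measurement equation for x3 into the regression. This is only a sketch with intercepts omitted. The symbols are labels introduced here, not lavaan output: λ3 is the loading of x3, e3 its residual, β and γ the two regression coefficients, and ζ the disturbance of x4.

```latex
\begin{aligned}
x_3 &= \lambda_3\,\mathrm{visual} + e_3 \\
x_4 &= \beta\,\mathrm{visual} + \gamma\,x_3 + \zeta \\
    &= (\beta + \gamma\lambda_3)\,\mathrm{visual} + \gamma\,e_3 + \zeta
\end{aligned}
```

So the x3 coefficient γ is the path from the residual e3 to x4, while the total effect of the latent variable is β + γλ3 rather than β alone.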
As for your second question:
Bonus question: Does that make sense? And even more: Does it make sense from a SEM view (theoretically it does) that each of the independent variables has such a path to DV?
It can make sense, in some cases, to specify such paths, but I would not do so in the absence of strong theory. For example, perhaps you think a variable is a weak, but theoretically important indicator of a greater latent variable--such as the experience of "awe" is for "positive affect". But perhaps your investigation isn't interested in the latent variable, per se--you are interested in the unique effects of awe for predicting something above and beyond its manifestation as a form of positive affect. You might therefore specify a regression pathway from the unique variance of awe to the outcome, in addition to the pathway from positive affect to the outcome.
But could/should you do this for each of your variables? Well, no, you couldn't. As you can see, this particular case only has one remaining degree of freedom, so the model is on the verge of being under-identified (and would be, if you specified the remaining two possible paths from the unique variances of x1 and x2 to the outcome of x4).
Moreover, I think many would be skeptical of your motivation for attempting to specify all these pathways. Modelling the pathway from the latent variable to the outcome allows you to speak to a broader process; what would you learn by modelling each and every pathway from unique variance to outcome? Sure, you might be able to say, "Well the remaining "stuff" in this variable predicts x4!"...but what could you say about the nature of that "stuff"? It's just isolated manifest variance. Instead, I think you would be on stronger theoretical ground to consider additional common factors that might underlie the remaining variance of your variables (e.g., method factors); this would add more conceptual specificity to your analyses.
