generating errors with heteroscedasticity - r

I have a question regarding generating errors with heteroscedasticity
This is how my friend told me to do it:
n <- 30
x1 <- rnorm(n,0,1) # 1st predictor
x2 <- rnorm(n,0,1) # 2nd predictor
e <- rnorm(n,0,x1^2) # errors with heteroscedasticity (sd = x1^2)
b1 <- 0.5; b2 <- 0.5
y <- x1*b1+x2*b2+e
For me, e <-rnorm(n,0,x1^2) -- this is autocorrelation rather than heteroscedastic error distribution.
But my friend said this is the correct way to generate errors with heteroscedasticity.
Am I missing something here?
I thought heteroscedasticity occurs when the variance of the error terms differs across observations.
Does e<-rnorm(n,0,x1^2) this syntax generate errors with heteroscedasticity correctly?
If not, could anyone tell me how to generate errors with heteroscedasticity?

This specification does generate a particular (slightly odd) kind of heteroscedasticity. You define heteroscedasticity as "the variance of the error term differ[ing] across observations". Since the value of x1 is different for different observations, and you have chosen your error values with a standard deviation of x1^2, the variances will be different for different observations.
Note that:
rnorm() specifies variability in terms of the standard deviation rather than the variance, so the error variance here is x1^4, not x1^2.
Autocorrelation refers to non-independence between (successive) observations. rnorm() chooses independent deviates, so this specification doesn't constitute an autocorrelated sample.
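Both points can be checked with a quick simulation. This is only a minimal sketch: the larger n, the seed, the use of rank correlations as a diagnostic, and the sd = |x1| alternative at the end are my own assumptions, not part of the question.

```r
set.seed(1)                      # arbitrary seed, chosen for reproducibility only
n  <- 1000                       # larger than the question's n = 30 so the pattern is visible
x1 <- rnorm(n)
x2 <- rnorm(n)
e  <- rnorm(n, mean = 0, sd = x1^2)   # third argument is the sd, so Var(e_i | x1_i) = x1_i^4
y  <- 0.5 * x1 + 0.5 * x2 + e

fit <- lm(y ~ x1 + x2)
r   <- resid(fit)

# Heteroscedasticity: the size of the residuals grows with |x1|
het_cor <- cor(abs(r), abs(x1), method = "spearman")

# No autocorrelation: each error is unrelated to the one before it
lag1_cor <- cor(e[-1], e[-n], method = "spearman")

# A less unusual specification (assumption): sd proportional to |x1|
e_alt <- rnorm(n, mean = 0, sd = abs(x1))

c(het_cor = het_cor, lag1_cor = lag1_cor)
```

A clearly positive het_cor together with a lag1_cor near zero is what heteroscedastic but independent errors should look like.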

Is there a way to force the coefficient of the independent variable to be a positive coefficient in the linear regression model used in R?

In lm(y ~ x1 + x2 + x3 + ... + xn), not all independent variables should have positive coefficients.
For example, we know that x1 to x5 must have positive coefficients and x6 to x10 must have negative coefficients.
However, when lm(y ~ x1 + x2 + x3 + ... + x10) is run in R, the result is that some of x1 to x5 have negative coefficients and some of x6 to x10 have positive coefficients.
I want to control this within a linear regression method. Is there a good way?
The sign of a coefficient may change depending upon the correlation of its variable with the other predictors. As @TarJae noted, this seems like an example of (or counterpart to?) Simpson's Paradox, which describes cases where the sign of a correlation might reverse depending on whether we condition on another variable.
Here's a concrete example in which I've made two independent variables, x1 and x2, which are both highly correlated to y, but when they are combined the coefficient for x2 reverses sign:
# specially chosen seed; most seeds' result isn't as dramatic
set.seed(410)
df1 <- data.frame(y = 1:10,
                  x1 = rnorm(10, 1:10),
                  x2 = rnorm(10, 1:10))
lm(y ~ ., df1)
Call:
lm(formula = y ~ ., data = df1)
Coefficients:
(Intercept) x1 x2
-0.2634 1.3990 -0.4792
This result is not incorrect, but arises here (I think) because the prediction errors from x1 happen to be correlated with the prediction errors from x2, such that a better prediction is created by subtracting some of x2.
EDIT, additional analysis:
The more independent series you have, the more likely you are to see this phenomenon arise. For my example with just two series, only 2.4% of the integer seeds from 1 to 1000 produce this phenomenon, where one of the series produces a negative regression coefficient. This increases to 16% with three series, 64% of the time with five series, and 99.9% of the time with 10 series.
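A seed sweep of this kind could be sketched as follows. The helper name neg_share, the smaller range of 200 seeds, and the way the series are generated with replicate() are my own assumptions, so the shares will not necessarily match the percentages quoted above.

```r
# Share of seeds for which at least one slope in lm(y ~ .) comes out negative,
# with k noisy copies of 1:10 used as predictors of y = 1:10.
neg_share <- function(k, seeds = 1:200) {
  flips <- vapply(seeds, function(s) {
    set.seed(s)
    X <- replicate(k, rnorm(10, mean = 1:10))   # 10 x k matrix of predictors
    d <- data.frame(y = 1:10, X)
    any(coef(lm(y ~ ., data = d))[-1] < 0, na.rm = TRUE)  # [-1] drops the intercept
  }, logical(1))
  mean(flips)
}

p2 <- neg_share(2)   # two series
p5 <- neg_share(5)   # five series
c(two_series = p2, five_series = p5)
```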
Constraints
Possibilities include using:
nls with algorithm = "port" in which case upper and lower bounds can be specified.
nnnpls in the nnls package, which lets each coefficient be constrained to be either non-negative or non-positive (a lower or upper bound of 0), or nnls in the same package if all coefficients should be non-negative.
bvls (bounded-variable least squares) in the bvls package, specifying the bounds.
There is an example of performing non-negative least squares in the vignette of the CVXR package.
Reformulate it as a quadratic programming problem (see Wikipedia for the formulation) and use the quadprog package.
nnls in the limSolve package. Negate the columns that should have negative coefficients to convert it to a non-negative least squares problem.
These packages mostly do not have a formula interface but instead require that a model matrix and dependent variable be passed as separate arguments. If df is a data frame containing the data and if the first column is the dependent variable then the model matrix can be calculated using:
A <- model.matrix(~., df[-1])
and the dependent variable is
df[[1]]
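As an illustration of the first option, here is a minimal sketch that refits the two-predictor example with both slopes bounded below by 0. The parameter names b0, b1, b2 and the starting values are assumptions made for this example; nls itself is in base R (stats), so no extra package is needed.

```r
set.seed(410)   # same seed as the sign-reversal example
df1 <- data.frame(y  = 1:10,
                  x1 = rnorm(10, 1:10),
                  x2 = rnorm(10, 1:10))

# Linear model written out with explicit parameters so that bounds can be attached
fit_c <- nls(y ~ b0 + b1 * x1 + b2 * x2,
             data      = df1,
             start     = list(b0 = 0, b1 = 0.5, b2 = 0.5),  # assumed starting values
             algorithm = "port",
             lower     = c(-Inf, 0, 0))   # intercept free, both slopes >= 0

coef(fit_c)
```

When a bound is active, the corresponding coefficient sits exactly on it, so the usual standard errors for that coefficient should not be taken at face value.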
Penalties
Another approach is to add a penalty to the least squares objective function, i.e. the objective function becomes the sum of the squares of the residuals plus one or more additional terms that are functions of the coefficients and tuning parameters. Although doing this does not impose any hard constraints to guarantee the desired signs it may result in the correct signs anyways. This is particularly useful if the problem is ill conditioned or if there are more predictors than observations.
linearRidge in the ridge package will minimize the sum of the square of the residuals plus a penalty equal to lambda times the sum of the squares of the coefficients. lambda is a scalar tuning parameter which the software can automatically determine. It reduces to least squares when lambda is 0. The software has a formula method which along with the automatic tuning makes it particularly easy to use.
glmnet adds penalty terms containing two tuning parameters. It includes least squares and ridge regression as special cases. It also supports bounds on the coefficients. There are facilities to automatically set the two tuning parameters, but it does not have a formula method and the procedure is not as straightforward as in the ridge package. Read the vignettes that come with it for more information.
1- One way is to define an optimization problem and minimize the mean squared error subject to constraints and bounds (nlminb, optim, etc.).
2- Another is to use the "lavaan" package, as follows:
https://stats.stackexchange.com/questions/96245/linear-regression-with-upper-and-or-lower-limits-in-r
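The first option could be sketched with optim and box constraints. The simulated data, the function names sse and sse_grad, and the particular sign pattern (x1 non-negative, x2 non-positive) are assumptions for illustration only.

```r
set.seed(1)
n <- 100
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 0.05 * d$x1 - 0.05 * d$x2 + rnorm(n)   # weak true effects, so signs may flip

X <- model.matrix(~ x1 + x2, d)   # columns: (Intercept), x1, x2
y <- d$y

# Objective: sum of squared residuals, and its gradient
sse      <- function(b, X, y) sum((y - X %*% b)^2)
sse_grad <- function(b, X, y) drop(-2 * crossprod(X, y - X %*% b))

fit_box <- optim(par = rep(0, ncol(X)), fn = sse, gr = sse_grad,
                 X = X, y = y, method = "L-BFGS-B",
                 lower = c(-Inf, 0, -Inf),   # x1 coefficient >= 0
                 upper = c( Inf, Inf, 0))    # x2 coefficient <= 0

setNames(fit_box$par, colnames(X))
```

Minimizing the sum of squares gives the same coefficients as minimizing the mean squared error, since the two differ only by the constant factor 1/n.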

How to find RMSE value? and What is good RMSE value?

I am forecasting electrical power output and have several data sets ranging from 200 to 4000 observations. I have computed the forecasts, but I do not know how to calculate the RMSE and R (the correlation coefficient) in R. I tried to calculate it in Excel and the result for the RMSE was 0.0078. So I basically have two questions.
How do I calculate the RMSE and the R value in R?
What is a good RMSE value? Is 0.007 a good value?
Here are two functions: one computes the MSE, and the second calls the first and takes the square root to give the RMSE.
These functions accept a fitted model, not a data set, for instance the output of lm, glm, and many others.
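# Mean squared error of a fitted model's residuals
mse <- function(x, na.rm = TRUE, ...){
  e <- resid(x)
  mean(e^2, na.rm = na.rm)  # honour the na.rm argument instead of hard-coding TRUE
}
# Root mean squared error: square root of the MSE
rmse <- function(x, ...) sqrt(mse(x, ...))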
Like I said in a comment to the question, a value is not good on its own, it's good when compared to others obtained from other fitted models.
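For a concrete run, here is a small sketch using the built-in cars data set, which is an assumption made only to have a model to fit. It also computes R as the Pearson correlation between observed and fitted values.

```r
# Same helpers as above, repeated so the example is self-contained
mse  <- function(x, na.rm = TRUE, ...) mean(resid(x)^2, na.rm = na.rm)
rmse <- function(x, ...) sqrt(mse(x, ...))

fit <- lm(dist ~ speed, data = cars)   # any fitted model with a resid() method works

rmse_val <- rmse(fit)                      # in the units of the response (dist)
r_val    <- cor(cars$dist, fitted(fit))    # Pearson correlation, observed vs fitted

c(RMSE = rmse_val, R = r_val)
```

For a least-squares fit with an intercept, r_val^2 equals the R-squared reported by summary(fit).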
Root Mean Square Error (RMSE) is the standard deviation of the prediction errors (residuals). Prediction errors are a measure of how far data points are from the regression line; RMSE is a measure of how spread out these residuals are. In other words, it tells you how concentrated the data is around the line of best fit. Root mean square error is commonly used in climatology, forecasting, and regression analysis to verify experimental results.
The formula is:
$$\mathrm{RMSE} = \sqrt{\overline{(f - o)^2}}$$
Where:
f = forecasts (expected values or unknown results),
o = observed values (known results).
The bar above the squared differences is the mean (similar to x̄). The same formula can be written with the following, slightly different, notation:
$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{N} (z_{f_i} - z_{o_i})^2}{N}}$$
Where:
Σ = summation ("add up"),
(z_{f_i} - z_{o_i})^2 = differences, squared,
N = sample size.
You can use whichever form you want, as both express the same quantity. The "R" that you are referring to is the Pearson correlation coefficient, which measures the strength of the linear relationship between forecasts and observations; its square gives the proportion of variance explained.
As for question 2, whether an RMSE value is good always depends on the range (upper and lower bounds) of the values being predicted. On a given scale, a smaller RMSE means less prediction error.
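When you have plain vectors of forecasts and observations rather than a fitted model, the formula can be applied directly. The function names rmse_vec and nrmse, the toy numbers, and the choice to normalize by the range of the observations are assumptions for illustration; normalizing is one common way to judge an RMSE such as 0.0078 against the scale of the data.

```r
# RMSE from forecasts f and observations o
rmse_vec <- function(f, o) sqrt(mean((f - o)^2))

# RMSE expressed as a fraction of the observed range (one possible normalization)
nrmse <- function(f, o) rmse_vec(f, o) / (max(o) - min(o))

f <- c(2, 4, 6)   # toy forecasts
o <- c(1, 4, 8)   # toy observations

rmse_vec(f, o)    # differences 1, 0, -2, so this is sqrt(5/3), about 1.29
nrmse(f, o)       # the same error relative to the range of o (7)
cor(f, o)         # Pearson R between forecasts and observations
```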

Model Syntax for Simple Moderation Model in Lavaan (with bootstrapping)

I am a social scientist currently running a simple moderation model in R, in the form of y ~ x + m + m * x. My moderator is a binary categorical variable (two separate groups).
I started out with lm(), bootstrapped estimates with boot() and obtained bca confidence intervals with boot.ci. Since there is no automated way of doing this for all parameters (at my coding level at least), this is a bit tedious. However, I now saw that the lavaan package offers bootstrapping as part of the regular sem() function, and also bca CIs as part of parameterEstimates(). So, I was wondering (since I am using lavaan in other analyses) whether I could just replace lm() with lavaan for the sake of keeping my work more consistent.
Doing this, I was wondering about what the equivalent model for lavaan would be to test for moderation in the same way. I saw this post where Jeremy Miles proposes the code below, which I follow mostly.
mod.1 <- "
y ~ c(a, b) * x
y ~~ c(v1, v1) * y # This step needed for exact equivalence
y ~ c(int1, int2) * 1
modEff := a - b
mEff := int1 - int2"
But it would be great if you could help me figure out some final things.
1) What does the y ~~ c(v1, v1) * y part mean and why is it needed for "exact equivalence" to the lm model? From the output it seems this constrains the variances of the outcome for both groups to the same value?
2) From the post, am I right to understand that either including the interaction effect as calculated above OR constraining (only) the slope between models and looking at model fit with anova() would be the same test for moderation?
3) The lavaan page says that adding test = "bootstrap" to the sem() function allows for bootstrap-adjusted p-values. However, I read a lot about p-values conflicting with the bca CIs at times, and this has happened to me. Searching around, I understand that this conflict comes from the assumptions for the distribution of the data under the H0 for p-values, but not for CIs (which just give the range of most likely values). I was therefore wondering what it exactly means that the p-values given here are "bootstrap-adjusted"? Is it technically more true to report these for my SEM models than the CIs?
Many questions, but I would be very grateful for any help you can provide.
Best,
Alex
I think I can answer at least Nr. 1 and 2 of your questions but it is probably easier to not use SEM and instead program a function that conveniently gives you CIs for all coefficients of your model.
So first, to answer your questions:
1) What is proposed in the code you gave is called multigroup comparison. Essentially this means that you fit the same SEM to two different groups of cases in your dataset. It is equivalent to a moderated regression with a binary moderator because in both cases you get two slopes (often called "simple slopes") for the scalar predictor, one slope per group of the moderator.
Now, in your lavaan code you only see the scalar predictor x. The binary moderator is implied by group="m" when you fit the model with fit.1 <- sem(mod.1, data = df, group = "m") (took this from the page you linked).
The two-element vectors (c( , )) in the lavaan code specify named parameters for the first and second group, respectively. By y ~~ c(v1, v1) * y, the residual variances of y are set equal in both groups because they have the same name. In contrast, the slopes c(a, b) and the intercepts c(int1, int2) are allowed to vary between groups.
2) Yes. If you use the SEM, you would fit the model a second time with the constraint a == b added and compare this model to the first version, where the slopes can differ. This is essentially the same as comparing lm() models with and without the interaction term (x:m, which x*m adds) in the formula.
3) Here I cannot provide a direct answer to your question. I suspect if you want BCa CIs as you would get from applying boot.ci to an lm model fit, this might not be implemented. In the lavaan documentation BCa confidence intervals are only mentioned once: in the section about the parameterEstimates function, which can also perform bootstrap (see p. 89). However, it does not produce actual BCa (bias-corrected and accelerated) CIs but only bias-corrected ones.
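The lm() side of the comparison in point 2 can be sketched as below. The simulated data, the effect sizes and the object names are assumptions; the point is only that the nested-model comparison and the test of the interaction coefficient are the same test when the moderator is binary.

```r
set.seed(2024)                       # arbitrary seed
n <- 200
m <- rep(c(0, 1), each = n / 2)      # binary moderator (two groups)
x <- rnorm(n)
y <- 1 + 0.5 * x + 0.3 * m + 0.8 * x * m + rnorm(n)   # slope differs between groups
dat <- data.frame(y, x, m = factor(m))

fit_main <- lm(y ~ x + m, data = dat)   # common slope in both groups
fit_int  <- lm(y ~ x * m, data = dat)   # slope allowed to differ (adds x:m)

cmp <- anova(fit_main, fit_int)         # F test for the added interaction
p_f <- cmp[2, "Pr(>F)"]
p_t <- coef(summary(fit_int))["x:m1", "Pr(>|t|)"]   # t test of the interaction term

c(anova_p = p_f, coef_p = p_t)
```

With a single interaction term the F statistic is the square of the t statistic, so the two p-values coincide.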
As mentioned above, I guess the simplest solution would be to use lm() and either repeat the boot.ci procedure for each coefficient or write a wrapper function that does this for you. I suggest this also because a reviewer may be quite puzzled to see you do multigroup SEM instead of a simple moderated regression, which is much more common.
You probably did something like this already:
lm_fit <- function(dat, idx) coef( lm(y ~ x*m, data=dat[idx, ]) )
bs_out <- boot::boot(mydata, statistic=lm_fit, R=1000)
ci_out <- boot::boot.ci(bs_out, conf=.95, type="bca", index=1)
Now, either you repeat the last line for each coefficient, i.e., varying index from 1 to 4. Or you get fancy and let R do the repeating with a function like this:
all_ci <- function(bs) {
  est <- bs$t0
  lower <- vector("numeric", length(bs$t0))
  upper <- lower
  for (i in seq_along(bs$t0)) {
    # the last two entries of the bca row are the lower and upper limits
    ci <- tail(boot::boot.ci(bs, type="bca", index=i)$bca[1,], 2)
    lower[i] <- ci[1]
    upper[i] <- ci[2]
  }
  cbind(est, lower, upper)
}
all_ci(bs_out)
I am sure this could be written more concisely but it should work fine for bootstraps of simple lm() models.

Derive standard error of a transformed variable in linear regression

I would like to calculate the standard error of a transformed variable from my linear regression, i.e. divide two variables and get the standard error from this variable.
I use the deltamethod function from the msm package, but fail to get accurate standard errors.
For example:
Simulation of data:
library(data.table) # data.table()
library(msm)        # deltamethod()
set.seed(123)
nobs = 1000
data <- data.table(
  x1 = rnorm(nobs),
  x2 = rnorm(nobs),
  x3 = rnorm(nobs),
  x4 = rnorm(nobs),
  y = rnorm(nobs))
Linear regression:
reg2 <- lm(y~x1+x2+x3+x4, data=data)
Get the coef and vcov (Here I need to get rid of the missings, as some coefficients in my real data are NA and I calculate a lot of regressions in loop)
vcov_reg <- vcov(reg2)
coef_reg <- coef(reg2)
coef_reg <- na.omit(coef_reg)
coef_reg <- as.numeric(coef_reg)
Deltamethod, for the variable x1 divided by x3 (meaning I should use x2 and x4 according to the msm package, since the intercept is the first coefficient):
deltamethod(~ x2/x4, coef_reg, vcov_reg)
This gives me a standard error of the transformed variable (x1/x3) of 3.21, while all standard errors from this regression are around 0.03.
Any ideas why, or what's wrong here?
Other suggestions to calculate it are also welcome.
There is nothing wrong with the result. In your example your data is centered at 0 so it shouldn't be too surprising that when dividing by the data that you end up with a large variance / standard error.
Note that your estimated coefficient for x3 is -0.017408626 so with a standard error of about 0.03 the CI for this coefficient crosses 0. And that's the thing we're dividing by. Hopefully that gives you some intuition for why the standard error seems to explode. For some evidence that this really is part of the issue consider x1/x2 instead.
> deltamethod(~ x2/x3, coef_reg, vcov_reg)
[1] 0.3752063
Which is much smaller, since the estimated coefficient for the denominator is bigger in this case (0.09).
But really there is nothing wrong with your code. It was just that your intuition was wrong. Alternative methods to estimate what you want would be to bootstrap or to use a Bayesian regression and look at the posterior distribution of the transformation.
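To see where the large value comes from, the first-order delta method for a ratio can be written out by hand in base R. This is a sketch under assumptions: the helper name ratio_se is hypothetical, and a plain data.frame replaces the question's data.table so that no packages are needed.

```r
set.seed(123)
nobs <- 1000
d <- data.frame(x1 = rnorm(nobs), x2 = rnorm(nobs), x3 = rnorm(nobs),
                x4 = rnorm(nobs), y  = rnorm(nobs))
fit <- lm(y ~ x1 + x2 + x3 + x4, data = d)
b <- coef(fit)
V <- vcov(fit)

# Delta method for g(b) = b[num] / b[den]:
# the gradient is 1/b[den] w.r.t. the numerator and -b[num]/b[den]^2 w.r.t. the
# denominator, and se = sqrt(grad' V grad).
ratio_se <- function(num, den) {
  g <- setNames(numeric(length(b)), names(b))
  g[num] <- 1 / b[den]
  g[den] <- -b[num] / b[den]^2
  sqrt(drop(t(g) %*% V %*% g))
}

se_13 <- ratio_se("x1", "x3")   # denominator coefficient close to zero
se_12 <- ratio_se("x1", "x2")   # denominator coefficient further from zero
c(x1_over_x3 = se_13, x1_over_x2 = se_12)
```

Both gradient terms divide by the denominator coefficient, so an estimate near zero inflates the standard error regardless of how precisely each coefficient is estimated on its own.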

same regression, different statistics (R v. SAS)?

I ran the same probit regression in SAS and R and while my coefficient estimates are (essentially) equivalent, the reported test statistics are different. Specifically, SAS reports test statistics as t-statistics whereas R reports test statistics as z-statistics.
I checked my econometrics text and found (with little elaboration) that it reports probit results in terms of t statistics.
Which statistic is appropriate? And why does R differ from SAS?
Here's my SAS code:
proc qlim data=DavesData;
model y = x1 x2 x3/ discrete(d=probit);
run;
quit;
And here's my R code:
> model.1 <- glm(y ~ x1 + x2 + x3, family=binomial(link="probit"))
> summary(model.1)
Just to answer a little bit - it's seriously off topic, question should be closed in fact - but neither the t-statistic nor the z-statistic are meaningful. They're both related though, as Z is just the standard normal distribution and T is an adapted "close-to-normal" distribution that takes into account the fact that your sample is limited to n cases.
Now, both the z and the t statistic provide the significance for the null hypothesis that the respective coefficient is equal to zero. The standard error on the coefficients, used for that test, is based on the residual error. Using the link function, you practically transform your response in such a way that the residuals become normal again, whereas in fact the residuals represent the difference between the observed and the estimated proportion. Due to this transformation, calculation of the degrees of freedom for the T-statistic isn't useful anymore and hence R assumes the standard normal distribution for the test statistic.
Both results are completely equivalent, R will just give slightly sharper p-values. It's a matter of debate, but if you look at proportion difference tests, they're also always done using the standard normal approximation (Z-test).
Which brings me back to the point that neither of these values actually has any meaning. If you want to know whether or not a variable has a significant contribution with a p-value that actually says something, you use a Chi-squared test like the Likelihood Ratio test (LR), Score test or Wald test. R just gives you the standard likelihood ratio, SAS also gives you the other two. But all three tests are essentially equivalent, if they differ seriously it's time to look again at your data.
E.g., in R:
anova(model.1,test="Chisq")
For SAS : see the examples here for use of contrasts, getting the LR, Score or Wald test
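A small simulated probit shows both kinds of test side by side in R. The data, coefficients and object names are assumptions; drop1() is used here as one way to get a likelihood ratio test for each term.

```r
set.seed(42)                         # arbitrary seed
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, pnorm(0.5 * x1 - 0.25 * x2))   # probit data-generating process

model.sim <- glm(y ~ x1 + x2, family = binomial(link = "probit"))

# Wald tests as printed by summary(): z statistics with standard normal p-values
wald <- coef(summary(model.sim))[, c("z value", "Pr(>|z|)")]

# Likelihood ratio tests, dropping each term in turn
lr <- drop1(model.sim, test = "Chisq")

wald
lr
```

The p-values in wald are exactly 2 * pnorm(-abs(z)), which is the normal reference distribution that R uses in place of a t distribution.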
