Estimation the constant in linear regression using Stochastic Gradient Descent - r

I am having trouble estimating the constant (intercept) of a multivariate linear regression model by using stochastic gradient descent optimization method (which means batch size equals to 1 of mini-batch gradient descent). The function in R that I use is:
StochasticGradientDescent <- function(data, alpha, iteration, epsilon){data <- matrix(unlist(data), ncol=ncol(data), byrow=FALSE)
independent.variable<- data[,1:ncol(data)-1]
dependent.variable<- data[,ncol(data)]
#add column of 1s for constant
independent.variable <- cbind(theta0 = 1, independent.variable)
theta_new <- matrix( 0, ncol = ncol(independent.variable))
theta_old <- matrix( 1, ncol = ncol(independent.variable))
#Cost function
CostFunction <- function (independent.variable, dependent.variable, theta){
1/(2*(NROW(dependent.variable))) * sum(((independent.variable %*% t(theta)) - dependent.variable)^2);
}
thetas <- vector( mode = "list", length = iteration )
thetas[[1]] <- theta_new
J <- numeric( length = iteration )
J[1] <- CostFunction(independent.variable, dependent.variable, theta_old )
derivative <- function(independent.variable, dependent.variable, theta){
idx <- sample.int(NROW(independent.variable), 1)
descent <- (t(independent.variable[idx, , drop = FALSE]) %*% ((independent.variable[idx, , drop = FALSE] %*% t(theta)) - dependent.variable[idx, drop = FALSE]))
return( t(descent) )
}
#stopping criterion
step <- 1
while(any(abs(theta_new - theta_old) > epsilon) & step <= iteration )
{
step <- step + 1
# gradient descent
theta_old <- theta_new
theta_new <- theta_old - alpha * derivative(independent.variable, dependent.variable, theta_old)
# record keeping
thetas[[step]] <- theta_new
J[step] <- CostFunction(independent.variable, dependent.variable, theta_new)
}
costs <- data.frame( costs = J )
theta <- data.frame( do.call( rbind, thetas ), row.names = NULL )
return( list( costs = costs, theta = theta))
}
I simulate an artificial data.
x1 <- runif(1000000,1,100);
x2 <- runif(1000000,1,200);
y <- 5+4*x1+3*x2;
QR decomposition of lm package gives this result:
fit <- lm(y ~ x1+x2);
summary(fit)
#
#Call:
# lm(formula = y ~ x1 + x2)
#
#Residuals:
# Min 1Q Median 3Q Max
#-7.386e-09 0.000e+00 0.000e+00 0.000e+00 9.484e-10
#
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 5.000e+00 2.162e-14 2.313e+14 <2e-16 ***
# x1 4.000e+00 2.821e-16 1.418e+16 <2e-16 ***
# x2 3.000e+00 1.403e-16 2.138e+16 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 8.062e-12 on 999997 degrees of freedom
#Multiple R-squared: 1, Adjusted R-squared: 1
#F-statistic: 3.292e+32 on 2 and 999997 DF, p-value: < 2.2e-16
My initial values for thetas are 0s. Learning rate is chosen to be 0.00005. Number of iterations is 5000. Stopping criteria which is epsilon here (a user-defined small value) is 0.000001. If the trained parameter’s difference between the two iteration is smaller than this value then the algorithm will stop. The result I get is given below:
data<- data.frame(cbind(x1, x2, y))
results <- StochasticGradientDescent( data = data, alpha = 0.00005, iteration = 5000, epsilon = .000001)
results$theta[ nrow(results$theta), ]
# theta0 V2 V3
#5001 0.2219142 4.04408 2.999861
As you can see estimates of coefficients are very close to actual ones. However, the coefficient estimation for theta0 (intercept/constant) is not even close. Besides, I get these values at the end of the cycle of iterations, which is not good. I cannot converge efficiently. I tried but I really could not figure out why this is the case. Could someone help me, please?

Related

Estimate and 95% CI for planned contrasts in R

Here's my problem: Why do the following procedures (classical version and custom function) for planned contrasts lead to different results for the estimate and the 95%CI? Please note that I copied the custom function from this website.
#classical version
data(mtcars)
#set Helmert contrasts
cyl2<-c(-1,1,0)
cyl1<-c(-1,-1,2)
mtcars$cyl<-factor(mtcars$cyl)
contrasts(mtcars$cyl) <-cbind(c1,c2)
classical<-summary.lm(aov(disp~cyl, mtcars))
#custom function (I want to use it because it includes results for equal AND unequal variances --> if the custom function is correct, results for equal variances should be the same as in the classical example):
oneway <- function(dv, group, contrast, alpha = .05) {
# -- arguments --
# dv: vector of measurements (i.e., dependent variable)
# group: vector that identifies which group the dv measurement came from
# contrast: list of named contrasts
# alpha: alpha level for 1 - alpha confidence level
# -- output --
# computes confidence interval and test statistic for a linear contrast of population means in a between-subjects design
# returns a data.frame object
# estimate (est), standard error (se), t-statistic (z), degrees of freedom (df), two-tailed p-value (p), and lower (lwr) and upper (upr) confidence limits at requested 1 - alpha confidence level
# first line reports test statistics that assume variances are equal
# second line reports test statistics that do not assume variances are equal
# means, standard deviations, and sample sizes
ms <- by(dv, group, mean, na.rm = TRUE)
vars <- by(dv, group, var, na.rm = TRUE)
ns <- by(dv, group, function(x) sum(!is.na(x)))
# convert list of contrasts to a matrix of named contrasts by row
contrast <- matrix(unlist(contrast), nrow = length(contrast), byrow = TRUE, dimnames = list(names(contrast), NULL))
# contrast estimate
est <- contrast %*% ms
# welch test statistic
se_welch <- sqrt(contrast^2 %*% (vars / ns))
t_welch <- est / se_welch
# classic test statistic
mse <- anova(lm(dv ~ factor(group)))$"Mean Sq"[2]
se_classic <- sqrt(mse * (contrast^2 %*% (1 / ns)))
t_classic <- est / se_classic
# if dimensions of contrast are NULL, nummer of contrasts = 1, if not, nummer of contrasts = dimensions of contrast
num_contrast <- ifelse(is.null(dim(contrast)), 1, dim(contrast)[1])
df_welch <- rep(0, num_contrast)
df_classic <- rep(0, num_contrast)
# makes rows of contrasts if contrast dimensions aren't NULL
if(is.null(dim(contrast))) contrast <- t(as.matrix(contrast))
# calculating degrees of freedom for welch and classic
for(i in 1:num_contrast) {
df_classic[i] <- sum(ns) - length(ns)
df_welch[i] <- sum(contrast[i, ]^2 * vars / ns)^2 / sum((contrast[i, ]^2 * vars / ns)^2 / (ns - 1))
}
# p-values
p_welch <- 2 * (1 - pt(abs(t_welch), df_welch))
p_classic <- 2 * (1 - pt(abs(t_classic), df_classic))
# 95% confidence intervals
lwr_welch <- est - se_welch * qt(p = 1 - (alpha / 2), df = df_welch)
upr_welch <- est + se_welch * qt(p = 1 - (alpha / 2), df = df_welch)
lwr_classic <- est - se_classic * qt(p = 1 - (alpha / 2), df = df_classic)
upr_classic <- est + se_classic * qt(p = 1 - (alpha / 2), df = df_classic)
# output
data.frame(contrast = rep(rownames(contrast), times = 2),
equal_var = rep(c("Assumed", "Not Assumed"), each = num_contrast),
est = rep(est, times = 2),
se = c(se_classic, se_welch),
t = c(t_classic, t_welch),
df = c(df_classic, df_welch),
p = c(p_classic, p_welch),
lwr = c(lwr_classic, lwr_welch),
upr = c(upr_classic, upr_welch))
}
#results for mtcars with and without Welch correction:
custom<-(with(mtcars,
oneway(dv = disp, group= cyl, contrast = list (cyl1=c(-1,-1,2), cyl2 =c(-1,1,0)))))
Now results are the same for p and t for the classical and the custom version, as expected (at least when equal_var = Assumed). But why are the estimate and the 95%CIs different?
> custom
contrast equal_var est se t df p lwr upr
1 cyl1 Assumed 417.74935 37.20986 11.226845 29.000000 4.487966e-12 341.64664 493.8521
2 cyl2 Assumed 78.17792 24.96113 3.131986 29.000000 3.945539e-03 27.12667 129.2292
3 cyl1 Not Assumed 417.74935 40.30748 10.364066 18.452900 3.985000e-09 333.21522 502.2835
4 cyl2 Not Assumed 78.17792 17.67543 4.422972 9.224964 1.566927e-03 38.34147 118.0144
> classical
Call:
aov(formula = disp ~ cyl, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-77.300 -30.586 -6.568 20.814 118.900
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 213.850 9.507 22.494 < 2e-16 ***
cyl1 69.625 6.202 11.227 4.49e-12 ***
cyl2 39.089 12.481 3.132 0.00395 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 51.63 on 29 degrees of freedom
Multiple R-squared: 0.8377, Adjusted R-squared: 0.8265
F-statistic: 74.83 on 2 and 29 DF, p-value: 3.551e-12
PS: This was my best attempt to solve this problem. Alternatively, I would be happy for any ideas on how to get estimates and 95%CIs for Welch-corrected contrasts in R that would not involve relying on custom functions from blogs.

Calculate by hand Fitted Values of an Interaction from a regression output

I am working with an interaction model similar to this one below:
set.seed(1993)
moderating <- sample(c("Yes", "No"),100, replace = T)
x <- sample(c("Yes", "No"), 100, replace = T)
y <- sample(1:100, 100, replace = T)
df <- data.frame(y, x, moderating)
Results <- lm(y ~ x*moderating)
summary(Results)
Call:
lm(formula = y ~ x * moderating)
Residuals:
Min 1Q Median 3Q Max
-57.857 -29.067 3.043 22.960 59.043
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 52.4000 6.1639 8.501 2.44e-13 ***
xYes 8.4571 9.1227 0.927 0.356
moderatingYes -11.4435 8.9045 -1.285 0.202
xYes:moderatingYes -0.1233 12.4563 -0.010 0.992
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 30.82 on 96 degrees of freedom
Multiple R-squared: 0.04685, Adjusted R-squared: 0.01707
F-statistic: 1.573 on 3 and 96 DF, p-value: 0.2009
I'm learning how to calculate the fitted value of a interaction from a regression table. In the example, the base category (or omitted category) is x= No and moderating = No.
Thus far, I know the following fitted values:
#Calulate Fitted Value From a Regression Interaction by hand
#Omitted Variable = X_no.M_no
X_no.M_no <- 52.4000
X_yes.M_no <- 52.4000 + 8.4571
X_no.M_yes <- 52.4000 + -11.4435
X_yes.M_yes #<- ?
I do not understand how the final category, X_yes.M_yes, is calculated. My initial thoughts were X_yes.M_yes <- 52.4000 + -0.1233, (the intercept plus the interaction term) but that is incorrect. I know its incorrect because, using the predict function, the fitted value of X_yes.M_yes = 49.29032, not 52.4000 + -0.1233 = 52.2767.
How do I calculate, by hand, the predicted value of the X_yes.M_yes category?
Here are the predicted values as generated from the predict function in R
#Validated Here Using the Predict Function:
newdat <- NULL
for(m in na.omit(unique(df$moderating))){
for(i in na.omit(unique(df$x))){
moderating <- m
x <- i
newdat<- rbind(newdat, data.frame(x, moderating))
}
}
Prediction.1 <- cbind(newdat, predict(Results, newdat, se.fit = TRUE))
Prediction.1
Your regression looks like this in math:
hat_y = a + b x + c m + d m x
Where x = 1 when "yes" and 0 when "no" and m is similarly defined by moderating.
Then X_yes.M_yes implies x = 1 and m = 1, so your prediction is a + b + c + d.
or in your notation X_yes.M_yes = 52.4000 + 8.4571 - 11.4435 - 0.1233

OLS Regression in R

I am really struggling with the following problem set using R
.
I want to simulate a data set with one dependent and 20 independent variables that are normally i.i.d.. Each variable should have 100 observations. (I managed to do this part)
(Now the part I am struggling with):
My plan is to conduct automated regressions for all possible combinations of up to 5 regressor using an own coded regression function that simulates the output of summary(lm) that uses a vector y and a matrix or vector x as input (so my.lm(y,x)). And then bringing the results in a suitable data structure.
I would be thankful for every hint!
I doubt the soundness of what you are trying to do but here it goes.
I will make up a dataset, since you have not posted one.
my.lm <- function(x, y, n = 5){
f <- function(inx){
inx_cols <- Combn[inx, ]
inx_cols <- inx_cols[inx_cols != 0]
X <- as.data.frame(x[, inx_cols])
names(X) <- paste0("X", inx_cols)
X <- cbind(X, y)
name_y <- names(X)[length(names(X))]
fmla <- as.formula(paste(name_y, ".", sep = "~"))
tryCatch(lm(fmla, data = X), error = function(e) e)
}
nc_x <- ncol(x)
nr <- sum(choose(nc_x, seq_len(n)))
Combn <- matrix(0, nrow = nr, ncol = n)
first <- 1
for(i in seq_len(n)){
last <- first + choose(nc_x, i) - 1
Combn[first:last, seq_len(i)] <- t(combn(nc_x, i))
first <- last + 1
}
apply(Combn, 1, f)
}
set.seed(6876)
regr <- replicate(20, rnorm(100))
coefs <- sample(-5:5, 20, TRUE)
resp <- regr %*% coefs + rnorm(100)
lm_list <- my.lm(regr, resp)
length(lm_list)
#[1] 21699
So the function above produced as many objects as expected.
Before continuing, let's see how many are errors (singular matrix, for instance).
err_list <- lapply(lm_list, function(x){
if(inherits(x, "error")) x else NULL
})
err_list <- err_list[!sapply(err_list, is.null)]
length(err_list)
#[1] 0
No errors.
So get the summaries of the objects of class "lm".
good_list <- lapply(lm_list, function(x){
if(inherits(x, "lm")) x else NULL
})
good_list <- good_list[!sapply(good_list, is.null)]
smry_list <- lapply(good_list, summary)
smry_list[[1]]
#
#Call:
# lm(formula = fmla, data = X)
#Residuals:
# Min 1Q Median 3Q Max
#-34.654 -9.487 -1.985 9.486 50.213
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.6449 1.5237 0.423 0.673
#X1 -7.3969 1.5074 -4.907 3.68e-06 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 15.02 on 98 degrees of freedom
#Multiple R-squared: 0.1972, Adjusted R-squared: 0.189
#F-statistic: 24.08 on 1 and 98 DF, p-value: 3.684e-06

R: lm() result differs when using `weights` argument and when using manually reweighted data

In order to correct heteroskedasticity in error terms, I am running the following weighted least squares regression in R :
#Call:
#lm(formula = a ~ q + q2 + b + c, data = mydata, weights = weighting)
#Weighted Residuals:
# Min 1Q Median 3Q Max
#-1.83779 -0.33226 0.02011 0.25135 1.48516
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) -3.939440 0.609991 -6.458 1.62e-09 ***
#q 0.175019 0.070101 2.497 0.013696 *
#q2 0.048790 0.005613 8.693 8.49e-15 ***
#b 0.473891 0.134918 3.512 0.000598 ***
#c 0.119551 0.125430 0.953 0.342167
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 0.5096 on 140 degrees of freedom
#Multiple R-squared: 0.9639, Adjusted R-squared: 0.9628
#F-statistic: 933.6 on 4 and 140 DF, p-value: < 2.2e-16
Where "weighting" is a variable (function of the variable q) used for weighting the observations. q2 is simply q^2.
Now, to double-check my results, I manually weight my variables by creating new weighted variables :
mydata$a.wls <- mydata$a * mydata$weighting
mydata$q.wls <- mydata$q * mydata$weighting
mydata$q2.wls <- mydata$q2 * mydata$weighting
mydata$b.wls <- mydata$b * mydata$weighting
mydata$c.wls <- mydata$c * mydata$weighting
And run the following regression, without the weights option, and without a constant - since the constant is weighted, the column of 1 in the original predictor matrix should now equal the variable weighting:
Call:
lm(formula = a.wls ~ 0 + weighting + q.wls + q2.wls + b.wls + c.wls,
data = mydata)
#Residuals:
# Min 1Q Median 3Q Max
#-2.38404 -0.55784 0.01922 0.49838 2.62911
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#weighting -4.125559 0.579093 -7.124 5.05e-11 ***
#q.wls 0.217722 0.081851 2.660 0.008726 **
#q2.wls 0.045664 0.006229 7.330 1.67e-11 ***
#b.wls 0.466207 0.121429 3.839 0.000186 ***
#c.wls 0.133522 0.112641 1.185 0.237876
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 0.915 on 140 degrees of freedom
#Multiple R-squared: 0.9823, Adjusted R-squared: 0.9817
#F-statistic: 1556 on 5 and 140 DF, p-value: < 2.2e-16
As you can see, the results are similar but not identical. Am I doing something wrong while manually weighting the variables, or does the option "weights" do something more than simply multiplying the variables by the weighting vector?
Provided you do manual weighting correctly, you won't see discrepancy.
So the correct way to go is:
X <- model.matrix(~ q + q2 + b + c, mydata) ## non-weighted model matrix (with intercept)
w <- mydata$weighting ## weights
rw <- sqrt(w) ## root weights
y <- mydata$a ## non-weighted response
X_tilde <- rw * X ## weighted model matrix (with intercept)
y_tilde <- rw * y ## weighted response
## remember to drop intercept when using formula
fit_by_wls <- lm(y ~ X - 1, weights = w)
fit_by_ols <- lm(y_tilde ~ X_tilde - 1)
Although it is generally recommended to use lm.fit and lm.wfit when passing in matrix directly:
matfit_by_wls <- lm.wfit(X, y, w)
matfit_by_ols <- lm.fit(X_tilde, y_tilde)
But when using these internal subroutines lm.fit and lm.wfit, it is required that all input are complete cases without NA, otherwise the underlying C routine stats:::C_Cdqrls will complain.
If you still want to use the formula interface rather than matrix, you can do the following:
## weight by square root of weights, not weights
mydata$root.weighting <- sqrt(mydata$weighting)
mydata$a.wls <- mydata$a * mydata$root.weighting
mydata$q.wls <- mydata$q * mydata$root.weighting
mydata$q2.wls <- mydata$q2 * mydata$root.weighting
mydata$b.wls <- mydata$b * mydata$root.weighting
mydata$c.wls <- mydata$c * mydata$root.weighting
fit_by_wls <- lm(formula = a ~ q + q2 + b + c, data = mydata, weights = weighting)
fit_by_ols <- lm(formula = a.wls ~ 0 + root.weighting + q.wls + q2.wls + b.wls + c.wls,
data = mydata)
Reproducible Example
Let's use R's built-in data set trees. Use head(trees) to inspect this dataset. There is no NA in this dataset. We aim to fit a model:
Height ~ Girth + Volume
with some random weights between 1 and 2:
set.seed(0); w <- runif(nrow(trees), 1, 2)
We fit this model via weighted regression, either by passing weights to lm, or manually transforming data and calling lm with no weigths:
X <- model.matrix(~ Girth + Volume, trees) ## non-weighted model matrix (with intercept)
rw <- sqrt(w) ## root weights
y <- trees$Height ## non-weighted response
X_tilde <- rw * X ## weighted model matrix (with intercept)
y_tilde <- rw * y ## weighted response
fit_by_wls <- lm(y ~ X - 1, weights = w)
#Call:
#lm(formula = y ~ X - 1, weights = w)
#Coefficients:
#X(Intercept) XGirth XVolume
# 83.2127 -1.8639 0.5843
fit_by_ols <- lm(y_tilde ~ X_tilde - 1)
#Call:
#lm(formula = y_tilde ~ X_tilde - 1)
#Coefficients:
#X_tilde(Intercept) X_tildeGirth X_tildeVolume
# 83.2127 -1.8639 0.5843
So indeed, we see identical results.
Alternatively, we can use lm.fit and lm.wfit:
matfit_by_wls <- lm.wfit(X, y, w)
matfit_by_ols <- lm.fit(X_tilde, y_tilde)
We can check coefficients by:
matfit_by_wls$coefficients
#(Intercept) Girth Volume
# 83.2127455 -1.8639351 0.5843191
matfit_by_ols$coefficients
#(Intercept) Girth Volume
# 83.2127455 -1.8639351 0.5843191
Again, results are the same.

how to generate data on the a given value of coefficient of multiple determination in R?

I need generate data on the a given value of coefficient of multiple determination.
For example,if i indicated R^2 = 0.77, i want generate data, which create regression model with R^2=0.77
but these data must be in a certain range. For example, sample= 100 and i need 4 variables(x1 - dependent var), where values in range from 5-15. How do that?
I use optim
optim(0.77, fn, gr = NULL,
method = c("Nelder-Mead", "BFGS", "CG", "L-BFGS-B", "SANN",
"Brent"),
lower = 5, upper = 15,
control = list(), hessian = FALSE)
but i don't know how create function fn for my purpose. Please help to write this function
First here's a solution:
library(mvtnorm)
get.r <- function(x) c((x+sqrt(x**2+3*x))/(3),(x-sqrt(x**2+3*x))/(3))
set.seed(123)
cv <- get.r(0.77)[1]
out <- rmvnorm(100,sigma=matrix(c(1,cv,cv,cv,cv,1,cv,cv,cv,cv,1,cv,cv,cv,cv,1),ncol=4))
out1 <- as.data.frame(10*(out-min(out))/diff(range(out))+5)
range(out1)
# [1] 5 15
lm1 <- lm(V1~V2+V3+V4,data=out1)
summary(lm1)
# Call:
# lm(formula = V1 ~ V2 + V3 + V4, data = out1)
#
# Residuals:
# Min 1Q Median 3Q Max
# -1.75179 -0.64323 -0.03397 0.64770 2.23142
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.36180 0.50940 0.710 0.479265
# V2 0.29557 0.09311 3.175 0.002017 **
# V3 0.31433 0.08814 3.566 0.000567 ***
# V4 0.35438 0.07581 4.674 9.62e-06 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.927 on 96 degrees of freedom
# Multiple R-squared: 0.7695, Adjusted R-squared: 0.7623
# F-statistic: 106.8 on 3 and 96 DF, p-value: < 2.2e-16
Now let me explain how I got there. We can construct this statistically. First we need to understand a little about correlation and covariance. One formula for correlation is
Corr(X, Y) = Cov(X,Y)/sqrt(Var(X)Var(Y))
And one formula for covariance is:
Cov(X,Y) = E(XY) - E(X)E(Y)
In your question you want to get the multiple correlation of the regression model:
Y = X1 + X2 + X3
Let's make this as simple as possible and force the variance of all variables to be 1 and let's make the pairwise correlation between any two variables to be equal and call it r.
Now we're looking for the square of the correlation between Y and X1 + X2 + X3, which is:
R^2 = [Cov(Y,X1 + X2 + X3)]^2/[Var(Y)Var(X1 + X2 + X3)]
Note that
Cov(Y,X1 + X2 + X3) = Cov(Y,X1) + Cov(Y,X2) + Cov(Y,X3)
Further note that the variance of each variable is 1 and the pairwise correlation is r, so the above result is equivalent to 3r.
Also note that
Var(X1 + X2 + X3) = Var(X1) + Var(X2) + Var(X3) + Cov(X1,X2) + Cov(X1,X3) + Cov(X2,X3).
Since the variance of each is 1, this is equivalent to 3 + 6r, so
R^2 = 9r^2/(3 + 6r) = 3r^2/(1 + 2r)
We can use the quadratic equation to solve for r and get
r = (R^2 +/- sqrt((R^2)^2+3R^2))/3
If we substitute R^2 = 0.77, then r = -0.3112633 or 0.8245966. We can use either to get what you need by using rmvnorm() within the mvtnorm package. And since R^2 is invariant to linear transformations, we can transform the resulting variables so that they fall between 5 and 15.
Update:
If we want to simulate with n predictors, we can use the following (note that I am not transforming the range of each predictor, but that can be done after the fact without altering the multiple R^2):
get.r <- function(x,n) c(((n-1)*x+sqrt(((n-1)*x)**2+4*n*x))/(2*n),
((n-1)*x-sqrt(((n-1)*x)**2+4*n*x))/(2*n))
sim.data <- function(R2, n) {
sig.mat <- matrix(get.r(R2,n+1)[1],n+1,n+1)
diag(sig.mat) <- 1
out <- as.data.frame(rmvnorm(100,sigma=sig.mat))
return(out)
}
This isn't an answer, but I wanted to share what I did. I don't believe optim can be used the way you want it to. I attempted a "brute force" method to find a dataset that could work, but the highest r-squared I "randomed" was 0.23:
# Initializing our boolean and counter.
rm(list = ls())
Done <- FALSE
count <- 1
maxr2 <- .000001
# I set y ahead of time.
y <- sample(5:15, 100, replace = TRUE)
# Running until an appropriate r-squared is found.
while(!Done) {
# Generating a sample data set to optimize y on.
a <- sample(5:15, 100, replace = TRUE)
b <- sample(5:15, 100, replace = TRUE)
c <- sample(5:15, 100, replace = TRUE)
data <- data.frame(y = y, a = a, b = b, c = c)
# Making our equation and making a linear model.
EQ <- "y ~ a + b + c" # Creating the equation.
model <- lm(EQ, data) # Running the model.
if (count != 1) { if (summary(model)$r.squared > maxr2) { maxr2 <- summary(model)$r.squared } }
r2 <- summary(model)$r.squared # Grabbing the r-squared.
print(r2) # Printing r-squared out to see what is popping out.
if (r2 <= 0.78 & r2 >= 0.76) { Done <- TRUE } # If the r-squared is satfisfactory, pop it out.
count <- count + 1 # Incrementing our counter.
if (count >= 1000000) { Done <- TRUE ; print("A satisfactory r-squared was not found.") } # Setting this to run at most 1,000,000 times.
}
# Data will be your model that has an r-squared of 0.77 if you found one.
The issue with optim is that it optimizes individual parameters, single values. The first argument in optim is the par argument, which is meant to be a list of the values you want to optimize. This could be used in optimizing an r-squared by some decay function that is dependent on several values (these would be your par values). However, in this case, you're asking to optimize entire columns towards maximizing an r-squared, which doesn't make sense (as far as I know) with optim.

Resources