I need to write the following formulas in R. The STAT formula should reproduce the statistic computed by the oneway.test function.
where sample variance is
and
The variables are: m - number of samples, n - sample size, vector sample_means - mean of each sample and vector sample_vars - sample variance of each sample.
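Judging from the code below, the three formulas are presumably of the following form. This is a reconstruction from that code, not the original notation; $\bar{x}_i$ and $s_i^2$ stand for the entries of sample_means and sample_vars, and $s_{\bar{x}}^2$ and $\bar{x}$ are symbols assumed here for the variance of the sample means and the overall mean.

```latex
\mathrm{STAT} = \frac{n \, s_{\bar{x}}^{2}}{\frac{1}{m}\sum_{i=1}^{m} s_i^{2}},
\qquad
s_{\bar{x}}^{2} = \frac{1}{m-1}\sum_{i=1}^{m}\left(\bar{x}_i - \bar{x}\right)^{2},
\qquad
\bar{x} = \frac{1}{m}\sum_{i=1}^{m}\bar{x}_i
```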
I'm trying to work with the following code, but it doesn't give the correct results when I compare it to aov:
my_anova <- function(m, n, sample_means, sample_vars) {
overall_mean <- mean(sample_means)
sample_vars <- sum((sample_means - overall_mean)^2)/(m-1)
STAT <- (n*sample_vars)/(sum(sample_vars/m))
PVAL <- pf(STAT, m - 1, m*(n - 1), lower.tail = FALSE)
}
I'm not sure where you obtained the formulas above, but as far as I can gather, you want the F statistic and p-value for a one-way ANOVA. n should be the degrees of freedom and not the sample size. Try using this table:
So the bottom line is that SSF should always be the sum of squared differences between your predicted means and the overall mean, whereas SSE is the sum of squared differences between your predicted means and the actual values. Then you divide each by its corresponding degrees of freedom. It should look like this:
my_aov <- function(sample_values, sample_means,n){
overall_mean = mean(sample_values)
SSF = sum((sample_means - overall_mean)^2)
SSE = sum((sample_values - sample_means)^2)
DoF = c(n,length(sample_values)-1-n)
Mean_Square = c(SSF/DoF[1] , SSE/DoF[2])
FSTAT = c(Mean_Square[1]/Mean_Square[2],NA)
PVAL <- pf(FSTAT, DoF[1], DoF[2], lower.tail = FALSE)
cbind(Sum_of_Squares= c(SSF,SSE),DoF,Mean_Square,FSTAT,PVAL)
}
Using an example:
values = iris$Sepal.Length
Species_values = tapply(iris$Sepal.Length,iris$Species,mean)
predicted_values = Species_values[as.character(iris$Species)]
# since there are 3 groups, degree of freedom is 3-1
n = length(unique(iris$Species)) - 1
my_aov(values,predicted_values,n)
Sum_of_Squares DoF Mean_Square FSTAT PVAL
[1,] 63.21213 2 31.6060667 119.2645 1.669669e-31
[2,] 38.95620 147 0.2650082 NA NA
Compare with:
summary(aov(Sepal.Length ~ Species,data=iris))
Df Sum Sq Mean Sq F value Pr(>F)
Species 2 63.21 31.606 119.3 <2e-16 ***
Residuals 147 38.96 0.265
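If only the per-sample summaries from the question are available (m, n, sample_means, sample_vars), the same F statistic can be computed without the raw values. Here is a minimal sketch, assuming a balanced design (every sample has size n) and a pooled, equal-variance test; the function name anova_from_summaries is made up for illustration.

```r
# Sketch: one-way ANOVA F test from group summaries.
# Assumes m groups, each of size n, and equal variances across groups.
anova_from_summaries <- function(m, n, sample_means, sample_vars) {
  overall_mean <- mean(sample_means)
  # between-group mean square: n times the variance of the group means
  ms_between <- n * sum((sample_means - overall_mean)^2) / (m - 1)
  # within-group mean square: average of the group variances (balanced case)
  ms_within <- mean(sample_vars)
  stat <- ms_between / ms_within
  pval <- pf(stat, m - 1, m * (n - 1), lower.tail = FALSE)
  list(STAT = stat, PVAL = pval)
}

# iris has 3 species with 50 observations each
grp_means <- tapply(iris$Sepal.Length, iris$Species, mean)
grp_vars  <- tapply(iris$Sepal.Length, iris$Species, var)
res <- anova_from_summaries(3, 50, grp_means, grp_vars)
res
```

For iris this should agree with the aov table above and with oneway.test(Sepal.Length ~ Species, data = iris, var.equal = TRUE).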
---
I have a data frame with many columns. The first column contains categories such as "System 1" and "System 2", and the remaining columns contain 0's and 1's. Please see below:
For example:
SYSTEM  Q1  Q2
S1      0   1
S1      1   0
S2      1   1
S2      0   0
S2      1   1
I have this code in R to run a bootstrap 95% CI. It uses a function that obtains the statistic from the data (with indexing).
Here is my code:
m <- 1e4
n <- 5
set.seed(42)
df2 <- data.frame(SYSTEM=rep(c('S1', 'S2'), each=n/2), matrix(sample(0:1, m*n, replace=TRUE), m, n))
names(df2)[-1] <- paste0('Q', 1:n)
set.seed(0)
library(boot)
#define function to calculate fitted regression coefficients
coef_function <- function(formula, data, indices) {
d <- data[indices,] #allows boot to select sample
fit <- lm(formula, data=d) #fit regression model
return(coef(fit)) #return coefficient estimates of model
}
#perform bootstrapping with 2000 replications
reps <- boot(data=df2, statistic=coef_function, R=2000, formula=Q1~Q2)
#view results of bootstrapping
reps
#calculate adjusted bootstrap percentile (BCa) intervals
boot.ci(reps, type="bca", index=1) #intercept of model
boot.ci(reps, type="bca", index=2) #disp predictor variable
Result should be :
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = df2, statistic = coef_function, R = 2000, formula = Q1 ~
Q2)
Bootstrap Statistics :
original bias std. error
t1* 0.600 0.00082 0.074
t2* -0.073 -0.00182 0.099
> boot.ci(reps, type="bca", index=1) #intercept of model
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 2000 bootstrap replicates
CALL :
boot.ci(boot.out = reps, type = "bca", index = 1)
Intervals :
Level BCa
95% ( 0.45, 0.74 )
Calculations and Intervals on Original Scale
> boot.ci(reps, type="bca", index=2) #disp predictor variable
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 2000 bootstrap replicates
CALL :
boot.ci(boot.out = reps, type = "bca", index = 2)
Intervals :
Level BCa
95% (-0.26, 0.13 )
Calculations and Intervals on Original Scale
Here I'm only using Q1 and Q2, and I didn't group by SYSTEM.
I don't know whether it is possible to do this for all groups and columns at once.
Thank you in advance.
If 'Q1' is the response variable, we may group by 'SYSTEM', then loop across the columns 'Q2' to 'Q5', create the formula from the column name (cur_column()) with 'Q1' as the response in reformulate, and pass it to boot:
library(boot)
library(dplyr)
out <- df2 %>%
group_by(SYSTEM) %>%
summarise(across(Q2:Q5,
~ list(boot(cur_data(), statistic = coef_function, R = 2000,
formula = reformulate(cur_column(), response = 'Q1')))), .groups = 'drop')
Output:
> out
# A tibble: 2 × 5
SYSTEM Q2 Q3 Q4 Q5
<chr> <list> <list> <list> <list>
1 S1 <boot> <boot> <boot> <boot>
2 S2 <boot> <boot> <boot> <boot>
If we extract the column, the output will be
> out$Q2
[[1]]
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = cur_data(), statistic = coef_function, R = 2000,
formula = reformulate(cur_column(), response = "Q1"))
Bootstrap Statistics :
original bias std. error
t1* 0.48025529 -0.0001032709 0.01019634
t2* 0.02355538 0.0003813531 0.01412119
[[2]]
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = cur_data(), statistic = coef_function, R = 2000,
formula = reformulate(cur_column(), response = "Q1"))
Bootstrap Statistics :
original bias std. error
t1* 0.49564873 -0.0002947112 0.009942382
t2* 0.01850984 0.0003610360 0.013914520
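The same idea can be written with base R's split and lapply, which returns the intervals as a plain data frame and does not rely on cur_data() (recent dplyr releases deprecate it in favour of pick()). This is only a sketch: the toy data frame, the number of replicates, the choice of percentile intervals and the name ci_table are all assumptions made for illustration.

```r
library(boot)

# same statistic as in the question: coefficients of a linear model
coef_function <- function(formula, data, indices) {
  d <- data[indices, ]
  coef(lm(formula, data = d))
}

# small toy data set shaped like the question's df2 (hypothetical values)
set.seed(42)
toy <- data.frame(SYSTEM = rep(c("S1", "S2"), each = 50),
                  matrix(sample(0:1, 100 * 3, replace = TRUE), 100, 3))
names(toy)[-1] <- paste0("Q", 1:3)

predictors <- c("Q2", "Q3")

# one row per SYSTEM x predictor: slope and its 95% percentile interval
ci_table <- do.call(rbind, lapply(split(toy, toy$SYSTEM), function(d) {
  do.call(rbind, lapply(predictors, function(p) {
    b <- boot(d, statistic = coef_function, R = 500,
              formula = reformulate(p, response = "Q1"))
    ci <- boot.ci(b, type = "perc", index = 2)$percent[4:5]
    data.frame(SYSTEM = d$SYSTEM[1], predictor = p,
               slope = unname(b$t0[2]), lwr = ci[1], upr = ci[2])
  }))
}))
ci_table
```

Using type = "bca" instead of "perc" follows the question more closely, at the cost of extra computation per interval.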
Here's my problem: Why do the following procedures (classical version and custom function) for planned contrasts lead to different results for the estimate and the 95%CI? Please note that I copied the custom function from this website.
#classical version
data(mtcars)
#set Helmert contrasts
cyl2<-c(-1,1,0)
cyl1<-c(-1,-1,2)
mtcars$cyl<-factor(mtcars$cyl)
contrasts(mtcars$cyl) <-cbind(cyl1,cyl2)
classical<-summary.lm(aov(disp~cyl, mtcars))
#custom function (I want to use it because it includes results for equal AND unequal variances --> if the custom function is correct, results for equal variances should be the same as in the classical example):
oneway <- function(dv, group, contrast, alpha = .05) {
# -- arguments --
# dv: vector of measurements (i.e., dependent variable)
# group: vector that identifies which group the dv measurement came from
# contrast: list of named contrasts
# alpha: alpha level for 1 - alpha confidence level
# -- output --
# computes confidence interval and test statistic for a linear contrast of population means in a between-subjects design
# returns a data.frame object
# estimate (est), standard error (se), t-statistic (t), degrees of freedom (df), two-tailed p-value (p), and lower (lwr) and upper (upr) confidence limits at requested 1 - alpha confidence level
# first line reports test statistics that assume variances are equal
# second line reports test statistics that do not assume variances are equal
# means, variances, and sample sizes
ms <- by(dv, group, mean, na.rm = TRUE)
vars <- by(dv, group, var, na.rm = TRUE)
ns <- by(dv, group, function(x) sum(!is.na(x)))
# convert list of contrasts to a matrix of named contrasts by row
contrast <- matrix(unlist(contrast), nrow = length(contrast), byrow = TRUE, dimnames = list(names(contrast), NULL))
# contrast estimate
est <- contrast %*% ms
# welch test statistic
se_welch <- sqrt(contrast^2 %*% (vars / ns))
t_welch <- est / se_welch
# classic test statistic
mse <- anova(lm(dv ~ factor(group)))$"Mean Sq"[2]
se_classic <- sqrt(mse * (contrast^2 %*% (1 / ns)))
t_classic <- est / se_classic
# if dimensions of contrast are NULL, number of contrasts = 1, if not, number of contrasts = dimensions of contrast
num_contrast <- ifelse(is.null(dim(contrast)), 1, dim(contrast)[1])
df_welch <- rep(0, num_contrast)
df_classic <- rep(0, num_contrast)
# makes rows of contrasts if contrast dimensions aren't NULL
if(is.null(dim(contrast))) contrast <- t(as.matrix(contrast))
# calculating degrees of freedom for welch and classic
for(i in 1:num_contrast) {
df_classic[i] <- sum(ns) - length(ns)
df_welch[i] <- sum(contrast[i, ]^2 * vars / ns)^2 / sum((contrast[i, ]^2 * vars / ns)^2 / (ns - 1))
}
# p-values
p_welch <- 2 * (1 - pt(abs(t_welch), df_welch))
p_classic <- 2 * (1 - pt(abs(t_classic), df_classic))
# 95% confidence intervals
lwr_welch <- est - se_welch * qt(p = 1 - (alpha / 2), df = df_welch)
upr_welch <- est + se_welch * qt(p = 1 - (alpha / 2), df = df_welch)
lwr_classic <- est - se_classic * qt(p = 1 - (alpha / 2), df = df_classic)
upr_classic <- est + se_classic * qt(p = 1 - (alpha / 2), df = df_classic)
# output
data.frame(contrast = rep(rownames(contrast), times = 2),
equal_var = rep(c("Assumed", "Not Assumed"), each = num_contrast),
est = rep(est, times = 2),
se = c(se_classic, se_welch),
t = c(t_classic, t_welch),
df = c(df_classic, df_welch),
p = c(p_classic, p_welch),
lwr = c(lwr_classic, lwr_welch),
upr = c(upr_classic, upr_welch))
}
#results for mtcars with and without Welch correction:
custom<-(with(mtcars,
oneway(dv = disp, group= cyl, contrast = list (cyl1=c(-1,-1,2), cyl2 =c(-1,1,0)))))
Now results are the same for p and t for the classical and the custom version, as expected (at least when equal_var = Assumed). But why are the estimate and the 95%CIs different?
> custom
contrast equal_var est se t df p lwr upr
1 cyl1 Assumed 417.74935 37.20986 11.226845 29.000000 4.487966e-12 341.64664 493.8521
2 cyl2 Assumed 78.17792 24.96113 3.131986 29.000000 3.945539e-03 27.12667 129.2292
3 cyl1 Not Assumed 417.74935 40.30748 10.364066 18.452900 3.985000e-09 333.21522 502.2835
4 cyl2 Not Assumed 78.17792 17.67543 4.422972 9.224964 1.566927e-03 38.34147 118.0144
> classical
Call:
aov(formula = disp ~ cyl, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-77.300 -30.586 -6.568 20.814 118.900
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 213.850 9.507 22.494 < 2e-16 ***
cyl1 69.625 6.202 11.227 4.49e-12 ***
cyl2 39.089 12.481 3.132 0.00395 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 51.63 on 29 degrees of freedom
Multiple R-squared: 0.8377, Adjusted R-squared: 0.8265
F-statistic: 74.83 on 2 and 29 DF, p-value: 3.551e-12
PS: This was my best attempt to solve this problem. Alternatively, I would be happy for any ideas on how to get estimates and 95%CIs for Welch-corrected contrasts in R that would not involve relying on custom functions from blogs.
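One thing worth checking is the scaling of the two estimates. When the contrast columns are orthogonal to each other and sum to zero, as they do here, the coefficient reported by lm/aov equals the contrast estimate sum(c_i * mean_i) divided by the sum of squared weights, sum(c_i^2). With the numbers above, 417.749 / 6 = 69.625 and 78.178 / 2 = 39.089. The sketch below checks this on mtcars; the variable names (cyl_f, cmat, psi, scaled) are chosen here for illustration.

```r
data(mtcars)
cyl_f <- factor(mtcars$cyl)
cmat <- cbind(cyl1 = c(-1, -1, 2), cyl2 = c(-1, 1, 0))
contrasts(cyl_f) <- cmat
fit <- lm(disp ~ cyl_f, data = mtcars)

# group means of disp for 4, 6 and 8 cylinders
group_means <- sapply(split(mtcars$disp, cyl_f), mean)

# contrast estimates as computed by the custom function
psi <- drop(t(cmat) %*% group_means)

# dividing by the sum of squared weights gives the lm coefficients
scaled <- psi / colSums(cmat^2)

cbind(psi, scaled, lm_coef = coef(fit)[-1])
```

The standard errors and confidence limits are rescaled by the same factor, which is why the t and p values agree while the estimates and intervals do not.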
When we have a linear model with a factor variable X (with levels A, B, and C)
y ~ factor(X) + Var2 + Var3
The result shows the estimates XB and XC, which are the differences B - A and C - A (supposing that the reference level is A).
If we want to know the p-value of the difference between B and C (C - B),
we have to designate B or C as the reference group and re-run the model.
Can we get the p-values of the effect B - A, C - A, and C - B at one time?
You are looking for a linear hypothesis test, that is, checking the p-value of some linear combination of regression coefficients. Based on my answer to How to conduct linear hypothesis test on regression coefficients with a clustered covariance matrix?, where we only considered a sum of coefficients, I will extend the function LinearCombTest to handle more general cases, with alpha holding the combination coefficients for the variables in vars:
LinearCombTest <- function (lmObject, vars, alpha, .vcov = NULL) {
## if `.vcov` missing, use the one returned by `lm`
if (is.null(.vcov)) .vcov <- vcov(lmObject)
## estimated coefficients
beta <- coef(lmObject)
## linear combination of `vars` with combination coefficients `alpha`
LinearComb <- sum(beta[vars] * alpha)
## get standard error of `LinearComb`
LinearComb_se <- sum(alpha * crossprod(.vcov[vars, vars], alpha)) ^ 0.5
## perform t-test on `LinearComb`
tscore <- LinearComb / LinearComb_se
pvalue <- 2 * pt(abs(tscore), lmObject$df.residual, lower.tail = FALSE)
## return a matrix
form <- paste0("(", paste(alpha, vars, sep = " * "), ")")
form <- paste0(paste0(form, collapse = " + "), " = 0")
matrix(c(LinearComb, LinearComb_se, tscore, pvalue), nrow = 1L,
dimnames = list(form, c("Estimate", "Std. Error", "t value", "Pr(>|t|)")))
}
Consider a simple example with a balanced design for three groups A, B and C, simulated with group means 0, 1 and 2, respectively (the code then shifts every observation by 0.1).
x <- gl(3,100,labels = LETTERS[1:3])
set.seed(0)
y <- c(rnorm(100, 0), rnorm(100, 1), rnorm(100, 2)) + 0.1
fit <- lm(y ~ x)
coef(summary(fit))
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.1226684 0.09692277 1.265631 2.066372e-01
#xB 0.9317800 0.13706949 6.797866 5.823987e-11
#xC 2.0445528 0.13706949 14.916177 6.141008e-38
Since A is the reference level, xB is giving B - A while xC is giving C - A. Suppose we are now interested in the difference between group B and C, i.e., C - B, we can use
LinearCombTest(fit, c("xC", "xB"), c(1, -1))
# Estimate Std. Error t value Pr(>|t|)
#(1 * xC) + (-1 * xB) = 0 1.112773 0.1370695 8.118312 1.270686e-14
Note, this function is also handy to work out the group mean of B and C, that is (Intercept) + xB and (Intercept) + xC:
LinearCombTest(fit, c("(Intercept)", "xB"), c(1, 1))
# Estimate Std. Error t value Pr(>|t|)
#(1 * (Intercept)) + (1 * xB) = 0 1.054448 0.09692277 10.87926 2.007956e-23
LinearCombTest(fit, c("(Intercept)", "xC"), c(1, 1))
# Estimate Std. Error t value Pr(>|t|)
#(1 * (Intercept)) + (1 * xC) = 0 2.167221 0.09692277 22.36029 1.272811e-65
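To get B - A, C - A and C - B in a single pass, the same calculation can be done with a contrast matrix that has one row per comparison. This is a sketch of that idea in base R rather than part of LinearCombTest; the names K and pairwise are chosen here for illustration, and the p-values are unadjusted.

```r
x <- gl(3, 100, labels = LETTERS[1:3])
set.seed(0)
y <- c(rnorm(100, 0), rnorm(100, 1), rnorm(100, 2)) + 0.1
fit <- lm(y ~ x)

# each row holds the weights on ((Intercept), xB, xC) for one comparison
K <- rbind("B - A" = c(0,  1, 0),
           "C - A" = c(0,  0, 1),
           "C - B" = c(0, -1, 1))

est  <- drop(K %*% coef(fit))                  # K beta
se   <- sqrt(diag(K %*% vcov(fit) %*% t(K)))   # sqrt(diag(K V K'))
tval <- est / se
pval <- 2 * pt(abs(tval), df.residual(fit), lower.tail = FALSE)

pairwise <- cbind(Estimate = est, "Std. Error" = se,
                  "t value" = tval, "Pr(>|t|)" = pval)
pairwise
```

Each row should match what LinearCombTest returns for the corresponding vars and alpha.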
Alternative solution with lsmeans
Consider the above toy example again:
library(lsmeans)
lsmeans(fit, spec = "x", contr = "revpairwise")
#$lsmeans
# x lsmean SE df lower.CL upper.CL
# A 0.1226684 0.09692277 297 -0.06807396 0.3134109
# B 1.0544484 0.09692277 297 0.86370603 1.2451909
# C 2.1672213 0.09692277 297 1.97647888 2.3579637
#
#Confidence level used: 0.95
#
#$contrasts
# contrast estimate SE df t.ratio p.value
# B - A 0.931780 0.1370695 297 6.798 <.0001
# C - A 2.044553 0.1370695 297 14.916 <.0001
# C - B 1.112773 0.1370695 297 8.118 <.0001
#
#P value adjustment: tukey method for comparing a family of 3 estimates
The $lsmeans component returns the marginal group means, while $contrasts returns the pairwise differences in group means, since we have used the "revpairwise" contrast. Read p.32 of the lsmeans manual for the difference between "pairwise" and "revpairwise".
This is useful, as we can compare it with the result from LinearCombTest: the estimates, standard errors and t values agree, so LinearCombTest is working correctly.
glht (general linear hypothesis testing) from multcomp package makes this sort of multiple hypothesis test easy without re-running a bunch of separate models. It is essentially crafting a customized contrast matrix based on your defined comparisons of interest.
Using your example comparisons and building on the data @ZheyuanLi provided:
x <- gl(3,100,labels = LETTERS[1:3])
set.seed(0)
y <- c(rnorm(100, 0), rnorm(100, 1), rnorm(100, 2)) + 0.1
fit <- lm(y ~ x)
library(multcomp)
my_ht <- glht(fit, linfct = mcp(x = c("B-A = 0",
"C-A = 0",
"C-B = 0")))
summary(my_ht) will give you the adjusted p-values for the comparisons of interest.
#Linear Hypotheses:
# Estimate Std. Error t value Pr(>|t|)
#B - A == 0 0.9318 0.1371 6.798 1.11e-10 ***
#C - A == 0 2.0446 0.1371 14.916 < 1e-10 ***
#C - B == 0 1.1128 0.1371 8.118 < 1e-10 ***
You could use the car package and its linearHypothesis function, with the vcov. argument.
Set this to the variance-covariance matrix of your model.
The function takes formulas or a matrix to describe the system of equations that you would like to test.
I am new to R and am using it to solve a statistical problem. I'm currently estimating the parameters of a distribution from 200 random numbers (RN) that I generate in R. I generate the 200 RN 100 times, so there are 100 sets of 200 RN, and I want to estimate the parameters for each of these 100 sets. That means there will be 100 sets of estimation results.
So here is the code I use to generate the RN:
#Generate random numbers U~(0, 1)
rep <-100 #total replication
unif <-matrix(0, 200, rep)
for (k in 1: rep)
{
unif[,k] <- runif(200, min = 0, max = 1)
}
# Based on the 100 sets of generated random numbers that follow U(0, 1), I generate 100 sets of random numbers that follow the distribution to be estimated:
# Define parameters
a <- 49.05 #1st parameter
b <- 3.148 #2nd parameter
c <- 0.145 #3rd parameter
d <- 0.00007181 #4th parameter
X <-matrix(0, 200, rep)
for (k in 1: rep)
{
X[,k] <- a*(log(1-((log(1-((unif[,k])^(1/c))))/(a*d))))^(1/b)
}
# Sorting the generated RN from the smallest to the largest
X_sort <-matrix(0, 200, rep)
for (k in 1: rep)
{
X_sort[,k] <- sort(X[,k])
}
Up to here I've managed to generate the 100 sets of RN to be estimated. However, the problem I face now is how to run the estimation for all 100 sets; I can only do one at a time. Here is the code I use to estimate the parameters with the maxLik package, using the BHHH method:
xi = X_sort[,1]
log_likelihood<-function(theta,xi){
p1 <- theta[1] #1st parameter
p2 <- theta[2] #2nd parameter
p3 <- theta[3] #3rd parameter
p4 <- theta[4] #4th parameter
logL=log((p4*p2*p3*((xi/p1)^(p2-1))*(exp(((xi/p1)^(p2))+(p4*p1*(1-(exp((xi/p1)^(p2)))))))*((1-(exp((p4*p1*(1-(exp((xi/p1)^(p2))))))))^(p3-1))))
return(logL)
}
library(maxLik);
# Initial parameters
a <- 49.05 #1st parameter
b <- 3.148 #2nd parameter
c <- 0.145 #3rd parameter
d <- 0.00007181 #4th parameter
m <- maxLik(log_likelihood, start=c(a,b,c,d), xi = xi, method="bhhh");
summary(m)
Here is the result:
--------------------------------------------
Maximum Likelihood estimation
BHHH maximisation, 5 iterations
Return code 2: successive function values within tolerance limit
Log-Likelihood: -874.0024
4 free parameters
Estimates:
Estimate Std. error t value Pr(> t)
[1,] 4.790e+01 1.846e+00 25.953 < 2e-16 ***
[2,] 3.015e+00 1.252e-01 24.091 < 2e-16 ***
[3,] 1.717e-01 2.964e-02 5.793 6.91e-09 ***
[4,] 7.751e-05 6.909e-05 1.122 0.262
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
--------------------------------------------
To estimate the other 99 sets, I have to manually change xi = X_sort[,k] for k = 1, 2, ..., 100: X_sort[,2] for the second set, and so on up to the hundredth. This is inefficient because replacing them one by one takes a long time. Is there a way to modify this code so that estimating the other sets does not take so long?
First, I'd suggest rewriting your code in a more compact way.
1. Generating random numbers. There is no need to generate 100 vectors of length 200 one at a time; we can generate a single vector of length 100*200 and fill a matrix with it column-wise. This can be done as follows:
rep <-100
n <- 200
unif <- matrix(runif(rep*n, min = 0, max = 1), n, rep)
2. Applying a function to a matrix. Arithmetic and functions such as log are vectorized in R, so they work element-wise on matrices. In your case:
X <- a*(log(1-((log(1-((unif)^(1/c))))/(a*d))))^(1/b)
3. Column-wise matrix sorting. We can easily sort each column of the matrix using the apply function. The argument 2 means that we do it column-wise (1 stands for rows).
X_sort <- apply(X, 2, sort)
4. Performing estimations. Again, we can use apply here.
estimations <- apply(X_sort, 2, function(x) maxLik(log_likelihood, start=c(a,b,c,d),
xi = x, method="bhhh"))
Then to print all the summaries you can do the following:
lapply(estimations, summary)
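With 100 fits it is usually more convenient to collect the point estimates in one matrix than to read 100 summaries. The sketch below shows that pattern with stand-ins so that it runs without extra packages: a two-parameter Weibull likelihood replaces the four-parameter one from the question, base R's optim replaces maxLik, and the names negloglik and fit_one are made up for illustration. A fit that errors or fails to converge is recorded as NA instead of stopping the whole run.

```r
set.seed(1)
n <- 200
reps <- 20
# stand-in data: each column is one simulated sample
X <- matrix(rweibull(n * reps, shape = 3, scale = 50), n, reps)

# negative log-likelihood, parameterised on the log scale so that
# the optimiser cannot step into invalid (non-positive) values
negloglik <- function(theta, xi) {
  -sum(dweibull(xi, shape = exp(theta[1]), scale = exp(theta[2]), log = TRUE))
}

# fit one column; return NA for both parameters if the fit fails
fit_one <- function(x) {
  out <- tryCatch(optim(log(c(2, 40)), negloglik, xi = x),
                  error = function(e) NULL)
  if (is.null(out) || out$convergence != 0) {
    c(shape = NA_real_, scale = NA_real_)
  } else {
    c(shape = exp(out$par[1]), scale = exp(out$par[2]))
  }
}

# one row of estimates per replication
est <- t(apply(X, 2, fit_one))
head(est)
colMeans(est, na.rm = TRUE)
```

With the maxLik fits from above, the analogous step would be to apply coef to each element of estimations and bind the results into rows.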
I need to generate data with a given value of the coefficient of multiple determination.
For example, if I specify R^2 = 0.77, I want to generate data that yield a regression model with R^2 = 0.77,
but these data must lie in a certain range. For example, the sample size is 100 and I need 4 variables (x1 is the dependent variable) with values in the range 5 to 15. How can I do that?
I tried optim:
optim(0.77, fn, gr = NULL,
method = c("Nelder-Mead", "BFGS", "CG", "L-BFGS-B", "SANN",
"Brent"),
lower = 5, upper = 15,
control = list(), hessian = FALSE)
but I don't know how to write the function fn for my purpose. Please help me write this function.
First here's a solution:
library(mvtnorm)
get.r <- function(x) c((x+sqrt(x**2+3*x))/(3),(x-sqrt(x**2+3*x))/(3))
set.seed(123)
cv <- get.r(0.77)[1]
out <- rmvnorm(100,sigma=matrix(c(1,cv,cv,cv,cv,1,cv,cv,cv,cv,1,cv,cv,cv,cv,1),ncol=4))
out1 <- as.data.frame(10*(out-min(out))/diff(range(out))+5)
range(out1)
# [1] 5 15
lm1 <- lm(V1~V2+V3+V4,data=out1)
summary(lm1)
# Call:
# lm(formula = V1 ~ V2 + V3 + V4, data = out1)
#
# Residuals:
# Min 1Q Median 3Q Max
# -1.75179 -0.64323 -0.03397 0.64770 2.23142
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.36180 0.50940 0.710 0.479265
# V2 0.29557 0.09311 3.175 0.002017 **
# V3 0.31433 0.08814 3.566 0.000567 ***
# V4 0.35438 0.07581 4.674 9.62e-06 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.927 on 96 degrees of freedom
# Multiple R-squared: 0.7695, Adjusted R-squared: 0.7623
# F-statistic: 106.8 on 3 and 96 DF, p-value: < 2.2e-16
Now let me explain how I got there. We can construct this statistically. First we need to understand a little about correlation and covariance. One formula for correlation is
Corr(X, Y) = Cov(X,Y)/sqrt(Var(X)Var(Y))
And one formula for covariance is:
Cov(X,Y) = E(XY) - E(X)E(Y)
In your question you want to get the multiple correlation of the regression model:
Y = X1 + X2 + X3
Let's make this as simple as possible and force the variance of all variables to be 1 and let's make the pairwise correlation between any two variables to be equal and call it r.
Now we're looking for the square of the correlation between Y and X1 + X2 + X3, which is:
R^2 = [Cov(Y,X1 + X2 + X3)]^2/[Var(Y)Var(X1 + X2 + X3)]
Note that
Cov(Y,X1 + X2 + X3) = Cov(Y,X1) + Cov(Y,X2) + Cov(Y,X3)
Further note that the variance of each variable is 1 and the pairwise correlation is r, so the above result is equivalent to 3r.
Also note that
Var(X1 + X2 + X3) = Var(X1) + Var(X2) + Var(X3) + 2Cov(X1,X2) + 2Cov(X1,X3) + 2Cov(X2,X3).
Since the variance of each is 1, this is equivalent to 3 + 6r, so
R^2 = 9r^2/(3 + 6r) = 3r^2/(1 + 2r)
We can use the quadratic equation to solve for r and get
r = (R^2 +/- sqrt((R^2)^2+3R^2))/3
If we substitute R^2 = 0.77, then r = -0.3112633 or 0.8245966. We can use either to get what you need by using rmvnorm() within the mvtnorm package. And since R^2 is invariant to linear transformations, we can transform the resulting variables so that they fall between 5 and 15.
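For completeness, the step from the expression for R^2 to the two roots can be written out as follows. This only rearranges the formulas already given, using the same symbols r and R^2.

```latex
R^2 = \frac{3r^2}{1 + 2r}
\;\Longrightarrow\;
3r^2 - 2R^2\, r - R^2 = 0
\;\Longrightarrow\;
r = \frac{2R^2 \pm \sqrt{4(R^2)^2 + 12R^2}}{6}
  = \frac{R^2 \pm \sqrt{(R^2)^2 + 3R^2}}{3}

% with R^2 = 0.77:
\sqrt{0.77^2 + 3(0.77)} = \sqrt{2.9029} \approx 1.7038,
\qquad
r \approx \frac{0.77 + 1.7038}{3} \approx 0.8246
\quad\text{or}\quad
r \approx \frac{0.77 - 1.7038}{3} \approx -0.3113
```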
Update:
If we want to simulate with n predictors, we can use the following (note that I am not transforming the range of each predictor, but that can be done after the fact without altering the multiple R^2):
get.r <- function(x,n) c(((n-1)*x+sqrt(((n-1)*x)**2+4*n*x))/(2*n),
((n-1)*x-sqrt(((n-1)*x)**2+4*n*x))/(2*n))
sim.data <- function(R2, n) {
sig.mat <- matrix(get.r(R2,n)[1],n+1,n+1) # n predictors plus the response
diag(sig.mat) <- 1
out <- as.data.frame(rmvnorm(100,sigma=sig.mat))
return(out)
}
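The formula can be checked without simulating anything by computing the population R^2 implied by an equicorrelated covariance matrix and confirming that it returns the target value. This is a sketch under the same assumptions as the derivation (unit variances, common correlation r); get_r and pop_r2 are names chosen here, and k denotes the number of predictors.

```r
# positive root of k*r^2 - (k-1)*R2*r - R2 = 0, for k predictors
get_r <- function(R2, k) {
  ((k - 1) * R2 + sqrt(((k - 1) * R2)^2 + 4 * k * R2)) / (2 * k)
}

# population R^2 of the first variable regressed on the other k,
# when all k + 1 variables have variance 1 and pairwise correlation r
pop_r2 <- function(r, k) {
  sigma <- matrix(r, k + 1, k + 1)
  diag(sigma) <- 1
  s_yx <- sigma[1, -1]                       # Cov(Y, X)
  drop(s_yx %*% solve(sigma[-1, -1], s_yx))  # Cov(Y,X) Sigma_XX^{-1} Cov(X,Y)
}

r3 <- get_r(0.77, 3)
r3
pop_r2(r3, 3)
pop_r2(get_r(0.5, 5), 5)
```

The value returned by pop_r2 is the population R^2; a sample of 100 rows will scatter around it, as the 0.7695 in the first example shows.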
This isn't an answer, but I wanted to share what I did. I don't believe optim can be used the way you want it to. I attempted a "brute force" method to find a dataset that could work, but the highest r-squared I "randomed" was 0.23:
# Initializing our boolean and counter.
rm(list = ls())
Done <- FALSE
count <- 1
maxr2 <- .000001
# I set y ahead of time.
y <- sample(5:15, 100, replace = TRUE)
# Running until an appropriate r-squared is found.
while(!Done) {
# Generating a sample data set to optimize y on.
a <- sample(5:15, 100, replace = TRUE)
b <- sample(5:15, 100, replace = TRUE)
c <- sample(5:15, 100, replace = TRUE)
data <- data.frame(y = y, a = a, b = b, c = c)
# Making our equation and making a linear model.
EQ <- "y ~ a + b + c" # Creating the equation.
model <- lm(EQ, data) # Running the model.
if (count != 1) { if (summary(model)$r.squared > maxr2) { maxr2 <- summary(model)$r.squared } }
r2 <- summary(model)$r.squared # Grabbing the r-squared.
print(r2) # Printing r-squared out to see what is popping out.
if (r2 <= 0.78 & r2 >= 0.76) { Done <- TRUE } # If the r-squared is satisfactory, pop it out.
count <- count + 1 # Incrementing our counter.
if (count >= 1000000) { Done <- TRUE ; print("A satisfactory r-squared was not found.") } # Setting this to run at most 1,000,000 times.
}
# `data` will be the data set whose model has an r-squared of about 0.77, if one was found.
The issue with optim is that it optimizes individual parameters, that is, single values. The first argument of optim is par, which is meant to be a vector of initial values for the parameters you want to optimize. This could be used to optimize an r-squared through some decay function that depends on several values (these would be your par values). However, in this case you're asking to optimize entire columns towards a target r-squared, which doesn't make sense (as far as I know) with optim.