I have an analysis I'd like to perform, and I have planned contrasts (not post hoc comparisons!) that I would like to make between treatment groups. The treatment group variable has k = 4 levels. I plan to make a total of 3 comparisons and therefore, if I understand correctly, I do not need to adjust the calculated p-values, since the number of comparisons is k - 1.
I would like to use the multcomp or lsmeans package in R to do this. My question is: does anyone know if it is possible to do this planned comparison WITHOUT any adjustment made to the confidence intervals (and p-values)? As far as I can tell from the vignettes and examples I've looked at, the summary.glht() function makes an adjustment by default, and it is not clear to me which option would undo this.
If someone requires a reproducible example, they could use this example that I found on http://www.ats.ucla.edu/stat/r/faq/testing_contrasts.htm:
library(multcomp)
hsb <- read.csv("http://www.ats.ucla.edu/stat/data/hsb2.csv")
m1 <- lm(read ~ socst + factor(ses) * factor(female), data = hsb)
summary(m1)
K <- matrix(c(0, 0, 1, -1, 0, 0, 0), 1)
t <- glht(m1, linfct = K)
summary(t)
Judging from your example, your question is a little strange. In my opinion, if you don't need an adjustment, you don't need the multcomp package at all (although in some situations it saves some time).
library(multcomp)
hsb <- read.csv("http://www.ats.ucla.edu/stat/data/hsb2.csv")
hsb$ses <- as.factor(hsb$ses)
m3 <- lm(read ~ socst + ses, data = hsb)
l3 <- glht(m3, linfct = mcp(ses = "Tukey"))
# mcp() doesn't work with some types of model. In that case, supply the contrast matrix directly.
# k3 <- matrix(c(0, 0, 1, 0,
# 0, 0, 0, 1,
# 0, 0, -1, 1), byrow = TRUE, ncol = 4)
# rownames(k3) <- c("2-1", "3-1", "3-2")
# l3 <- glht(m3, linfct = k3)
summary(l3, test=adjusted("none")) # this is the result without adjustment
# Estimate Std. Error t value Pr(>|t|)
# 2 - 1 == 0 0.6531 1.4562 0.448 0.654
# 3 - 1 == 0 2.7034 1.6697 1.619 0.107
# 3 - 2 == 0 2.0503 1.3685 1.498 0.136
hsb$ses <- relevel(hsb$ses, ref="2") # change of the order of levels
m3.2 <- lm(read ~ socst + ses, data = hsb)
summary(m3) # "Without adjustment" means it's equivalent to original model's statistics.
# Estimate Std. Error t value Pr(>|t|)
# :
# ses2 0.65309 1.45624 0.448 0.654
# ses3 2.70342 1.66973 1.619 0.107
summary(m3.2)
# Estimate Std. Error t value Pr(>|t|)
# :
# ses3 2.05033 1.36846 1.498 0.136
# When the model is an lmer object, summary(~, adjusted("none")) computes the p-values from z values, i.e. using N(0, 1).
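As a cross-check that does not depend on multcomp, an unadjusted contrast and its confidence interval can also be computed by hand from coef() and vcov(). The following is a minimal sketch, assuming the built-in iris data as a stand-in for hsb2; the model, the contrast vector k and all object names are illustrative and not taken from the original post:

```r
# Stand-in model: one covariate plus a 3-level factor (mirrors read ~ socst + ses)
fit <- lm(Sepal.Length ~ Petal.Width + Species, data = iris)

# Contrast "virginica - versicolor", in the order of coef(fit):
# (Intercept), Petal.Width, Speciesversicolor, Speciesvirginica
k <- c(0, 0, -1, 1)

est  <- sum(k * coef(fit))                               # contrast estimate
se   <- sqrt(drop(t(k) %*% vcov(fit) %*% k))             # its standard error
tval <- est / se
pval <- 2 * pt(-abs(tval), df = fit$df.residual)         # unadjusted p-value
ci   <- est + c(-1, 1) * qt(0.975, fit$df.residual) * se # unadjusted 95% CI

# The same numbers come from refitting with versicolor as the reference level
iris2 <- transform(iris, Species = relevel(Species, ref = "versicolor"))
refit <- lm(Sepal.Length ~ Petal.Width + Species, data = iris2)
coef(summary(refit))["Speciesvirginica", ]
confint(refit)["Speciesvirginica", ]
```

The hand-computed p-value and interval should agree with the row for Speciesvirginica in the refitted model, which is what "no adjustment" amounts to.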
I want to create a regression table with modelsummary (amazing package!!!) for multinomial logistic models run with nnet::multinom that includes clustered standard errors, as well as corresponding "significance" stars and summary statistics.
Unfortunately, I cannot do this automatically with the vcov parameter within modelsummary because the sandwich package that modelsummary uses does not support nnet objects.
I was able to calculate robust standard errors with a customized function originally developed by Daina Chiba and modified by Davenport, Soule, Armstrong (available from: https://journals.sagepub.com/doi/suppl/10.1177/0003122410395370/suppl_file/Davenport_online_supplement.pdf).
I was also able to include these standard errors in the modelsummary table instead of the original ones. Yet, neither the "significance" stars nor the model summary statistics adapt to these new standard errors. I think this is because they are calculated via broom::tidy automatically by modelsummary.
I would be thankful for any advice for how to include stars and summary statistics that correspond to the clustered standard errors and respective p-values.
Another, smaller question I have is whether there is any easy way of "spreading" the model statistics (e.g. number of observations or R2) such that they are centered below all response levels of the dependent variable and not just the first level. I am thinking about a multicolumn solution in LaTeX.
Here is some example code that includes how I calculate the standard errors. (Note that the calculated clustered SEs are extremely small because they don't make sense with the example mtcars data. The only takeaway is that the respective stars should correspond to the new SEs, and they don't.)
# load data
dat_multinom <- mtcars
dat_multinom$cyl <- sprintf("Cyl: %s", dat_multinom$cyl)
# run multinomial logit model
mod <- nnet::multinom(cyl ~ mpg + wt + hp, data = dat_multinom, trace = FALSE)
# function to calculate clustered standard errors
mlogit.clust <- function(model,data,variable) {
beta <- c(t(coef(model)))
vcov <- vcov(model)
k <- length(beta)
n <- nrow(data)
max_lev <- length(model$lev)
xmat <- model.matrix(model)
# u is deviance residuals times model.matrix
u <- lapply(2:max_lev, function(x)
residuals(model, type = "response")[, x] * xmat)
u <- do.call(cbind, u)
m <- dim(table(data[,variable]))
u.clust <- matrix(NA, nrow = m, ncol = k)
fc <- factor(data[,variable])
for (i in 1:k) {
u.clust[, i] <- tapply(u[, i], fc, sum)
}
cl.vcov <- vcov %*% ((m / (m - 1)) * t(u.clust) %*% (u.clust)) %*% vcov
return(cl.vcov = cl.vcov)
}
# get coefficients, variance, clustered standard errors, and p values
b <- c(t(coef(mod)))
var <- mlogit.clust(mod,dat_multinom,"am")
se <- sqrt(diag(var))
p <- (1-pnorm(abs(b/se))) * 2
library(modelsummary)
# modelsummary table with clustered standard errors and respective p-values
modelsummary(
mod,
statistic = "({round(se,3)}),[{round(p,3)}]",
shape = statistic ~ response,
stars = c('*' = .1, '**' = .05, '***' = .01)
)
# modelsummary table with original standard errors and respective p-values
modelsummary(
models = list(mod),
statistic = "({std.error}),[{p.value}]",
shape = statistic ~ response,
stars = c('*' = .1, '**' = .05, '***' = .01)
)
This code produces the following tables:
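|             | Model 1 / Cyl: 6 | Model 1 / Cyl: 8 |
|-------------|------------------|------------------|
| (Intercept) | 22.759*          | -6.096***        |
|             | (0.286),[0]      | (0.007),[0]      |
| mpg         | -38.699          | -46.849          |
|             | (5.169),[0]      | (6.101),[0]      |
| wt          | 23.196           | 39.327           |
|             | (3.18),[0]       | (4.434),[0]      |
| hp          | 6.722            | 7.493            |
|             | (0.967),[0]      | (1.039),[0]      |
| Num.Obs.    | 32               |                  |
| R2          | 1.000            |                  |
| R2 Adj.     | 0.971            |                  |
| AIC         | 16.0             |                  |
| BIC         | 27.7             |                  |
| RMSE        | 0.00             |                  |
Note: * p < 0.1, ** p < 0.05, *** p < 0.01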
|             | Model 1 / Cyl: 6  | Model 1 / Cyl: 8  |
|-------------|-------------------|-------------------|
| (Intercept) | 22.759*           | -6.096***         |
|             | (11.652),[0.063]  | (0.371),[0.000]   |
| mpg         | -38.699           | -46.849           |
|             | (279.421),[0.891] | (448.578),[0.918] |
| wt          | 23.196            | 39.327            |
|             | (210.902),[0.913] | (521.865),[0.941] |
| hp          | 6.722             | 7.493             |
|             | (55.739),[0.905]  | (72.367),[0.918]  |
| Num.Obs.    | 32                |                   |
| R2          | 1.000             |                   |
| R2 Adj.     | 0.971             |                   |
| AIC         | 16.0              |                   |
| BIC         | 27.7              |                   |
| RMSE        | 0.00              |                   |
Note: * p < 0.1, ** p < 0.05, *** p < 0.01
This is not super easy at the moment. I just opened a GitHub issue to track progress. This should be easy to improve, however, so I expect changes to be published in the next release of the package.
In the meantime, you can install the dev version of modelsummary:
library(remotes)
install_github("vincentarelbundock/modelsummary")
Then, you can use the tidy_custom mechanism described here to override standard errors and p-values manually:
library(modelsummary)
tidy_custom.multinom <- function(x, ...) {
b <- coef(x)
var <- mlogit.clust(x, dat_multinom, "am")
out <- data.frame(
term = rep(colnames(b), times = nrow(b)),
response = rep(row.names(b), each = ncol(b)),
estimate = c(t(b)),
std.error = sqrt(diag(var))
)
out$p.value <- (1-pnorm(abs(out$estimate / out$std.error))) * 2
row.names(out) <- NULL
return(out)
}
modelsummary(
mod,
output = "markdown",
shape = term ~ model + response,
stars = TRUE)
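|             | Model 1 / Cyl: 6 | Model 1 / Cyl: 8 |
|-------------|------------------|------------------|
| (Intercept) | 22.759***        | -6.096***        |
|             | (0.286)          | (0.007)          |
| mpg         | -38.699***       | -46.849***       |
|             | (5.169)          | (6.101)          |
| wt          | 23.196***        | 39.327***        |
|             | (3.180)          | (4.434)          |
| hp          | 6.722***         | 7.493***         |
|             | (0.967)          | (1.039)          |
| Num.Obs.    | 32               |                  |
| R2          | 1.000            |                  |
| R2 Adj.     | 0.971            |                  |
| AIC         | 16.0             |                  |
| BIC         | 27.7             |                  |
| RMSE        | 0.00             |                  |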
I am attempting to find the relationship between sustained wind speed and wind gust multipliers [gust multiplier = (wind gust value) / (sustained wind speed)]. I initially used a power model (y = ax^b) and linear regression to model them, but I would like to attempt to fit the data to a power model with two terms (y = ax^b + cx^d) as well.
In the power model with one term, I was able to take the log of each variable and then that allowed me to save the coefficients (a and b values) for the power model. I am unsure of how to alter my code to be able to create a power model with two terms.
model = lm(log(data$GustMult)~log(data$SusWindSpeed)) # not sure how to change for a power model with two terms
coefs = unname(coef(model))
coefa = exp(coefs[1])
coefb = coefs[2]
coefc = exp(coefs[3]) #added for new model
coefd = coefs[4] #added for new model
rsquared = summary(model)$r.squared
I attempted to run the code as it is above, however that only provides the values for the a and b coefficients still.
Thank you in advance for any suggestions you may have!
I think DanY might be right, that there is no unique set of parameters that would fit this model. In any case, I think you would need to fit this as a non-linear model.
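I don't know which argument DanY had in mind, but one way to see why the parameters of y = ax^b + cx^d are not uniquely determined is sketched below. The helper two_term and the parameter values are made up for illustration: swapping the two terms leaves the curve unchanged, and when the two exponents coincide only the sum a + c matters.

```r
# Two-term power curve y = a * x^b + c * x^d (hypothetical helper)
two_term <- function(x, a, b, c, d) a * x^b + c * x^d

x <- seq(1, 50, by = 0.5)

# 1) Swapping the two terms gives the same curve
y1 <- two_term(x, a = 2,   b = -0.3, c = 0.5, d = 0.2)
y2 <- two_term(x, a = 0.5, b = 0.2,  c = 2,   d = -0.3)
all.equal(y1, y2)

# 2) With equal exponents, different (a, c) pairs with the same sum coincide
y3 <- two_term(x, a = 1,   b = 0.1, c = 3,   d = 0.1)
y4 <- two_term(x, a = 3.5, b = 0.1, c = 0.5, d = 0.1)
all.equal(y3, y4)
```

Both comparisons return TRUE, so a fitting routine has no way to prefer one of these parameter sets over the other.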
I have no idea what your data looks like, but let's say it's something like this:
set.seed(69)
SusWindSpeed <- rgamma(1000, 2, .125)
GustMult <- (-0.98)*SusWindSpeed^(0.12) + (0.1)*SusWindSpeed^(-0.32) + rnorm(1000, 3, 0.025)
data <- data.frame(SusWindSpeed, GustMult)
plot(data$SusWindSpeed, data$GustMult)
Then you could try fitting a non-linear model like this:
modA <- nls(GustMult ~ var_a * SusWindSpeed^var_b + var_c * SusWindSpeed^var_d + Const,
start = list(var_a = -1, var_b = 0.1, var_c = 0.1, var_d = -0.3, Const = 3),
data = data,
control = list(maxiter = 500))
summary(modA)
#>
#> Formula: GustMult ~ var_a * SusWindSpeed^var_b + var_c * SusWindSpeed^var_d +
#> Const
#>
#> Parameters:
#> Estimate Std. Error t value Pr(>|t|)
#> var_a -1.28314 0.72635 -1.767 0.0776 .
#> var_b 0.10381 0.04387 2.367 0.0181 *
#> var_c 0.01506 0.03884 0.388 0.6983
#> var_d -1.45958 3.32064 -0.440 0.6604
#> Const 3.38496 0.76287 4.437 1.01e-05 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.02452 on 995 degrees of freedom
#>
#> Number of iterations to convergence: 22
#> Achieved convergence tolerance: 3.379e-07
And we can show that this is a reasonable fit:
test_SusWindSpeed <- seq(0, 60, 0.05)
plot.new()
plot(data$SusWindSpeed, data$GustMult)
points(test_SusWindSpeed,
predict(modA, newdata = list(SusWindSpeed = test_SusWindSpeed)),
col = "red")
However, you might find you struggle to get this model to converge, or that you get different results with different start parameters.
Created on 2020-07-09 by the reprex package (v0.3.0)
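To get a feeling for how sensitive the fit is to the starting values, one option is to run nls from several starts and compare the results. This is only a sketch on simulated data: the data, the three sets of starting values and the helper fit_one are assumptions made for illustration, and some starts may well fail to converge.

```r
set.seed(1)
x <- runif(300, 1, 50)
y <- 2 * x^(-0.3) + 0.5 * x^0.2 + rnorm(300, sd = 0.01)
dat <- data.frame(x, y)

# Candidate starting values (illustrative)
starts <- list(
  s1 = list(a = 2,   b = -0.3, c = 0.5, d = 0.2),
  s2 = list(a = 1,   b = -0.5, c = 1,   d = 0.1),
  s3 = list(a = 0.5, b = 0.5,  c = 0.5, d = -0.5)
)

# Fit from one start; a failed fit is recorded as NULL instead of stopping
fit_one <- function(s) {
  tryCatch(
    nls(y ~ a * x^b + c * x^d, data = dat, start = s,
        control = nls.control(maxiter = 500)),
    error = function(e) NULL
  )
}
fits <- lapply(starts, fit_one)

# Which starts converged, and with what residual sum of squares?
res <- data.frame(
  start     = names(starts),
  converged = !vapply(fits, is.null, logical(1)),
  rss       = vapply(fits, function(f) if (is.null(f)) NA_real_ else deviance(f),
                     numeric(1))
)
res
```

Starts that converge to clearly different residual sums of squares, or that do not converge at all, are a sign that the estimates should be interpreted with care.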
When we have a linear model with a factor variable X (with levels A, B, and C)
y ~ factor(X) + Var2 + Var3
The result shows the estimates XB and XC, which are the differences B - A and C - A (supposing that the reference level is A).
If we want to know the p-value of the difference between B and C: C - B,
we should designate B or C as a reference group and re-run the model.
Can we get the p-values of the effect B - A, C - A, and C - B at one time?
You are looking for a linear hypothesis test, i.e. the p-value of some linear combination of regression coefficients. Building on my answer to "How to conduct linear hypothesis test on regression coefficients with a clustered covariance matrix?", where we only considered a sum of coefficients, I will extend the function LinearCombTest to handle more general cases, taking alpha as the combination coefficients of the variables in vars:
LinearCombTest <- function (lmObject, vars, alpha, .vcov = NULL) {
## if `.vcov` missing, use the one returned by `lm`
if (is.null(.vcov)) .vcov <- vcov(lmObject)
## estimated coefficients
beta <- coef(lmObject)
## linear combination of `vars` with combination coefficients `alpha`
LinearComb <- sum(beta[vars] * alpha)
## get standard error of `LinearComb`
LinearComb_se <- sum(alpha * crossprod(.vcov[vars, vars], alpha)) ^ 0.5
## perform t-test on `LinearComb`
tscore <- LinearComb / LinearComb_se
pvalue <- 2 * pt(abs(tscore), lmObject$df.residual, lower.tail = FALSE)
## return a matrix
form <- paste0("(", paste(alpha, vars, sep = " * "), ")")
form <- paste0(paste0(form, collapse = " + "), " = 0")
matrix(c(LinearComb, LinearComb_se, tscore, pvalue), nrow = 1L,
dimnames = list(form, c("Estimate", "Std. Error", "t value", "Pr(>|t|)")))
}
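For reference, the quantities computed inside LinearCombTest can be written out as follows. The notation is mine, not from the original answer: \hat\beta denotes the coefficients selected by vars, \hat V the matching block of .vcov, and \nu the residual degrees of freedom.

```latex
\widehat{\operatorname{Var}}\left(\alpha^\top \hat\beta\right) = \alpha^\top \hat V \alpha,
\qquad
t = \frac{\alpha^\top \hat\beta}{\sqrt{\alpha^\top \hat V \alpha}},
\qquad
p = 2\,P\left(T_\nu > |t|\right)
```

With alpha = (1, -1) and vars = (xC, xB), the variance becomes Var(xC) + Var(xB) - 2 Cov(xC, xB), which is why the covariance between the two coefficients cannot be ignored.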
Consider a simple example, where we have a balanced design for three groups A, B and C, with group means 0.1, 1.1 and 2.1, respectively (the simulated means 0, 1, 2 are each shifted by 0.1).
x <- gl(3,100,labels = LETTERS[1:3])
set.seed(0)
y <- c(rnorm(100, 0), rnorm(100, 1), rnorm(100, 2)) + 0.1
fit <- lm(y ~ x)
coef(summary(fit))
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.1226684 0.09692277 1.265631 2.066372e-01
#xB 0.9317800 0.13706949 6.797866 5.823987e-11
#xC 2.0445528 0.13706949 14.916177 6.141008e-38
Since A is the reference level, xB is giving B - A while xC is giving C - A. Suppose we are now interested in the difference between group B and C, i.e., C - B, we can use
LinearCombTest(fit, c("xC", "xB"), c(1, -1))
# Estimate Std. Error t value Pr(>|t|)
#(1 * xC) + (-1 * xB) = 0 1.112773 0.1370695 8.118312 1.270686e-14
Note, this function is also handy to work out the group mean of B and C, that is (Intercept) + xB and (Intercept) + xC:
LinearCombTest(fit, c("(Intercept)", "xB"), c(1, 1))
# Estimate Std. Error t value Pr(>|t|)
#(1 * (Intercept)) + (1 * xB) = 0 1.054448 0.09692277 10.87926 2.007956e-23
LinearCombTest(fit, c("(Intercept)", "xC"), c(1, 1))
# Estimate Std. Error t value Pr(>|t|)
#(1 * (Intercept)) + (1 * xC) = 0 2.167221 0.09692277 22.36029 1.272811e-65
Alternative solution with lsmeans
Consider the above toy example again:
library(lsmeans)
lsmeans(fit, spec = "x", contr = "revpairwise")
#$lsmeans
# x lsmean SE df lower.CL upper.CL
# A 0.1226684 0.09692277 297 -0.06807396 0.3134109
# B 1.0544484 0.09692277 297 0.86370603 1.2451909
# C 2.1672213 0.09692277 297 1.97647888 2.3579637
#
#Confidence level used: 0.95
#
#$contrasts
# contrast estimate SE df t.ratio p.value
# B - A 0.931780 0.1370695 297 6.798 <.0001
# C - A 2.044553 0.1370695 297 14.916 <.0001
# C - B 1.112773 0.1370695 297 8.118 <.0001
#
#P value adjustment: tukey method for comparing a family of 3 estimates
The $lsmeans component returns the marginal group means, while $contrasts returns the pairwise group mean differences, since we have used the "revpairwise" contrast. Read p.32 of the lsmeans manual for the difference between "pairwise" and "revpairwise".
This is a useful cross-check against LinearCombTest: the estimates, standard errors and t statistics agree. Note that the p-values in $contrasts are Tukey-adjusted, as the last line of the output says.
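For the toy example specifically, base R can also produce all three comparisons in one call. This is a sketch that only applies to a one-way layout like the toy data; it does not carry over directly to the model in the question, which also contains Var2 and Var3. The object names unadj and tuk are illustrative.

```r
x <- gl(3, 100, labels = LETTERS[1:3])
set.seed(0)
y <- c(rnorm(100, 0), rnorm(100, 1), rnorm(100, 2)) + 0.1

# Unadjusted pairwise comparisons with a pooled SD
# (the same t-tests as the coefficients of lm(y ~ x))
unadj <- pairwise.t.test(y, x, p.adjust.method = "none", pool.sd = TRUE)
unadj$p.value

# Tukey-adjusted differences, intervals and p-values
tuk <- TukeyHSD(aov(y ~ x))
tuk$x
```

unadj$p.value is a lower-triangular matrix of p-values (rows B and C, columns A and B), and tuk$x has one row per pair (B-A, C-A, C-B).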
glht (general linear hypothesis testing) from multcomp package makes this sort of multiple hypothesis test easy without re-running a bunch of separate models. It is essentially crafting a customized contrast matrix based on your defined comparisons of interest.
Using your example comparisons and building on the data @ZheyuanLi provided:
x <- gl(3,100,labels = LETTERS[1:3])
set.seed(0)
y <- c(rnorm(100, 0), rnorm(100, 1), rnorm(100, 2)) + 0.1
fit <- lm(y ~ x)
library(multcomp)
my_ht <- glht(fit, linfct = mcp(x = c("B-A = 0",
"C-A = 0",
"C-B = 0")))
summary(my_ht) will give you the adjusted p-values for the comparisons of interest.
#Linear Hypotheses:
# Estimate Std. Error t value Pr(>|t|)
#B - A == 0 0.9318 0.1371 6.798 1.11e-10 ***
#C - A == 0 2.0446 0.1371 14.916 < 1e-10 ***
#C - B == 0 1.1128 0.1371 8.118 < 1e-10 ***
You could use the car package and its function linearHypothesis, which has a vcov. argument.
Set this to the variance-covariance matrix you want to use for your model.
The function takes either symbolic equations (as character strings) or a matrix to describe the system of equations that you would like to test.
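A minimal sketch of that call, reusing the toy data from the other answers. It assumes the car package is installed; here the model's own vcov() is passed, but a robust or clustered matrix could be supplied in its place.

```r
x <- gl(3, 100, labels = LETTERS[1:3])
set.seed(0)
y <- c(rnorm(100, 0), rnorm(100, 1), rnorm(100, 2)) + 0.1
fit <- lm(y ~ x)

if (requireNamespace("car", quietly = TRUE)) {
  # Test the hypothesis C - B = 0, written in terms of the coefficient names
  lh <- car::linearHypothesis(fit, "xC - xB = 0", vcov. = vcov(fit))
  print(lh)
}
```

For a single restriction like this one, the reported F statistic is the square of the corresponding t statistic.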
I'm trying to reproduce this Stata example and move from stargazer to texreg. The data is available here.
To run the regression and get the standard errors, I run this code:
library(readstata13)
library(sandwich)
cluster_se <- function(model_result, data, cluster){
model_variables <- intersect(colnames(data), c(colnames(model_result$model), cluster))
model_rows <- as.integer(rownames(model_result$model))
data <- data[model_rows, model_variables]
cl <- data[[cluster]]
M <- length(unique(cl))
N <- nrow(data)
K <- model_result$rank
dfc <- (M/(M-1))*((N-1)/(N-K))
uj <- apply(estfun(model_result), 2, function(x) tapply(x, cl, sum));
vcovCL <- dfc*sandwich(model_result, meat=crossprod(uj)/N)
sqrt(diag(vcovCL))
}
elemapi2 <- read.dta13(file = 'elemapi2.dta')
lm1 <- lm(formula = api00 ~ acs_k3 + acs_46 + full + enroll, data = elemapi2)
se.lm1 <- cluster_se(model_result = lm1, data = elemapi2, cluster = "dnum")
stargazer::stargazer(lm1, type = "text", style = "aer", se = list(se.lm1))
==========================================================
api00
----------------------------------------------------------
acs_k3 6.954
(6.901)
acs_46 5.966**
(2.531)
full 4.668***
(0.703)
enroll -0.106**
(0.043)
Constant -5.200
(121.786)
Observations 395
R2 0.385
Adjusted R2 0.379
Residual Std. Error 112.198 (df = 390)
F Statistic 61.006*** (df = 4; 390)
----------------------------------------------------------
Notes: ***Significant at the 1 percent level.
**Significant at the 5 percent level.
*Significant at the 10 percent level.
texreg produces this:
texreg::screenreg(lm1, override.se=list(se.lm1))
========================
Model 1
------------------------
(Intercept) -5.20
(121.79)
acs_k3 6.95
(6.90)
acs_46 5.97 ***
(2.53)
full 4.67 ***
(0.70)
enroll -0.11 ***
(0.04)
------------------------
R^2 0.38
Adj. R^2 0.38
Num. obs. 395
RMSE 112.20
========================
How can I fix the p-values?
Robust Standard Errors with texreg are easy: just pass the coeftest directly!
This has become much easier since the question was last answered: it appears you can now just pass the coeftest with the desired variance-covariance matrix directly. Downside: you lose the goodness-of-fit statistics (such as R^2 and number of observations), but depending on your needs, this may not be a big problem.
How to include robust standard errors with texreg
> screenreg(list(reg1, coeftest(reg1,vcov = vcovHC(reg1, 'HC1'))),
custom.model.names = c('Standard Standard Errors', 'Robust Standard Errors'))
=============================================================
Standard Standard Errors Robust Standard Errors
-------------------------------------------------------------
(Intercept) -192.89 *** -192.89 *
(55.59) (75.38)
x 2.84 ** 2.84 **
(0.96) (1.04)
-------------------------------------------------------------
R^2 0.08
Adj. R^2 0.07
Num. obs. 100
RMSE 275.88
=============================================================
*** p < 0.001, ** p < 0.01, * p < 0.05
To generate this example, I created a data frame with heteroscedasticity. See below for the full runnable sample code:
library(sandwich)
library(lmtest)  # provides coeftest()
library(texreg)
set.seed(1234)
df <- data.frame(x = 1:100);
df$y <- 1 + 0.5*df$x + 5*100:1*rnorm(100)
reg1 <- lm(y ~ x, data = df)
First, notice that your usage of as.integer is dangerous and likely to cause problems once you use data with non-numeric rownames. For instance, using the built-in dataset mtcars whose rownames consist of car names, your function will coerce all rownames to NA, and your function will not work.
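The problem is easy to reproduce with mtcars. This is a small sketch; the model is arbitrary, and the match()-based alternative is one possible workaround rather than the only one.

```r
fit <- lm(mpg ~ cyl + disp, data = mtcars)

# Row names of the model frame are car names, not numbers
head(rownames(fit$model), 3)

# Coercing them to integer yields NA for every row (with a warning)
idx <- suppressWarnings(as.integer(rownames(fit$model)))
all(is.na(idx))

# Matching on the row names works for both numeric and character row names
pos <- match(rownames(fit$model), rownames(mtcars))
identical(mtcars[pos, "mpg"], fit$model$mpg)
```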
To your actual question, you can provide custom p-values to texreg, which means that you need to compute the corresponding p-values. To achieve this, you could compute the variance-covariance matrix, compute the test-statistics, and then compute the p-value manually, or you just compute the variance-covariance matrix and supply it to e.g. coeftest. Then you can extract the standard errors and p-values from there. Since I am unwilling to download any data, I use the mtcars-data for the following:
library(sandwich)
library(lmtest)
library(texreg)
cluster_se <- function(model_result, data, cluster){
model_variables <- intersect(colnames(data), c(colnames(model_result$model), cluster))
model_rows <- rownames(model_result$model) # changed to be able to work with mtcars, not tested with other data
data <- data[model_rows, model_variables]
cl <- data[[cluster]]
M <- length(unique(cl))
N <- nrow(data)
K <- model_result$rank
dfc <- (M/(M-1))*((N-1)/(N-K))
uj <- apply(estfun(model_result), 2, function(x) tapply(x, cl, sum));
vcovCL <- dfc*sandwich(model_result, meat=crossprod(uj)/N)
vcovCL # return the clustered variance-covariance matrix explicitly
}
lm1 <- lm(formula = mpg ~ cyl + disp, data = mtcars)
vcov.lm1 <- cluster_se(model_result = lm1, data = mtcars, cluster = "carb")
standard.errors <- coeftest(lm1, vcov. = vcov.lm1)[,2]
p.values <- coeftest(lm1, vcov. = vcov.lm1)[,4]
texreg::screenreg(lm1, override.se=standard.errors, override.p = p.values)
And just for completeness' sake, let's do it manually:
t.stats <- abs(coefficients(lm1) / sqrt(diag(vcov.lm1)))
t.stats
(Intercept) cyl disp
38.681699 5.365107 3.745143
These are your t-statistics using the cluster-robust standard errors. The degree of freedom is stored in lm1$df.residual, and using the built in functions for the t-distribution (see e.g. ?pt), we get:
manual.p <- 2*pt(-t.stats, df=lm1$df.residual)
manual.p
(Intercept) cyl disp
1.648628e-26 9.197470e-06 7.954759e-04
Here, pt is the distribution function, and we want to compute the probability of observing a statistic at least as extreme as the one we observe. Since we are testing two-sided and the density is symmetric, we first take the left tail using the negative value, and then double it. This is identical to using 2*(1-pt(t.stats, df=lm1$df.residual)). Now, just to check that this yields the same result as before:
all.equal(p.values, manual.p)
[1] TRUE
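The equivalence of the two ways of writing the two-sided p-value can be checked on a few made-up t statistics. The values of t_stat and df below are arbitrary and not taken from the model above.

```r
t_stat <- c(0.5, 2.1, 3.7)   # arbitrary absolute t statistics
df     <- 29                 # arbitrary residual degrees of freedom

p_left  <- 2 * pt(-t_stat, df = df)                    # double the left tail
p_right <- 2 * (1 - pt(t_stat, df = df))               # double the right tail
p_upper <- 2 * pt(t_stat, df = df, lower.tail = FALSE) # same, via lower.tail

all.equal(p_left, p_right)
all.equal(p_left, p_upper)
```

For very large t statistics, the lower.tail = FALSE form (or the left-tail form) is numerically safer than 1 - pt(...), which loses precision when pt(...) is close to 1.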
I need to generate data for a given value of the coefficient of multiple determination.
For example, if I specify R^2 = 0.77, I want to generate data that yield a regression model with R^2 = 0.77,
but these data must lie in a certain range. For example, with a sample size of 100, I need 4 variables (x1 being the dependent variable) whose values range from 5 to 15. How can I do that?
I tried to use optim:
optim(0.77, fn, gr = NULL,
method = c("Nelder-Mead", "BFGS", "CG", "L-BFGS-B", "SANN",
"Brent"),
lower = 5, upper = 15,
control = list(), hessian = FALSE)
but I don't know how to create the function fn for my purpose. Please help me write this function.
First here's a solution:
library(mvtnorm)
get.r <- function(x) c((x+sqrt(x**2+3*x))/(3),(x-sqrt(x**2+3*x))/(3))
set.seed(123)
cv <- get.r(0.77)[1]
out <- rmvnorm(100,sigma=matrix(c(1,cv,cv,cv,cv,1,cv,cv,cv,cv,1,cv,cv,cv,cv,1),ncol=4))
out1 <- as.data.frame(10*(out-min(out))/diff(range(out))+5)
range(out1)
# [1] 5 15
lm1 <- lm(V1~V2+V3+V4,data=out1)
summary(lm1)
# Call:
# lm(formula = V1 ~ V2 + V3 + V4, data = out1)
#
# Residuals:
# Min 1Q Median 3Q Max
# -1.75179 -0.64323 -0.03397 0.64770 2.23142
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.36180 0.50940 0.710 0.479265
# V2 0.29557 0.09311 3.175 0.002017 **
# V3 0.31433 0.08814 3.566 0.000567 ***
# V4 0.35438 0.07581 4.674 9.62e-06 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.927 on 96 degrees of freedom
# Multiple R-squared: 0.7695, Adjusted R-squared: 0.7623
# F-statistic: 106.8 on 3 and 96 DF, p-value: < 2.2e-16
Now let me explain how I got there. We can construct this statistically. First we need to understand a little about correlation and covariance. One formula for correlation is
Corr(X, Y) = Cov(X,Y)/sqrt(Var(X)Var(Y))
And one formula for covariance is:
Cov(X,Y) = E(XY) - E(X)E(Y)
In your question you want to get the multiple correlation of the regression model:
Y = X1 + X2 + X3
Let's make this as simple as possible and force the variance of all variables to be 1 and let's make the pairwise correlation between any two variables to be equal and call it r.
Now we're looking for the square of the correlation between Y and X1 + X2 + X3, which is:
R^2 = [Cov(Y,X1 + X2 + X3)]^2/[Var(Y)Var(X1 + X2 + X3)]
Note that
Cov(Y,X1 + X2 + X3) = Cov(Y,X1) + Cov(Y,X2) + Cov(Y,X3)
Further note that the variance of each variable is 1 and the pairwise correlation is r, so the above result is equivalent to 3r.
Also note that
Var(X1 + X2 + X3) = Var(X1) + Var(X2) + Var(X3) + 2[Cov(X1,X2) + Cov(X1,X3) + Cov(X2,X3)].
Since the variance of each is 1, this is equivalent to 3 + 6r, so
R^2 = 9r^2/(3 + 6r) = 3r^2/(1 + 2r)
We can use the quadratic equation to solve for r and get
r = (R^2 +/- sqrt((R^2)^2+3R^2))/3
If we substitute R^2 = 0.77, then r = -0.3112633 or 0.8245966. We can use either to get what you need by using rmvnorm() within the mvtnorm package. And since R^2 is invariant to linear transformations, we can transform the resulting variables so that they fall between 5 and 15.
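As a numerical check of the derivation, the population R^2 implied by an equicorrelation structure can be computed directly from the correlation matrix. This is a sketch; the helper pop_R2 is illustrative and not part of the original solution.

```r
R2 <- 0.77

# Both roots of 3r^2 - 2*R2*r - R2 = 0
r <- (R2 + c(1, -1) * sqrt(R2^2 + 3 * R2)) / 3
r

# Population multiple R^2 of Y on p equicorrelated, unit-variance predictors
pop_R2 <- function(r, p = 3) {
  Sxx <- matrix(r, p, p)             # correlations among the predictors
  diag(Sxx) <- 1
  sxy <- rep(r, p)                   # Cov(Y, X_j) = r for every j
  drop(t(sxy) %*% solve(Sxx, sxy))   # sxy' Sxx^{-1} sxy
}

sapply(r, pop_R2)
```

Both roots reproduce R^2 = 0.77 in the population; the R^2 of any one simulated sample will differ slightly, as the 0.7695 in the output above shows.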
Update:
If we want to simulate with n predictors, we can use the following (note that I am not transforming the range of each predictor, but that can be done after the fact without altering the multiple R^2):
get.r <- function(x,n) c(((n-1)*x+sqrt(((n-1)*x)**2+4*n*x))/(2*n),
((n-1)*x-sqrt(((n-1)*x)**2+4*n*x))/(2*n))
sim.data <- function(R2, n) {
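## n predictors plus the response give an (n+1) x (n+1) correlation matrix;
## get.r() expects the number of predictors, i.e. n
sig.mat <- matrix(get.r(R2,n)[1],n+1,n+1)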
diag(sig.mat) <- 1
out <- as.data.frame(rmvnorm(100,sigma=sig.mat))
return(out)
}
This isn't an answer, but I wanted to share what I did. I don't believe optim can be used the way you want it to. I attempted a "brute force" method to find a dataset that could work, but the highest r-squared I "randomed" was 0.23:
# Initializing our boolean and counter.
rm(list = ls())
Done <- FALSE
count <- 1
maxr2 <- .000001
# I set y ahead of time.
y <- sample(5:15, 100, replace = TRUE)
# Running until an appropriate r-squared is found.
while(!Done) {
# Generating a sample data set to optimize y on.
a <- sample(5:15, 100, replace = TRUE)
b <- sample(5:15, 100, replace = TRUE)
c <- sample(5:15, 100, replace = TRUE)
data <- data.frame(y = y, a = a, b = b, c = c)
# Making our equation and making a linear model.
EQ <- "y ~ a + b + c" # Creating the equation.
model <- lm(EQ, data) # Running the model.
if (count != 1) { if (summary(model)$r.squared > maxr2) { maxr2 <- summary(model)$r.squared } }
r2 <- summary(model)$r.squared # Grabbing the r-squared.
print(r2) # Printing r-squared out to see what is popping out.
if (r2 <= 0.78 && r2 >= 0.76) { Done <- TRUE } # If the r-squared is satisfactory, pop it out.
count <- count + 1 # Incrementing our counter.
if (count >= 1000000) { Done <- TRUE ; print("A satisfactory r-squared was not found.") } # Setting this to run at most 1,000,000 times.
}
# Data will be your model that has an r-squared of 0.77 if you found one.
The issue with optim is that it optimizes individual parameters, i.e. single values. The first argument in optim is the par argument, which is meant to be a vector of initial values for the parameters you want to optimize. This could be used in optimizing an r-squared by some decay function that is dependent on several values (these would be your par values). However, in this case, you're asking to optimize entire columns towards a target r-squared, which doesn't make sense (as far as I know) with optim.
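One way to give optim a well-defined job here is to reduce the problem to a single parameter. The sketch below is not what the question asked for literally: it fixes the predictors and the noise once, and lets optim choose only the noise scale s so that the R^2 hits the target. All names (r2_for, fn, rescale) and the simulated design are assumptions made for illustration.

```r
set.seed(42)
n <- 100
X <- matrix(rnorm(n * 3), n, 3)     # three predictors, drawn once
signal <- drop(X %*% c(1, 1, 1))    # systematic part of the response
noise  <- rnorm(n)                  # noise, drawn once

# R^2 of the regression when the noise is scaled by s
r2_for <- function(s) {
  y <- signal + s * noise
  summary(lm(y ~ X))$r.squared
}

# Objective for optim: squared distance between R^2 and the target
fn <- function(s, target) (r2_for(s) - target)^2

# One-dimensional problem, so use Brent's method with finite bounds
opt <- optim(par = 1, fn = fn, target = 0.77,
             method = "Brent", lower = 0.01, upper = 20)

# Linear rescaling of each column to [5, 15] leaves R^2 unchanged
rescale <- function(v) 5 + 10 * (v - min(v)) / diff(range(v))
dat <- data.frame(y = rescale(signal + opt$par * noise), apply(X, 2, rescale))
r2_final <- summary(lm(y ~ ., data = dat))$r.squared
r2_final
```

Because only one number is optimized, the objective is cheap and well behaved, which is very different from asking optim to search over entire data columns.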