Using Debugger in RStudio... And stopping at the right point

I want to step through betas <- sapply(1:nosim, function(i) makelms(x1, x2, x3)) in the rgp1 function of the code below. I placed two breakpoints because just the one at the top didn't do the job; however, even with the stop at betas, when I press Next in the Console toolbar it moves right past the rgp1 function to the rgp2 function, skipping the line I want to debug.
How can I make the debugger go through the sapply line of code without skipping it? I tried clicking "Step into the current function call" (the {} icon) without success.
The code, which belongs to the {swirl} package, is:
makelms <- function(x1, x2, x3){
  # Simulate a dependent variable, y, as x1
  # plus a normally distributed error of mean 0 and
  # standard deviation .3.
  y <- x1 + rnorm(length(x1), sd = .3)
  # Find the coefficient of x1 in 3 nested linear
  # models, the first including only the predictor x1,
  # the second x1 and x2, the third x1, x2, and x3.
  c(coef(lm(y ~ x1))[2],
    coef(lm(y ~ x1 + x2))[2],
    coef(lm(y ~ x1 + x2 + x3))[2])
}
# Regressor generation process 1.
rgp1 <- function(){
  print("Processing. Please wait.")
  # number of samples per simulation
  n <- 100
  # number of simulations
  nosim <- 1000
  # set seed for reproducibility
  set.seed(4321)
  # Point A
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  x3 <- rnorm(n)
  # Point B
  betas <- sapply(1:nosim, function(i) makelms(x1, x2, x3))
  round(apply(betas, 1, var), 5)
}
# Regressor generation process 2.
rgp2 <- function(){
  print("Processing. Please wait.")
  # number of samples per simulation
  n <- 100
  # number of simulations
  nosim <- 1000
  # set seed for reproducibility
  set.seed(4321)
  # Point C
  x1 <- rnorm(n)
  x2 <- x1/sqrt(2) + rnorm(n)/sqrt(2)
  x3 <- x1 * 0.95 + rnorm(n) * sqrt(1 - 0.95^2)
  # Point D
  betas <- sapply(1:nosim, function(i) makelms(x1, x2, x3))
  round(apply(betas, 1, var), 5)
}
I tried the following:
debug(rgp1)
rgp1()
and it went through all the steps without getting into the sapply part, which is what I really wanted to debug.
I also tried inserting browser() in several locations, as well as debugonce() without success.
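One approach that usually works here (a sketch of the idea, not a line from the original post): debug(rgp1) only steps over rgp1's own top-level expressions, so the whole sapply() call is evaluated in a single step. To stop inside the code that sapply() runs on every iteration, flag makelms() itself, or put a browser() in the anonymous function:
# Flag the worker function; the debugger then stops inside it on each sapply() iteration
debug(makelms)
rgp1()
# ... step through with n / c, then switch the flag off again
undebug(makelms)
# Alternatively, edit rgp1() so the anonymous function pauses before calling makelms():
# betas <- sapply(1:nosim, function(i) { browser(); makelms(x1, x2, x3) })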

Related

Coverage probability problem for moving block bootstrap

I am applying the moving block bootstrap (MBB) to a regression model using time series data. When I calculated the coverage probability of the estimators derived from the MBB, the outcomes were anomalous except for one coefficient (the coefficient for x1, which was set to be a continuous variable). Given that MBB is a well-established method (see https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.713.1262&rep=rep1&type=pdf and https://en.wikipedia.org/wiki/Bootstrapping_(statistics)), I was wondering if there is something wrong with my code. I appreciate any input!
set.seed(63)
# create a function to generate time series data
tsfunc3 <- function(size = 30, ar = 0.7) {
  ar.epsilon <- arima.sim(list(order = c(1, 0, 0), ar = ar), n = size, sd = 2)
  x1 <- rnorm(size)
  x2 <- sample(1:5, size, replace = TRUE, prob = c(0.2, 0.2, 0.2, 0.2, 0.2))
  x3 <- rbinom(size, 1, 0.5)
  # A combination of continuous predictor x1, ordinal predictor x2 and binary predictor x3
  y <- as.numeric(5 + 0.25*x1 + 0.4*x2 + 0.8*x3 + ar.epsilon)
  data.frame(time = 1:size, x1 = x1, x2 = x2, x3 = x3, y = y)
}
#A time series
tdat <- tsfunc3()
# Block length derived from the data based on the approach proposed by Politis & White
#(2003):
b <- 3
#Initial values
#blocks=tdat[1:3,c(2,3,4,5)]
n <- 30
#A sequence of blocks
blocks <- lapply(seq_len(n-b+1), function(i) seq(i, i+b-1))
#MBB for intercept estimator
IntMbb <- function() {
  take.blocks <- sample(1:28, 10, replace = TRUE)
  newdat <- tdat[unlist(blocks[take.blocks]), ]
  x1 <- unlist(newdat["x1"])
  x2 <- unlist(newdat["x2"])
  x3 <- unlist(newdat["x3"])
  y <- unlist(newdat["y"])
  regmbb <- lm(y ~ x1 + x2 + x3)
  confint(regmbb, "(Intercept)", level = 0.95)
}
#MBB for x1 coefficient estimator
B1Mbb <- function() {
  take.blocks <- sample(1:28, 10, replace = TRUE)
  newdat <- tdat[unlist(blocks[take.blocks]), ]
  x1 <- unlist(newdat["x1"])
  x2 <- unlist(newdat["x2"])
  x3 <- unlist(newdat["x3"])
  y <- unlist(newdat["y"])
  regmbb <- lm(y ~ x1 + x2 + x3)
  confint(regmbb, "x1", level = 0.95)
}
#MBB for x2 coefficient estimator
B2Mbb <- function() {
  take.blocks <- sample(1:28, 10, replace = TRUE)
  newdat <- tdat[unlist(blocks[take.blocks]), ]
  x1 <- unlist(newdat["x1"])
  x2 <- unlist(newdat["x2"])
  x3 <- unlist(newdat["x3"])
  y <- unlist(newdat["y"])
  regmbb <- lm(y ~ x1 + x2 + x3)
  confint(regmbb, "x2", level = 0.95)
}
#MBB for x3 coefficient estimator
B3Mbb <- function() {
  take.blocks <- sample(1:28, 10, replace = TRUE)
  newdat <- tdat[unlist(blocks[take.blocks]), ]
  x1 <- unlist(newdat["x1"])
  x2 <- unlist(newdat["x2"])
  x3 <- unlist(newdat["x3"])
  y <- unlist(newdat["y"])
  regmbb <- lm(y ~ x1 + x2 + x3)
  confint(regmbb, "x3", level = 0.95)
}
#Replications
set.seed(47)
R <- 100
int.mbb <- replicate(R, IntMbb(), simplify=FALSE)
b1.mbb <- replicate(R, B1Mbb(), simplify=FALSE)
b2.mbb <- replicate(R, B2Mbb(), simplify=FALSE)
b3.mbb <- replicate(R, B3Mbb(), simplify=FALSE)
#Calculate coverage probability for intercept estimator
int.ci <- t(sapply(int.mbb, function(x, y) x[grep(y, rownames(x)), ], "Intercept"))
sum(int.ci[,"2.5 %"] <=5 & 5 <= int.ci[,"97.5 %"])/R
[1] 0.34
#Calculate coverage probability for x1 coefficient estimator
int.ci <- t(sapply(b1.mbb, function(x, y) x[grep(y, rownames(x)), ], "x1"))
sum(int.ci[,"2.5 %"] <=0.25 & 0.25 <= int.ci[,"97.5 %"])/R
[1] 0.9
#Calculate coverage probability for x2 coefficient estimator
int.ci <- t(sapply(b2.mbb, function(x, y) x[grep(y, rownames(x)), ], "x2"))
sum(int.ci[,"2.5 %"] <=0.4 & 0.4 <= int.ci[,"97.5 %"])/R
[1] 0.38
#Calculate coverage probability for x3 coefficient estimator
int.ci <- t(sapply(b3.mbb, function(x, y) x[grep(y, rownames(x)), ], "x3"))
sum(int.ci[,"2.5 %"] <=0.8 & 0.8 <= int.ci[,"97.5 %"])/R
[1] 0.33
As you can see, only the coverage probability for the x1 coefficient estimator is OK. So is there anything wrong with my code? Or does this have something to do with MBB itself?
You're not really evaluating the coverage probabilities for the bootstrap. You need to build the confidence interval from the bootstrapped statistics, not make confidence intervals from the parametric models run on the bootstrapped samples. Here's how I would do it.
First, we can generate the data:
set.seed(45301)
b <- 3
n <- 30
nblocks <- ceiling(n/b)
blocks <- lapply(seq_len(n-b+1), function(i) seq(i, i+b-1))
#A time series
tdat <- tsfunc3(size=n, ar=.7)
Next, we could write a function that we will bootstrap. This function generates the bootstrap sample, runs the regression and saves the coefficients.
bsfun <- function(data, blocks){
  samp.data <- data[sample(1:length(blocks), length(blocks), replace = TRUE), ]
  mod <- lm(y ~ x1 + x2 + x3, data = samp.data)
  coef(mod)
}
Next, we can run the function lots of times. Note that to generate a reliable 95% percentile confidence interval, you should have in the neighborhood of 1500-2500 bootstrap statistics. The farther the quantile you're trying to characterize is in the tails, the more bootstrap samples you need. So, the code below generates a single set of bootstrap coefficients:
out <- t(replicate(1000, bsfun(data=tdat, blocks=blocks)))
From this one set of bootstrap statistics, we can make a single confidence interval.
ci1 <- t(apply(out, 2, quantile, probs=c(.025,.975), na.rm=TRUE))
# 2.5% 97.5%
# (Intercept) -0.3302237 10.258229
# x1 -1.7577214 2.301975
# x2 -0.8016478 2.049435
# x3 -3.0723869 6.190383
If you want to investigate the coverage probabilities of these intervals, you would have to do what I did above, lots of times (we'll do 100, though to get better estimates, you would probably want to do more). We could then write a little function that would evaluate the coverage of one set of estimates:
eval_cover <- function(true = c(5, .25, .4, .8), obs){
  out <- as.numeric(obs[,1] < true & obs[,2] > true)
  names(out) <- rownames(obs)
  out
}
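The step that actually builds outci, the list of bootstrap confidence intervals (one per replication), is not shown above; a sketch of how it could be generated, reusing bsfun() and the percentile step from before:
# 100 replications, each producing one percentile CI from 1000 bootstrap coefficient vectors
outci <- replicate(100, {
  out <- t(replicate(1000, bsfun(data = tdat, blocks = blocks)))
  t(apply(out, 2, quantile, probs = c(.025, .975), na.rm = TRUE))
}, simplify = FALSE)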
Then, you could apply that function to each of the bootstrap confidence intervals you generated. Using the rowMeans() function will get the mean of the coverage 1/0 values, which will be the coverage probability. In this case, using only 100 intervals, the coverage is 100%.
rowMeans(sapply(outci, function(x)eval_cover(obs=x)))
# (Intercept) x1 x2 x3
# 1 1 1 1

Equivalent of Stata command `simulate` in R for Monte Carlo Simulation

I am searching for an equivalent function in R of the extremely convenient Stata command simulate. The command basically allows you to declare a program (reg_simulation in the example below) and then invoke such a program from simulate and store desired outputs.
Below is a Stata illustration of the usage of the simulate program, together with my attempt to replicate it using R.
Finally, my main question is: is this how R users would run a Monte Carlo simulation, or am I missing something in terms of structure or speed bottlenecks? Thanks a lot in advance.
Stata example
Defining reg_simulation program.
clear all
*Define "reg_simulation" to be used later on by "simulate" command
program reg_simulation, rclass
*Declaring Stata version
version 13
*Dropping all variables in memory
drop _all
*Set sample size (n=100)
set obs 100
*Simulate model
gen x1 = rnormal()
gen x2 = rnormal()
gen y = 1 + 0.5 * x1 + 1.5 *x2 + rnormal()
*Estimate OLS
reg y x1 x2
*Store coefficients
matrix B = e(b)
return matrix betas = B
end
Calling reg_simulation from the simulate command:
*Set seed
set seed 1234
*Run the actual simulation 10 times using "reg_simulation"
simulate , reps(10) nodots: reg_simulation
Obtained result (stored data in memory)
_b_x1 _b_x2 _b_cons
.4470155 1.50748 1.043514
.4235979 1.60144 1.048863
.5006762 1.362679 .8828927
.5319981 1.494726 1.103693
.4926634 1.476443 .8611253
.5920001 1.557737 .8391003
.5893909 1.384571 1.312495
.4721891 1.37305 1.017576
.7109139 1.47294 1.055216
.4197589 1.442816 .9404677
R replication of the Stata program above.
Using R I have managed to get the following (not an R expert, though). However, the part that worries me the most is the for-loop structure that loops over the number of repetitions nreps.
Defining reg_simulation function.
#Defining a function
reg_simulation <- function(obs = 1000){
  data <- data.frame(
    #Generate data
    x1 <- rnorm(obs, 0, 1),
    x2 <- rnorm(obs, 0, 1),
    y <- 1 + 0.5*x1 + 1.5*x2 + rnorm(obs, 0, 1))
  #Estimate OLS
  ols <- lm(y ~ x1 + x2, data = data)
  return(ols$coefficients)
}
Calling reg_simulation 10 times using a for-loop structure:
#Generate list to store results from simulation
results_list <- list()
# N repetitions
nreps <- 10
for (i in 1:nreps) {
  #Set seed internally (to get different values in each run)
  set.seed(i)
  #Save results into list
  results_list[i] <- list(reg_simulation(obs = 1000))
}
#unlist results
df_results <- data.frame(t(sapply(results_list,
                                  function(x) x[1:max(lengths(results_list))])))
Obtained result: df_results.
#final results
df_results
# X.Intercept. x1 x2
# 1 1.0162384 0.5490488 1.522017
# 2 1.0663263 0.4989537 1.496758
# 3 0.9862365 0.5144083 1.462388
# 4 1.0137042 0.4767466 1.551139
# 5 0.9996164 0.5020535 1.489724
# 6 1.0351182 0.4372447 1.444495
# 7 0.9975050 0.4809259 1.525741
# 8 1.0286192 0.5253288 1.491966
# 9 1.0107962 0.4659812 1.505793
# 10 0.9765663 0.5317318 1.501162
You're on the right track. Couple of hints/corrections:
1. Don't use <- inside data.frame()
In R, we construct data frames using = for internal column assignment, i.e. data.frame(x = 1:10, y = 11:20) rather than data.frame(x <- 1:10, y <- 11:20).
(There's more to be said about <- vs =, but I don't want to distract from your main question.)
In your case, you don't actually even need to create a data frame since x1, x2 and y will all be recognized as "global" variables within the scope of the function. I'll post some code at the end of my answer demonstrating this.
2. Pre-allocate the list when growing it via a for loop
Always try to pre-allocate the list length (and type) if you are going to grow it in a (long) for loop. Reason: that way, R knows how much memory to allocate to your object efficiently. In the case where you are only doing 10 reps, that would mean starting with something like:
results_list <- vector("list", 10)
3. Consider using lapply instead of for
for loops have a bit of bad rep in R. (Somewhat unfairly, but that's a story for another day.) An alternative that many R users would consider is the functional programming approach offered by lapply. I'll hold off on showing you the code for a second, but it will look very similar to a for loop. Just to note quickly, following on from point 2, that one immediate benefit is that you don't need to pre-allocate the list with lapply.
4. Run large loops in parallel
A Monte Carlo simulation is an ideal candidate for running everything in parallel, since each iteration is supposed to be independent of the others. An easy way to go parallel in R is via the future.apply package.
Putting everything together, here's how I'd probably do your simulation. Note that this might be more "advanced" than you possibly need, but since I'm here...
library(data.table) ## optional, but what I'll use to coerce the list into a DT
library(future.apply) ## for parallel stuff
plan(multisession) ## use all available cores
obs <- 1e3
# Defining a function
reg_simulation <- function(...){
  x1 <- rnorm(obs, 0, 1)
  x2 <- rnorm(obs, 0, 1)
  y <- 1 + 0.5*x1 + 1.5*x2 + rnorm(obs, 0, 1)
  #Estimate OLS
  ols <- lm(y ~ x1 + x2)
  # return(ols$coefficients)
  return(as.data.frame(t(ols$coefficients)))
}
# N repetitions
nreps <- 10
## Serial version
# results <- lapply(1:nreps, reg_simulation)
## Parallel version
results <- future_lapply(1:nreps, reg_simulation, future.seed = 1234L)
## Unlist / convert into a data.table
results <- rbindlist(results)
So, following up on the comments, you want to vary your independent variables (x) and also the error term and simulate the coefficients, but you also want to catch errors if any occur. The following would do the trick:
set.seed(42)
#Defining a function
reg_simulation <- function(obs = 1000){
  data <- data.frame(
    #Generate data
    x1 <- rnorm(obs, 0, 1),
    x2 <- rnorm(obs, 0, 1),
    y <- 1 + 0.5*x1 + 1.5*x2 + rnorm(obs, 0, 1))
  #Estimate OLS
  tryCatch(
    {
      ols <- lm(y ~ x1 + x2, data = data)
      return(ols$coefficients)
    },
    error = function(e){
      return(c('(Intercept)' = NA, 'x1' = NA, 'x2' = NA))
    }
  )
}
output <- t(data.frame(replicate(10, reg_simulation())))
output
(Intercept) x1 x2
X1 0.9961328 0.4782010 1.481712
X2 1.0234698 0.4801982 1.556393
X3 1.0336289 0.5239380 1.435468
X4 0.9796523 0.5095907 1.493548
...
Here, tryCatch (see also failwith) catches the error and returns NA as the default value.
Note that you only need to set the seed once, because the seed changes automatically with every call to the random number generator, in a deterministic fashion.

How to run different multiple linear regressions in R, Excel/VBA on a time series data for all different combinations of Explanatory Variables?

I am new to coding and R and would like your help. For my analysis, I am trying to run regressions on time series data with 1 dependent variable (Y) and 4 independent variables (X1, X2, X3, X4). All these variables (Y and the Xs) have 4 different transformations (for example, for X1: X1, SQRT(X1), Square(X1) and Ln(X1)). I want to run the regressions for all the possible combinations of Y (Y, SQRT(Y), Square(Y), Ln(Y)) and all the combinations of X values, so that in the end I can decide, by looking at the R-squared value, which variable to choose in which of its transformations.
I am currently using the code in R for linear regression and changing the variables manually, which is taking a lot of time. Maybe there is a loop or something I can use for the regressions? Waiting for your kind help. Thanks
lm(Y ~ X1 + X2 + X3 + X4)
lm(SQRT(Y) ~ X1 + X2 + X3 + X4)
lm(Square(Y) ~ X1 + X2 + X3 + X4)
lm(Ln(Y) ~ X1 + X2 + X3 + X4)
lm(Y ~ SQRT(X1) + X2 + X3 + X4)
lm(Y ~ Square(X1) + X2 + X3 + X4)
....
lm(ln(Y)~ ln(X1) + ln(X2) + ln(X3) + ln(X4))
This is my original code.
Regression10 <- lm(Final_Data_v2$`10 KW Installations (MW)`~Final_Data_v2$`10 KW Prio Installations (MW)`+Final_Data_v2$`FiT 10 KW (Cent/kWh)`+Final_Data_v2$`Electricity Prices 10 kW Cent/kW`+Final_Data_v2$`PV System Price (Eur/W)`)
summary(Regression10)
Regressionsqrt10 <- lm(Final_Data_v2$`SQRT(10 KW Installations (MW))`~Final_Data_v2$`10 KW Prio Installations (MW)`+Final_Data_v2$`FiT 10 KW (Cent/kWh)`+Final_Data_v2$`Electricity Prices 10 kW Cent/kW`+Final_Data_v2$`PV System Price (Eur/W)`)
summary(Regressionsqrt10)
And so on..
Here is the link to my DATA: LINK
This picks the transformations of the Y and X variables such that adjusted R-squared is maximized. This statistical approach will almost certainly lead to spurious results, though.
# simulate some data
library(magrittr)  # provides the %>% pipe used below
set.seed(0)
df <- data.frame(Y = runif(100),
                 X1 = runif(100),
                 X2 = runif(100),
                 X3 = runif(100),
                 X4 = runif(100))
# create new variables for log/sqrt transformations of every X and Y
for(x in names(df)){
  df[[paste0(x, "_log")]] <- log(df[[x]])
  df[[paste0(x, "_sqrt")]] <- sqrt(df[[x]])
}
# all combinations of Y and X's
yVars <- names(df)[substr(names(df),1,1)=='Y']
xVars <- names(df)[substr(names(df),1,1)=='X']
df2 <- combn(c(yVars, xVars), 5) %>% data.frame()
# Ensure that formula is in form of some Y, some X1, some X2...
valid <- function(x){
  ifelse(grepl("Y", x[1]) &
           grepl("X1", x[2]) &
           grepl("X2", x[3]) &
           grepl("X3", x[4]) &
           grepl("X4", x[5]), T, F)
}
df2 <- df2[, sapply(df2, valid)]
# Create the formulas
formulas <- sapply(names(df2), function(x){
  paste0(df2[[x]][1], " ~ ",
         df2[[x]][2], " + ",
         df2[[x]][3], " + ",
         df2[[x]][4], " + ",
         df2[[x]][5])
})
# Run linear model for each formula
models <- lapply(formulas, function(x) summary(lm(as.formula(x), data=df)))
# Return the formula that maximizes R-squared
formulas[which.max(sapply(models, function(x) x[['adj.r.squared']]))]
"Y ~ X1 + X2 + X3 + X4_log"
Consider expand.grid to build all combinations of variables and transformation types, filtering the candidate terms on each variable name using grep. Then call a model function that takes a dynamic formula with Map (a wrapper to mapply) to build a list of lm objects, one per combination, N = 1,024 items in total.
Below, the square root and square transformations are run as the equivalent polynomial operations inside I(). Because the column names contain spaces and parentheses, they are wrapped in backticks and matched with grep(..., fixed = TRUE); those patterns are the only adjustment required for your actual variable names.
# WRAP EACH COLUMN NAME IN BACKTICKS SO NON-SYNTACTIC NAMES WORK INSIDE FORMULAS AND I()
vars <- paste0("`", names(Final_Data_v2), "`")
coeffs <- c(vars,
            paste0("I(", vars, "^(1/2))"),
            paste0("I(", vars, "^2)"),
            paste0("log(", vars, ")"))
# BUILD DATA FRAME OF ALL COMBNS OF VARIABLE AND TRANSFORMATION TYPES
# (fixed = TRUE so the parentheses in the names are matched literally)
all_combns <- expand.grid(y_var  = coeffs[grep("10 KW Installations (MW)", coeffs, fixed = TRUE)],
                          x_var1 = coeffs[grep("10 KW Prio Installations (MW)", coeffs, fixed = TRUE)],
                          x_var2 = coeffs[grep("FiT 10 KW (Cent/kWh)", coeffs, fixed = TRUE)],
                          x_var3 = coeffs[grep("Electricity Prices 10 kW Cent/kW", coeffs, fixed = TRUE)],
                          x_var4 = coeffs[grep("PV System Price (Eur/W)", coeffs, fixed = TRUE)],
                          stringsAsFactors = FALSE)
# FUNCTION WITH DYNAMIC FORMULA TO RECEIVE ALL POLYNOMIAL TYPES
proc_model <- function(y, x1, x2, x3, x4) {
  myformula <- paste0(y, " ~ ", x1, " + ", x2, " + ", x3, " + ", x4)
  summary(lm(as.formula(myformula), data = Final_Data_v2))
}
# MAP CALL PASSING COLUMN VALUES ELEMENTWISE AS FUNCTION PARAMS
lm_list <- with(all_combns, Map(proc_model, y_var, x_var1, x_var2, x_var3, x_var4))
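A possible follow-up (one way to use the result, since the goal was to compare fits): pull the adjusted R-squared out of each summary and rank the combinations.
adj_r2 <- sapply(lm_list, function(s) s$adj.r.squared)
# label each fit with its combination of transformed variables so the winners are identifiable
names(adj_r2) <- do.call(paste, c(all_combns, sep = " | "))
head(sort(adj_r2, decreasing = TRUE))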

When simulating multivariate data for regression, how can I set the R-squared (example code included)?

I am trying to simulate a three-variable dataset so that I can run linear regression models on it. 'X1' and 'X2' would be continuous independent variables (mean=0, sd=1), and 'Y' would be the continuous dependent variable.
The regression model fit to these variables should produce coefficients like this:
Y = 5 + 3(X1) - 2(X2)
I would like to simulate this dataset such that the resulting regression model has an R-squared value of 0.2. How can I determine the value of 'sd.value' so that the regression model has this R-squared?
n <- 200
set.seed(101)
sd.value <- 1
X1 <- rnorm(n, 0, 1)
X2 <- rnorm(n, 0, 1)
Y <- rnorm(n, (5 + 3*X1 - 2*X2), sd.value)
simdata <- data.frame(X1, X2, Y)
summary(lm(Y ~ X1 + X2, data=simdata))
Take a look at this code, it should be close enough to what you want:
simulate <- function(n.obs = 10^4, beta = c(5, 3, -2), R.sq = 0.8) {
  stopifnot(length(beta) == 3)
  df <- data.frame(x1 = rnorm(n.obs), x2 = rnorm(n.obs))  # x1 and x2 are independent
  var.epsilon <- (beta[2]^2 + beta[3]^2) * (1 - R.sq) / R.sq
  stopifnot(var.epsilon > 0)
  df$epsilon <- rnorm(n.obs, sd = sqrt(var.epsilon))
  df$y <- with(df, beta[1] + beta[2]*x1 + beta[3]*x2 + epsilon)
  return(df)
}
get.R.sq <- function(desired) {
  model <- lm(y ~ x1 + x2, data = simulate(R.sq = desired))
  return(summary(model)$r.squared)
}
df <- data.frame(desired.R.sq=seq(from=0.05, to=0.95, by=0.05))
df$actual.R.sq <- sapply(df$desired.R.sq, FUN=get.R.sq)
plot(df)
abline(a=0, b=1, col="red", lty=2)
Basically your question comes down to figuring out the expression for var.epsilon. Since we have y = b1 + b2*x1 + b3*x2 + epsilon, and the Xs and epsilon are all independent, we have var[y] = b2^2 * var[x1] + b3^2 * var[x2] + var[eps], where var[x1] = var[x2] = 1 by assumption. R-squared is the explained share of that variance, R.sq = (b2^2 + b3^2) / (b2^2 + b3^2 + var[eps]); solving for var[eps] gives var[eps] = (b2^2 + b3^2) * (1 - R.sq) / R.sq, which is the expression used in the function above.
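As a quick check against the numbers in the question (coefficients 3 and -2, target R-squared of 0.2), that expression gives var[eps] = (3^2 + 2^2) * (1 - 0.2) / 0.2 = 52, i.e. sd.value should be about sqrt(52) ≈ 7.2:
sqrt((3^2 + 2^2) * (1 - 0.2) / 0.2)
# [1] 7.211103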
So the formula for R^2 is 1-var(residual)/var(total)
In this case, the variance of Y is going to be 3^2+2^2+sd.value^2, since we are adding three independent random variables. And, asymptotically, the residual variance is going to be simply sd.value^2.
So you can compute rsquared explicitly with this function:
rsq<-function(x){1-x^2/(9+ 4+x^2)}
With a little algebra (solve x^2/(13 + x^2) = 1 - R^2 for x), you can compute the inverse of this function:
rsqi<-function(x){sqrt(13)*sqrt((1-x)/x)}
So setting sd.value<-rsqi(rsquared) should give you what you want.
We can test this as follows:
simrsq <- function(x){
  Y <- rnorm(n, (5 + 3*X1 - 2*X2), rsqi(x))
  simdata <- data.frame(X1, X2, Y)
  summary(lm(Y ~ X1 + X2, data = simdata))$r.squared
}
> meanrsq<-rep(0,9)
> for(i in 1:50)
+ meanrsq<-meanrsq+Vectorize(simrsq)((1:9)/10)
> meanrsq/50
[1] 0.1031827 0.2075984 0.3063701 0.3977051 0.5052408 0.6024988 0.6947790
[8] 0.7999349 0.8977187
So it looks to be correct.
This is how I would do it (blind iterative algorithm, assuming no knowledge, for when you are purely interested in "how to simulate this"):
simulate.sd <- function(nsim = 10, n = 200, seed = 101, tol = 0.01) {
  set.seed(seed)
  sd.value <- 1
  rsquare <- 1:nsim
  results <- 1:nsim
  for (i in 1:nsim) {
    # tracking iteration: if we miss the value, abort at sd.value > 7.
    iter <- 0
    while (rsquare[i] > (0.20 + tol) | rsquare[i] < (0.2 - tol)) {
      sd.value <- sd.value + 0.01
      rsquare[i] <- simulate.sd.iter(sd.value, n)
      iter <- iter + 1
      if (iter > 3000) { break }
    }
    results[i] <- sd.value # store the current sd.value that is OK!
    sd.value <- 1
  }
  cbind(results, rsquare)
}
simulate.sd.iter <- function(sd.value, n = 200) { # helper function
  # Takes the sd.value, creates data, and returns the r-squared
  X1 <- rnorm(n, 0, 1)
  X2 <- rnorm(n, 0, 1)
  Y <- rnorm(n, (5 + 3*X1 - 2*X2), sd.value)
  simdata <- data.frame(X1, X2, Y)
  return(summary(lm(Y ~ X1 + X2, data = simdata))$r.squared)
}
simulate.sd()
A few things to note:
I let X1 and X2 vary, since this affects the sought sd.value.
The tolerance is how exact you want this estimate to be. Are you fine with an r-squared of ~0.19 or ~0.21? Have the tolerance be 0.01.
Note that a too precise tolerance might not allow you to find a result.
The value of 1 is quite a bad starting value, making this iterative algorithm quite slow.
The resulting vector for 10 results is:
[1] 5.64 5.35 5.46 5.42 5.79 5.39 5.64 5.62 4.70 5.55,
which takes roughly 13 seconds on my machine.
My next step would be to start from 4.5, add 0.001 to the iteration instead of 0.01, and perhaps lower the tolerance. Good luck!
Alright, some summary statistics for nsim=100, taking 150 seconds, with a step increase of 0.001, and tolerance still at 0.01:
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.513 4.913 5.036 5.018 5.157 5.393
Why are you interested in this though?
Here is some other code to generate data for a multiple linear regression with normally distributed errors:
Oops, sorry, this code just produces a multiple regression; it does not target a specific R-squared.
sim.regression <- function(n.obs = 10, coefficients = runif(10, -5, 5), s.deviation = .1){
  n.var <- length(coefficients)
  M <- matrix(0, ncol = n.var, nrow = n.obs)
  beta <- as.matrix(coefficients)
  for (i in 1:n.var){
    M[,i] <- rnorm(n.obs, 0, 1)
  }
  y <- M %*% beta + rnorm(n.obs, 0, s.deviation)
  return(list(x = M, y = y, coeff = coefficients))
}
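A possible usage example (one way to sanity-check the generator, not part of the code above): fit lm() to the simulated data and compare against the true coefficients. Since x has no intercept column, the true intercept is 0.
set.seed(1)
sim <- sim.regression(n.obs = 500, coefficients = c(5, 3, -2), s.deviation = 1)
fit <- lm(sim$y ~ sim$x)
cbind(true = c(0, sim$coeff), estimated = coef(fit))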

profile confidence intervals in R: mle2

I am trying to use the command mle2, in the package bbmle. I am looking at p2 of "Maximum likelihood estimation and analysis with the bbmle package" by Bolker. Somehow I fail to enter the right start values. Here's the reproducible code:
l.lik.probit <- function(par, ivs, dv){
  Y <- as.matrix(dv)
  X <- as.matrix(ivs)
  K <- ncol(X)
  b <- as.matrix(par[1:K])
  phi <- pnorm(X %*% b)
  sum(Y * log(phi) + (1 - Y) * log(1 - phi))
}
n=200
set.seed(1000)
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
x4 <- rnorm(n)
latentz<- 1 + 2.0 * x1 + 3.0 * x2 + 5.0 * x3 + 8.0 * x4 + rnorm(n,0,5)
y <- latentz
y[latentz < 1] <- 0
y[latentz >=1] <- 1
x <- cbind(1,x1,x2,x3,x4)
values.start <-c(1,1,1,1,1)
foo2<-mle2(l.lik.probit, start=list(dv=0,ivs=values.start),method="BFGS",optimizer="optim", data=list(Y=y,X=x))
And this is the error I get:
Error in mle2(l.lik.probit, start = list(Y = 0, X = values.start), method = "BFGS", :
some named arguments in 'start' are not arguments to the specified log-likelihood function
Any idea why? Thanks for your help!
You've missed a couple of things, but the most important is that by default mle2 takes a list of parameters; you can make it take a parameter vector instead, but you have to work a little bit harder.
I have tweaked the code slightly in places. (I changed the log-likelihood function to a negative log-likelihood function, without which this would never work!)
l.lik.probit <- function(par, ivs, dv){
  K <- ncol(ivs)
  b <- as.matrix(par[1:K])
  phi <- pnorm(ivs %*% b)
  -sum(dv * log(phi) + (1 - dv) * log(1 - phi))
}
n <- 200
set.seed(1000)
dat <- data.frame(x1 = rnorm(n),
                  x2 = rnorm(n),
                  x3 = rnorm(n),
                  x4 = rnorm(n))
beta <- c(1,2,3,5,8)
mm <- model.matrix(~x1+x2+x3+x4,data=dat)
latentz<- rnorm(n,mean=mm%*%beta,sd=5)
y <- latentz
y[latentz < 1] <- 0
y[latentz >=1] <- 1
x <- mm
values.start <- rep(1,5)
Now we do the fit. The main thing is to specify vecpar=TRUE and to use parnames to let mle2 know the names of the elements in the parameter vector ...
library("bbmle")
names(values.start) <- parnames(l.lik.probit) <- paste0("b",0:4)
m1 <- mle2(l.lik.probit, start=values.start,
vecpar=TRUE,
method="BFGS",optimizer="optim",
data=list(dv=y,ivs=x))
As pointed out above, for this particular example you have just re-implemented probit regression (although I understand that you now want to extend this to allow for heteroscedasticity in some way ...)
dat2 <- data.frame(dat,y)
m2 <- glm(y~x1+x2+x3+x4,family=binomial(link="probit"),
data=dat2)
As a final note, I would say that you should check out the parameters argument, which allows you to specify a sub-linear model for any one of the parameters, and the formula interface:
m3 <- mle2(y~dbinom(prob=pnorm(eta),size=1),
parameters=list(eta~x1+x2+x3+x4),
start=list(eta=0),
data=dat2)
PS confint(m1) appears to work fine (giving profile CIs as requested) with this set-up.
ae <- function(x,y) all.equal(unname(coef(x)),unname(coef(y)),tol=5e-5)
ae(m1,m2) && ae(m2,m3)

Resources