Running 1000 simulations and storing the output from LASSO in R

I'm running LASSO with the glmnet package, using the following commands:
library(glmnet)
x_ss <- model.matrix(y_variable ~ X_variables, data = data)
y_ss <- c(y_variable)
cv.output_ss <- cv.glmnet(x_ss, y_ss, alpha = 1, family = "gaussian", type.measure = "mse")
lambda.min_ss <- cv.output_ss$lambda.min
coef(cv.output_ss, s = lambda.min_ss)
From my understanding, the estimates generated vary slightly every time I run this, because the cross-validation folds are assigned at random. As such, I am thinking of running 1000 simulations and collecting the value of the estimate for the X-variable in question, so that I can report something more meaningful, like its mean and variance. Is there any way I can run this multiple times and save the output so that I can get the mean and variance of the estimates?

Naturally, you can use sapply, lapply or even replicate.
E.g.
xy <- replicate(1000, {
# ...
coef(...)
}, simplify = FALSE)
would run the same code 1000 times and collect the result of coef in a list. After replicate has finished, you can manipulate xy any way you want, e.g. extract the desired statistics, bind it into a data.frame or a matrix, and report means, variances, distributions...
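For instance, here is a minimal sketch building on the code above (assuming x_ss and y_ss are already defined as in the question, and that the variable of interest is the first predictor, i.e. row 2 of the coefficient vector, row 1 being the intercept):
library(glmnet)
coef_draws <- replicate(1000, {
  # refit the cross-validated LASSO; the fold assignment is random,
  # so lambda.min (and hence the coefficients) changes between runs
  cv_fit <- cv.glmnet(x_ss, y_ss, alpha = 1, family = "gaussian",
                      type.measure = "mse")
  # coefficient of the variable of interest at lambda.min
  coef(cv_fit, s = "lambda.min")[2]
})
mean(coef_draws)
var(coef_draws)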

Related

How do I write a loop to run a regression for multiple periods?

So I have this dataset, and I need to run a regression for every 60 periods. I understand how to run the regression, and I have been trying to repeat it for every 60 periods and store the betas in a matrix, but I don't quite understand how to get R to move down through my data and run a new regression on a separate set of data points.
This is my current code. You can see that I attempted to engineer the dataset to get it going, but I know it's wrong; I just don't know how to proceed. Thanks.
data.dt <- read.csv("Assignment 1.CSV")
## First install the AER package (Packages, Install, AER) and call it up for use:
library(AER)
library(texreg)
## (2) Estimating CAPM Betas and Idiosyncratic Volatilities
CAPM <- matrix(nrow=100, ncol=2)
n <- 0
for (i in (1+60*n):(60+60*n)) {
  eqn1 <- summary(lm(Returns.SMALL.LoBM ~ Mkt.RF, data=data.dt))
  CAPM[i,1] <- eqn1$coefficients[2,1]
  CAPM[i,2] <- eqn1$coefficients[2,2]
  n <- n+1
}
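One way to get the loop to move through the data (a sketch, assuming data.dt has at least 100 consecutive blocks of 60 rows and contains the columns Returns.SMALL.LoBM and Mkt.RF) is to compute the row indices of each 60-period window inside the loop and subset the data accordingly:
n_windows <- 100
window_len <- 60
CAPM <- matrix(NA, nrow = n_windows, ncol = 2)
for (k in seq_len(n_windows)) {
  rows <- ((k - 1) * window_len + 1):(k * window_len)   # rows of the k-th 60-period block
  eqn1 <- summary(lm(Returns.SMALL.LoBM ~ Mkt.RF, data = data.dt[rows, ]))
  CAPM[k, 1] <- eqn1$coefficients[2, 1]   # beta estimate
  CAPM[k, 2] <- eqn1$coefficients[2, 2]   # standard error of beta
}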

Finding Mean Squared Error?

I have produced a linear data set and have used lm() to fit a model to it. I am now trying to find the MSE using mse().
I know the formula for MSE, but I'm trying to use this function. What would be the proper way to do so? I have looked at the documentation, but it seems to be worded for people who already know what they're doing.
library(hydroGOF)
x.linear <- seq(0, 200, by=1)                             # x data
error.linear <- rnorm(n=length(x.linear), mean=0, sd=1)   # error ~ N(0, 1)
y.linear <- x.linear + error.linear                       # y data
training.data <- data.frame(x.linear, y.linear)
training.model <- lm(y.linear ~ x.linear, data=training.data)
training.mse <- mse(training.model, training.data)        # problem line: hydroGOF::mse() expects two numeric vectors (sim, obs), not a model object
plot(training.data)
mse() needs two data frames. I'm not sure how to get a data frame out of lm(). Am I even on the right track to finding a proper MSE for my data?
Try this:
mean((training.data$y.linear - predict(training.model))^2)
#[1] 0.4467098
You can also use the Metrics package, which gives a very clean way to get the mean squared error:
install.packages("Metrics")
library(Metrics)
mse(actual, predicted)
The first argument is the actual (observed) values, here the response column of training.data. The second argument is the predicted values, which you get like this:
pd <- predict(training.model, training.data)
mse(training.data$y.linear, pd)
It seems you have not done the prediction yet, so first predict from your model and then calculate the MSE.
You can also use the residuals component of the lm model output to compute the MSE directly:
mse <- mean(training.model$residuals^2)
Note: if you come from another program (like SAS), the MSE there is obtained by dividing the residual sum of squares by the residual degrees of freedom rather than by n. I recommend doing the same if you want an unbiased estimate of the error variance:
mse <- sum(training.model$residuals^2)/training.model$df.residual
I found this while trying to figure out why mean(my_model$residuals^2) in R was different from the MSE in SAS.
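As a quick check, here is a minimal sketch putting the pieces together on the simulated data from the question (the seed is an arbitrary addition for reproducibility; variable names follow the question):
set.seed(42)                                     # arbitrary seed, just for reproducibility
x.linear <- seq(0, 200, by = 1)
y.linear <- x.linear + rnorm(length(x.linear))   # error ~ N(0, 1)
training.data <- data.frame(x.linear, y.linear)
training.model <- lm(y.linear ~ x.linear, data = training.data)
mean(training.model$residuals^2)                               # MSE with n in the denominator
sum(training.model$residuals^2) / training.model$df.residual   # SAS-style, n - p in the denominator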

Many linear regressions

As part of my data analysis (on time series), I am checking for correlation between log-returns and realized volatility.
My data consists of time series spanning several years for around hundred different companies (large zoo object, ~2 MB filesize). To check for the above-mentioned correlation, I have used the following code to calculate several rolling variances (a.k.a. realized volatility):
library(fTrading)
rollvar5 <- sapply(returns, rollVar, n=5, na.rm=TRUE)
rollvar10 <- sapply(returns, rollVar, n=10, na.rm=TRUE)
using the simple fTrading function rollVar. I have then converted the rolling variances to zoo objects and added the date index (by exporting the results to CSV files, manually adding the date, and then using read.zoo; not very sophisticated, but it works just fine).
Now I wish to create around 100 linear regression models, each linking the log-returns of a company to the realized volatility of that same company. On an individual basis, this would look like the following:
lm_rollvar5 <- lm(returns[5:1000, 1] ~ rollvar5[5:1000, 1])
lm_rollvar10 <- lm(returns[10:1000, 1] ~ rollvar10[10:1000, 1])
This works without problems.
Now I wish to extend this to automatically create the linear regression models for all 100 companies. What I've tried was a simple for-loop:
NC <- ncol(returns)
for(i in 1:NC){
  lm_rollvar5 <- lm(returns[5:1000, i] ~ rollvar5[5:1000, i])
  summary(lm_rollvar5)
  lm_rollvar10 <- lm(returns[10:1000, i] ~ rollvar10[10:1000, i])
  summary(lm_rollvar10)
}
Is there any way I could optimize my approach (i.e. how could I save all the regression results in a simple way)? At the moment the for-loop just produces hundreds of regression results, which makes analyzing them quite ineffective.
I also tried to use the apply function, but I am unsure how to use it in this case, since there are several time series objects (the returns and the rolling variances are saved in different objects, as you can see).
As to your question how you could save all regression results in a simple way, this is a bit difficult to answer given that we don't know what you need to do, and what you consider "simple". However, you could define a list outside the loop and store each regression model in this list so that you can access the models without refitting them later. Try e.g.
NC <- ncol(returns)
lm_rollvar5 <- vector(mode="list", length=NC)
lm_rollvar10 <- vector(mode="list", length=NC)
for(i in 1:NC){
  lm_rollvar5[[i]] <- lm(returns[5:1000, i] ~ rollvar5[5:1000, i])
  lm_rollvar10[[i]] <- lm(returns[10:1000, i] ~ rollvar10[10:1000, i])
}
This gives you the fitted model for firm i at the i-th position in the list. In the same manner, you can also save the output of summary. Or you can do something like
my.summaries_5 <- lapply(lm_rollvar5, summary)
which gives you a list of summaries.
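For example, once the lists are filled you can pull the slope coefficients out into plain vectors for further analysis (a sketch, assuming the slope is the second coefficient of each fit):
slopes_5  <- sapply(lm_rollvar5,  function(m) coef(m)[2])
slopes_10 <- sapply(lm_rollvar10, function(m) coef(m)[2])
summary(slopes_5)
summary(slopes_10)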

Using the Kolmogorov-Smirnov Test in R

I designed 3000 experiments, so that each experiment has 4 groups (treatments) with 50 individuals (subjects) per group. For each experiment I run a standard one-way ANOVA and check whether the resulting p-values follow a uniform distribution under the null hypothesis, but ks.test rejects this assumption and I can't see why.
library(mvtnorm)
subject <- 50
treatment <- 4
experiment <- list()
R <- 3000
seed <- split(1:(R*subject), 1:R)
for(i in 1:R){
  e <- c()
  for(j in 1:subject){
    set.seed(seed[[i]][j])
    e <- c(e, rmvnorm(mean=rep(0, treatment), sigma=diag(3, 4), n=1, method="chol"))
  }
  experiment <- c(experiment, list(matrix(e, subject, treatment, byrow=TRUE)))
}
p.values <- c()
for(e in experiment){
  d <- data.frame(response=c(e), treatment=factor(rep(1:treatment, each=subject)))
  p.values <- c(p.values, anova(lm(response ~ treatment, d))[1, "Pr(>F)"])
}
ks.test(p.values, punif, alternative="two.sided")
I commented out the lines in your code that change the random seed, and got a p-value of 0.34. That was with an unknown seed, so for reproducibility I did set.seed(1) and ran it again. This time, I got a p-value of 0.98.
As to why this makes a difference, I'm not an expert in PRNGs, but any decent generator will ensure that successive draws are statistically independent for all practical purposes. The best ones will ensure the same for greater lags; e.g. the Mersenne Twister, which is R's default PRNG, guarantees it for lags up to 623 (IIRC). In fact, meddling with the seed is likely to impair the statistical properties of the draws.
Your code is also doing things in a really inefficient way. You're creating a list for the experiments, and adding one item for each experiment. Within each experiment, you also create a matrix, and add a row for each observation. Then you do something very similar for the P-values. I'll see if I can fix that up.
This is how I'd replace your code. Strictly speaking I could make it even tighter, by avoiding formulas, creating the bare model matrix, and calling lm.fit directly. But that would mean having to manually code up the ANOVA test rather than simply calling anova, which is more trouble than it's worth.
set.seed(1)  # or any other number you like
x <- factor(rep(seq_len(treatment), each=subject))
p.values <- sapply(seq_len(R), function(r) {
  y <- rnorm(subject * treatment, sd=3)
  anova(lm(y ~ x))[1, "Pr(>F)"]
})
ks.test(p.values, punif, alternative="two.sided")
One-sample Kolmogorov-Smirnov test
data: p.values
D = 0.0121, p-value = 0.772
alternative hypothesis: two-sided

What is the best way to run a loop of regressions in R?

Assume that I have data sources X and Y that are indexable, say matrices, and I want to run a set of independent regressions and store the results. My initial approach would be:
results <- matrix(nrow=nrow(X), ncol=2)
for(i in 1:nrow(X)) {
  results[i, ] <- coefficients(lm(Y[i, ] ~ X[i, ]))
}
But, loops are bad, so I could do it with lapply as
out <- lapply(1:nrow(X), function(i) { coefficients(lm(Y[i,] ~ X[i,])) } )
Is there a better way to do this?
You are certainly overoptimizing here. The overhead of a loop is negligible compared to the model fitting itself, so the simple answer is: use whichever approach you find most understandable. I'd go for the for-loop, but lapply is fine too.
I do this type of thing with plyr, but I agree that it's not a processing-efficiency issue so much as a question of what you are comfortable reading and writing.
If you just want to perform straightforward multiple linear regression, then I would recommend not using lm(). There is lsfit(), but I'm not sure it would offer that much of a speed-up (I have never performed a formal comparison). Instead, I would recommend computing the least-squares solution (X'X)^{-1}X'y via qr() and qr.coef(). This also allows you to perform multivariate multiple linear regression, that is, treating the response as a matrix instead of a vector, with the same design matrix fitted to every column of responses.
Z # design matrix
Y # matrix of observations (each row is a vector of observations)
## Estimation via multivariate multiple linear regression
beta <- qr.coef(qr(Z), Y)
## Fitted values
Yhat <- Z %*% beta
## Residuals
u <- Y - Yhat
In your example, is there a different design matrix per vector of observations? If so, you may be able to modify Z in order to still accommodate this.
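As a quick illustration (a self-contained sketch with simulated data, not taken from the question), the QR-based estimates agree with fitting lm() column by column:
set.seed(1)
n <- 100
Z <- cbind(1, matrix(rnorm(n * 3), n, 3))    # design matrix with an intercept column
B <- matrix(rnorm(4 * 5), 4, 5)              # true coefficients for 5 responses
Y <- Z %*% B + matrix(rnorm(n * 5), n, 5)    # each column of Y is one response vector
beta <- qr.coef(qr(Z), Y)                    # all 5 regressions in one shot
all.equal(unname(beta), unname(coef(lm(Y ~ Z - 1))))   # should return TRUE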
