Many linear regressions - r

As part of my data analysis (on time series), I am checking for correlation between log-returns and realized volatility.
My data consists of time series spanning several years for around a hundred different companies (large zoo object, ~2 MB file size). To check for the above-mentioned correlation, I have used the following code to calculate several rolling variances (a.k.a. realized volatility):
rollvar5 <- sapply(returns, rollVar, n=5, na.rm=TRUE)
rollvar10 <- sapply(returns, rollVar, n=10, na.rm=TRUE)
using the simple fTrading function rollVar. I have then converted the rolling variances to zoo objects and added the date index (by exporting the results to csv files, manually adding the date, and then using read.zoo; not very sophisticated, but it works just fine).
Now I wish to create around 100 linear regression models, each linking the log-returns of a company to the realized volatility of that same company. On an individual basis, this would look like the following:
lm_rollvar5 <- lm(returns[5:1000,1] ~ rollvar5[5:1000,1])
lm_rollvar10 <- lm(returns[10:1000,1] ~ rollvar10[10:1000,1])
This works without problems.
Now I wish to extend this to automatically create the linear regression models for all 100 companies. What I've tried was a simple for-loop:
NC <- ncol(returns)
for(i in 1:NC){
  lm_rollvar5 <- lm(returns[5:1000,i] ~ rollvar5[5:1000,i])
  summary(lm_rollvar5)
  lm_rollvar10 <- lm(returns[10:1000,i] ~ rollvar10[10:1000,i])
  summary(lm_rollvar10)
}
Is there any way I could optimize my approach (i.e. how could I save all regression results in a simple way)? As it stands, the for-loop just outputs hundreds of regression results, which makes analyzing them quite inefficient.
I also tried to use the apply function but I am unsure how to use it in this case, since there are several time series objects (the returns and the rolling variances are saved in different objects, as you can see).

As to your question how you could save all regression results in a simple way, this is a bit difficult to answer given that we don't know what you need to do, and what you consider "simple". However, you could define a list outside the loop and store each regression model in this list so that you can access the models without refitting them later. Try e.g.
NC <- ncol(returns)
lm_rollvar5 <- vector(mode="list", length=NC)
lm_rollvar10 <- vector(mode="list", length=NC)
for(i in seq_len(NC)){
  # store the fitted model for firm i at position i
  lm_rollvar5[[i]] <- lm(returns[5:1000,i] ~ rollvar5[5:1000,i])
  lm_rollvar10[[i]] <- lm(returns[10:1000,i] ~ rollvar10[10:1000,i])
}
This gives you the fitted model for firm i at the i-th position in the list. In the same manner, you can also save the output of summary. Or you can do something like
my.summaries_5 <- lapply(lm_rollvar5, summary)
which gives you a list of summaries.
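If what you want in the end is one table rather than a list of model objects, the numbers of interest can be pulled out of the list in one step. The following is only a sketch on simulated data: ret and rv are made-up stand-ins for returns and rollvar5, and the three columns stand in for three firms.

```r
set.seed(42)
n_obs <- 250
n_firms <- 3

# Hypothetical stand-ins for rollvar5 and returns (one column per firm)
rv  <- matrix(rexp(n_obs * n_firms), nrow = n_obs)
ret <- 0.5 * rv + matrix(rnorm(n_obs * n_firms), nrow = n_obs)

# One model per firm, kept in a list
fits <- lapply(seq_len(n_firms), function(i) lm(ret[, i] ~ rv[, i]))

# Collect intercept, slope and R-squared into one row per firm
res <- t(sapply(fits, function(m) {
  c(intercept = unname(coef(m)[1]),
    slope     = unname(coef(m)[2]),
    r.squared = summary(m)$r.squared)
}))
res
```

The resulting matrix has one row per firm, so it can be sorted, filtered or written to a csv file like any other table.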

Related

Using Amelia and decision trees

I have a panel dataset (countries and years) with a lot of missing data, so I've decided to use multiple imputation. The goal is to see the relationship between the proportion of women in management (managerial_value) and total fatal workplace injuries (total_fatal).
From what I've read online, Amelia is the best option for panel data so I used that like so:
amelia_data <- amelia(spdata, ts = "year", cs = "country", polytime = 1,
intercs = FALSE)
where spdata is my original dataset.
This imputation process worked, but I'm unsure of how to proceed with forming decision trees using the imputed data (an object of class 'amelia').
I originally tried creating a function (amelia2df) to turn each of the 5 imputed datasets into a data frame:
amelia2df <- function(amelia_data, which_imp = 1) {
stopifnot(inherits(amelia_data, "amelia"), is.numeric(which_imp))
imps <- amelia_data$imputations[[which_imp]]
as.data.frame(imps)
}
one_amelia <- amelia2df(amelia_data, which_imp = 1)
two_amelia <- amelia2df(amelia_data, which_imp = 2)
three_amelia <- amelia2df(amelia_data, which_imp = 3)
four_amelia <- amelia2df(amelia_data, which_imp = 4)
five_amelia <- amelia2df(amelia_data, which_imp = 5)
where one_amelia is the data frame for the first imputed dataset, two_amelia is the second, and so on.
I then combined them using rbind():
total_amelia <- rbind(one_amelia, two_amelia, three_amelia, four_amelia, five_amelia)
And used the new combined dataset total_amelia to construct a decision tree:
set.seed(300)
tree_data <- total_amelia
I_index <- sample(1:nrow(tree_data), size = 0.75*nrow(tree_data), replace=FALSE)
I_train <- tree_data[I_index,]
I_test <- tree_data[-I_index,]
fatal_tree <- rpart(total_fatal ~ managerial_value, I_train)
rpart.plot(fatal_tree)
fatal_tree
This "works" as in it doesn't produce an error, but I'm not sure that it is appropriately using the imputed data.
I found a couple resources explaining how to apply least squares, logit, etc., but nothing about decision trees. I'm under the impression I'd need the 5 imputed datasets to be combined into one data frame, but I have not been able to find a way to do that.
I've also looked into Zelig and bind_rows but haven't found anything that returns one data frame that I can then use to form a decision tree.
Any help would be appreciated!
As already indicated by @Noah, you would set up the multiple imputation workflow differently than you currently do.
Multiple imputation is not really a tool to improve your results or to make them more correct.
It is a method that lets you quantify the uncertainty that the missing data introduces into your analysis.
All the different datasets created by multiple imputation are plausible imputations; because of the uncertainty, you don't know which one is correct.
You would therefore use multiple imputation in the following way:
Create your m imputed datasets
Build your trees on each imputed dataset separately
Do your analysis on each tree separately
In your final paper, you can now state how much uncertainty is caused by the missing values/imputation
This means you get e.g. 5 different analysis results for m = 5 imputed datasets. At first this looks confusing, but it enables you to give bounds between which the correct result probably lies. Or, if you get completely different results for each imputed dataset, you know there is too much uncertainty caused by the missing values to give reliable results.
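The steps above can be sketched roughly as follows. This is an illustration under assumptions, not a definitive workflow: imps is a simulated stand-in for amelia_data$imputations (a list of completed data frames), make_imp is a made-up helper that generates the fake data, and the first split point is just one possible summary to compare across trees.

```r
library(rpart)  # recommended package that ships with R

set.seed(300)
# Hypothetical stand-in for amelia_data$imputations: a list of 5 completed data frames
make_imp <- function() {
  d <- data.frame(managerial_value = runif(200, 0, 60))
  d$total_fatal <- 50 - 0.5 * d$managerial_value + rnorm(200, sd = 5)
  d
}
imps <- replicate(5, make_imp(), simplify = FALSE)

# One tree per imputed dataset, instead of rbind()-ing the imputations together
trees <- lapply(imps, function(d) rpart(total_fatal ~ managerial_value, data = d))

# Compare a summary across the trees, here the split point of the first split
first_split <- sapply(trees, function(tr) tr$splits[1, "index"])
range(first_split)
```

With the real data you would pass amelia_data$imputations to lapply() directly; how far the summaries spread across the trees is the uncertainty attributable to the imputation.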

How to use Amelia package to get a best time series model in R

I'm trying to handle the missing data in a data frame using multiple imputation, and my professor advised me to use the Amelia package. I can build the time series model, but when I try to use the lapply function to run the time series model repeatedly on each dataset, I get an error from the function in lapply.
My data frame has three variables: date, pm25, pm10. I can build an AR model for pm25.
And the imputation code is:
imp <- amelia(Exetertibble, m=50, ts = "date")
So I get 50 imputations, and the time series model would look like this:
model1 <- arima(imp$imputations$imp1$pm25, order = c(1,0,0))
Then I tried to use the lapply function:
extractcoefs <- lapply(imp$imputations, coef(model1))
There is an error saying that coef(model1) is not a function, character, or symbol.
My aim is to combine the 50 imputations and get the best estimate of the coefficients of the time series model, but I don't know how to write a correct function there.
I also tried:
extractcoefs <- lapply(imp$imputations, coef(arima(order=c(1,0,0))))
and:
extractcoefs <- lapply(imp$imputations, arima(order=c(1,0,0)$coef))
I am not sure what you are trying to do.
Look at this example for lapply:
x <- list(a = 1:10, beta = exp(-3:3), logic = c(TRUE,FALSE,FALSE,TRUE))
# compute the list mean for each list element
lapply(x, mean)
So you give lapply a list and it applies a function to each of the list elements. In this case the function is mean().
So for this example you will get the mean for a, beta and logic.
You are using lapply on imp$imputations.
You got imp from your call to the amelia() function, which gives you an instance of the S3 class "amelia". This instance includes several objects; one of these is the list imputations, whose elements are all the imputed datasets (in your case 50).
So lapply(imp$imputations, coef(model1)) will try to apply whatever is in the second argument to all imputed datasets. The only problem is that your second argument isn't a function: coef(model1) is evaluated first and yields a numeric vector. Also, you can't apply coef() to the imputed datasets. You must apply coef() to a model object, because it returns the coefficients of that model.
I guess you want to do the following:
Generate your m=50 imputed datasets
Build an arima model for each dataset
Get the coefficients for each of these models
You could just use a for loop through the m=50 datasets for this.
Take this as an example:
library(Amelia)
data(africa)
imp <- amelia(x = africa, cs = "country", ts = "year", logs = "gdp_pc", m = 5)
for (i in seq_along(imp$imputations))
{
  # arima() with its default order c(0, 0, 0) only fits a mean; set order = as needed
  model <- arima(imp$imputations[[i]]$gdp_pc)
  coe <- coef(model)
  print(coe)
}
This would give you 5 results of coef here, one for each arima model built on the m = 5 imputed datasets of the example; with your m = 50 you would get 50.
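The same thing can be written with lapply(), as long as its second argument is a function that takes one imputed dataset. A minimal sketch, assuming simulated AR(1) data: imps is a made-up stand-in for imp$imputations, and the column name pm25 is taken from the question.

```r
set.seed(1)
# Hypothetical stand-in for imp$imputations: a list of completed data frames
imps <- replicate(
  5,
  data.frame(pm25 = 10 + as.numeric(arima.sim(list(ar = 0.6), n = 300))),
  simplify = FALSE
)

# lapply() needs a function; it receives one imputed dataset at a time
extractcoefs <- lapply(imps, function(d) coef(arima(d$pm25, order = c(1, 0, 0))))

# Stack the coefficient vectors (one row per imputation) and average them
coef_mat <- do.call(rbind, extractcoefs)
colMeans(coef_mat)
```

Averaging the estimates across imputations gives the usual pooled point estimate; a pooled standard error would additionally need the within-imputation and between-imputation variances, which this sketch does not compute.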

Running 1000 simulations, and storing the output from LASSO

I'm running LASSO using the glmnet package using the following commands:
x_ss <- model.matrix(y_variable ~ X_variables, data = data)
y_ss <- c(y_variable)
cv.output_ss <- cv.glmnet(x_ss,y_ss, alpha=1, family="gaussian", type.measure="mse")
lambda.min_ss <- cv.output_ss$lambda.min
coef(cv.output_ss,s=lambda.min_ss)
From my understanding of LASSO regression, the estimates generated vary slightly every time I run it. As such, I am thinking of running maybe 1000 simulations and collecting the value of the estimates for my X-variable in question, so that I can report more meaningful stuff, like the mean and variance. Is there any way I can run this multiple times and 'save' the output so that I can get the mean and variance of the estimates?
Naturally, you can use sapply, lapply or even replicate.
E.g.
xy <- replicate(1000, {
# ...
coef(...)
}, simplify = FALSE)
would run the same code 1000 times and output the result of coef in a list. After the function has finished, you can manipulate xy in any way you want, e.g. extract desired statistics, bind it into a data.frame or a matrix and report on means, variances, distributions...
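To make the pattern concrete without depending on glmnet, here is a sketch in which a bootstrap lm() fit stands in for the cv.glmnet() and coef() step. Everything in it (dat, one_run, the simulated coefficients) is made up for illustration; in the real case the body of one_run would be your own LASSO code.

```r
set.seed(1)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 + rnorm(n)

# Stand-in for the cv.glmnet() + coef() step: any code whose result
# varies from run to run (here: a fit on a bootstrap resample)
one_run <- function() {
  idx <- sample(n, replace = TRUE)
  coef(lm(y ~ x1 + x2, data = dat[idx, ]))
}

runs <- replicate(1000, one_run(), simplify = FALSE)

# One row per run, one column per coefficient
est <- do.call(rbind, runs)
colMeans(est)        # mean of each estimate across runs
apply(est, 2, var)   # variance of each estimate across runs
```

Keeping simplify = FALSE and binding afterwards makes the shape of the result explicit; if coef() returns a sparse matrix, as it does for glmnet fits, convert each result with as.matrix() before binding.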

lmList diagnostic plots - is it possible to subset data during a procedure or do data frames have to be subset and then passed in?

I am new to R and am trying to produce a vast number of diagnostic plots for linear models for a huge data set.
I discovered the lmList function from the nlme package.
This works a treat but what I now need is a means of passing in a fraction of this data into the plot function so that the resulting plots are not minute and unreadable.
In the example below 27 plots are nicely displayed. I want to produce diagnostics for much more data.
Is it necessary to subset the data first? (presumably with loops) or is it possible to subset within the plotting function (presumably with some kind of loop) rather than create 270 data frames and pass them all in separately?
I'm sorry to say that my R is so basic that I do not even know how to pass variables into names and values together in for loops (I tried using the paste function but it failed).
The data and function for the example are below. I would be picking values of Subject by their row numbers within the data frame. I grant that the 27 plots here show nicely, but for the sake of example it would be nice to split them into, say, 3 sets of 9.
fm1 <- lmList(distance ~ age | Subject, Orthodont)
# observed versus fitted values by Subject
plot(fm1, distance ~ fitted(.) | Subject, abline = c(0,1))
Examples from:
https://stat.ethz.ch/R-manual/R-devel/library/nlme/html/plot.lmList.html
I would be most grateful for help and hope that my question isn't insulting to anyone's intelligence or otherwise annoying.
I can't see how to pass a subset to the plot.lmList function. But, here is a way to do it using standard split-apply-combine strategy. Here, the Subjects are just split into three arbitrary groups of 9, and lmList is applied to each group.
library(nlme)
## Make 3 lmLists (27 subjects split into 3 groups of 9)
fits <- lapply(split(unique(Orthodont$Subject), rep(1:3, each=9)), function(x) {
  eval(substitute(
    lmList(distance ~ age | Subject,                    # fit the data to subset
           data=Orthodont[Orthodont$Subject %in% x,]),  # use the subset
    list(x=x)))   # substitute the actual x-values so the proper call gets stored
})
## Make plots, one graphics device per group
for (i in seq_along(fits)) {
  dev.new()
  print(plot(fits[[i]], distance ~ fitted(.) | Subject, abline = c(0,1)))
}

What is the best way to run a loop of regressions in R?

Assume that I have sources of data X and Y that are indexable, say matrices. And I want to run a set of independent regressions and store the result. My initial approach would be
results = matrix(nrow=nrow(X), ncol=2)
for(i in 1:nrow(X)) {
  results[i,] = coefficients(lm(Y[i,] ~ X[i,]))
}
But, loops are bad, so I could do it with lapply as
out <- lapply(1:nrow(X), function(i) { coefficients(lm(Y[i,] ~ X[i,])) } )
Is there a better way to do this?
You are certainly overoptimizing here. The overhead of a loop is negligible compared to the procedure of model fitting and therefore the simple answer is - use whatever way you find to be the most understandable. I'd go for the for-loop, but lapply is fine too.
I do this type of thing with plyr, but I agree that it's not a processing efficiency issue as much as what you are comfortable reading and writing.
If you just want to perform straightforward multiple linear regression, then I would recommend not using lm(). There is lsfit(), but I'm not sure it would offer that much of a speed-up (I have never performed a formal comparison). Instead I would recommend computing (X'X)^{-1}X'y using qr() and qr.coef(). This will allow you to perform multivariate multiple linear regression; that is, treating the response variable as a matrix instead of a vector and applying the same regression to each row of observations.
Z # design matrix
Y # matrix of observations (each row is a vector of observations)
## Estimation via multivariate multiple linear regression
beta <- qr.coef(qr(Z), Y)
## Fitted values
Yhat <- Z %*% beta
## Residuals
u <- Y - Yhat
In your example, is there a different design matrix per vector of observations? If so, you may be able to modify Z in order to still accommodate this.
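As a quick check of the qr.coef() approach, here is a small sketch on simulated data. Z, B and Y below are made up; the point is only that one qr.coef() call on a response matrix reproduces the coefficients of separate lm() fits when all responses share the same design matrix.

```r
set.seed(7)
n <- 100
Z <- cbind(1, rnorm(n))                    # design matrix: intercept + one regressor
B <- cbind(c(1, 2), c(-1, 0.5), c(0, 3))   # made-up true coefficients, one column per response
Y <- Z %*% B + matrix(rnorm(n * 3, sd = 0.1), nrow = n)

# One call fits all three responses: result is 2 x 3, one column per response
beta <- qr.coef(qr(Z), Y)

# Reference: fit each response column separately with lm()
beta_lm <- sapply(1:3, function(j) coef(lm(Y[, j] ~ Z[, 2])))

all.equal(unname(beta), unname(beta_lm))
```

Note that qr.coef() expects Y to have as many rows as Z, with each response in its own column. This shortcut relies on a shared design matrix, so it does not directly cover the case in the question where every regression has its own X.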
