Running multiple OLS regressions with matrix algebra in R

I am having trouble running multiple regressions in R. I have a matrix of returns for 1794 assets, and each asset needs to be regressed individually against the SPX. I then need to roll these regressions through time, so I am trying to do this with matrix algebra to get the best speed possible.
I have code like the following:
for (i in 1:(nrow(fundReturns) - 59)){
tempMatrix <- fundReturns[i:(i+59),]
tempSPX <- spxReturns[i:(i+59)]
theSubset <- sapply(tempMatrix, function(x) sum(is.na(x))) == 0
tempMatrix <- tempMatrix[,theSubset]
theBeta <- solve(crossprod(tempMatrix), crossprod(tempMatrix, tempSPX))
theBetas[index(fundReturns)[(i + 59)], colnames(theBeta)] = theBeta
}
This breaks on the first iteration, at the 'theBeta <- ...' line. I am getting the following error:
Error in solve.default(t(tempMatrix) %*% tempMatrix) :
Lapack routine dgesv: system is exactly singular: U[273,273] = 0
I assume this means I have some singular matrix somewhere, which does not make sense to me. (If someone could explain what this error means exactly, that would be worth bonus points.) I do not see where the error could be coming from, and to boot, I attempted to check with the following code:
for (i in 1:1794){
tempMatrix2 <- tempMatrix[,i]
theBeta <- solve(t(tempMatrix2) %*% tempMatrix2) %*% (t(tempMatrix2) %*% tempSPX)
}
This code runs fine, and produces a beta for each individual column. I assume there is something I am missing in my code, or have some calculation backwards, but I cannot find it. Please SO gods, help me.
If it helps, here are the dimensions:
> dim(tempSPX)
[1] 60 1
> dim(tempMatrix)
[1] 60 1794
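On the "bonus points" part, a hedged sketch rather than a full answer: crossprod(tempMatrix) is a 1794 x 1794 matrix built from only 60 rows, so its rank is at most 60 and it cannot be inverted, which is what "exactly singular" reports. Solving that one big system would also be a single regression on all assets at once, not 1794 separate ones. Assuming the goal is the same no-intercept, one-column-at-a-time beta as in the checking loop, it can be computed for every column without solve(). The data below are simulated, and the names X, y, betas_vec and betas_loop are made up for illustration.

```r
set.seed(1)
X <- matrix(rnorm(60 * 100), nrow = 60)  # 60 observations, 100 assets (simulated)
y <- rnorm(60)                           # stand-in for the SPX window

# crossprod(X) is 100 x 100 but built from 60 rows, so it is rank deficient
rank_xtx <- qr(crossprod(X))$rank

# One no-intercept regression per column:
# beta_j = sum(x_j * y) / sum(x_j^2)
betas_vec <- as.vector(crossprod(X, y)) / colSums(X^2)

# The same betas computed one column at a time, as in the checking loop
betas_loop <- sapply(seq_len(ncol(X)), function(j) {
  xj <- X[, j]
  drop(solve(crossprod(xj), crossprod(xj, y)))
})
```

Both versions give the same betas; the vectorised one avoids both the loop and the singular matrix.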

Related

How to create multiple matrices based on a formula using two data frames and sum those matrices up in one go?

I'm fairly new to R and am thus not that knowledgeable yet about its different functionalities. I'm wondering if there is a more efficient way to replicate the following other than writing and running 230 lines of code.
I have two matrices, Z and E, which contain continuous numerical data and have the dimensions 7x229 and 17x229 respectively. For each column (so 229 times) I want to create a new 119x119 matrix by using the (repeated) formula below
ZZEE1 <- kronecker((Z[,1] %*% t(Z[,1])), (E[,1] %*% t(E[,1])))
ZZEE2 <- kronecker((Z[,2] %*% t(Z[,2])), (E[,2] %*% t(E[,2])))
ZZEE3 <- kronecker((Z[,3] %*% t(Z[,3])), (E[,3] %*% t(E[,3])))
ZZEE4 <- kronecker((Z[,4] %*% t(Z[,4])), (E[,4] %*% t(E[,4])))
#...
ZZEE228 <- kronecker((Z[,228] %*% t(Z[,228])), (E[,228] %*% t(E[,228])))
ZZEE229 <- kronecker((Z[,229] %*% t(Z[,229])), (E[,229] %*% t(E[,229])))
After this is done, I want to add all 229 matrices up into one matrix like this (not complete)
Sum_ZZEE <- ZZEE1 + ZZEE2 + ZZEE3 + ZZEE4 + ZZEE228 + ZZEE229 #Sum of all matrices from ZZEE1 to ZZEE229
Is there a quicker fix out there that will do exactly this? I have tried to find an answer online but did not find something that worked or something that I understood to the extent that I could modify it to my own data/code. As far as I understood it, there might be a fix with the function() function, but I would not know how to code it correctly. Getting the 'Sum_ZZEE' matrix is the final goal, I do not necessarily need the individual matrices stored in the workspace. Much obliged!
First construct a list of matrices: the following two code chunks are equivalent, use whichever is clearer to you.
ZZ_list <- lapply(1:229,
function(i) kronecker((Z[,i] %*% t(Z[,i])), (E[,i] %*% t(E[,i])))
)
or
ZZ_list <- list()
for (i in 1:229) {
ZZ_list[[i]] <- kronecker((Z[,i] %*% t(Z[,i])), (E[,i] %*% t(E[,i])))
}
Then use Reduce() (unfortunately sum() doesn't work the way you want):
answer <- Reduce("+", ZZ_list)
There might be some super-clever answer that works in pure linear algebra (e.g. with stacking/unstacking operators) ...
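One candidate for the pure linear algebra route, offered as a sketch: by the mixed-product property of the Kronecker product, kronecker(z %*% t(z), e %*% t(e)) equals k %*% t(k) with k = kronecker(z, e). The whole sum is then tcrossprod(K), where column i of K is kronecker(Z[,i], E[,i]). The data below are random stand-ins with the dimensions from the question, and K, sum_reduce and sum_tcross are illustrative names.

```r
set.seed(1)
Z <- matrix(rnorm(7 * 229), nrow = 7)    # stand-in for the 7 x 229 matrix
E <- matrix(rnorm(17 * 229), nrow = 17)  # stand-in for the 17 x 229 matrix

# Reference: build each 119 x 119 matrix and add them up
sum_reduce <- Reduce("+", lapply(seq_len(ncol(Z)), function(i) {
  kronecker(Z[, i] %*% t(Z[, i]), E[, i] %*% t(E[, i]))
}))

# Column i of K is the Kronecker product of the i-th columns (length 119)
K <- sapply(seq_len(ncol(Z)), function(i) as.vector(kronecker(Z[, i], E[, i])))

# K %*% t(K) sums the 229 outer products in one step
sum_tcross <- tcrossprod(K)
```

This never stores the 229 intermediate matrices, only the 119 x 229 matrix K.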

R: loop won't run due to a seq.default error

I am trying to calculate the area under the curve for every 10ms of a short piece of EEG wave. To first practice this I made a small dataset to run the auc (from package {flux}) function on.
x <- seq(1:10)
y <- c(0:4,5:1)
df <- data.frame(x,y)
attach(df)
plot(x,y)
for (i in 1:10){
x1 <- c(i,(i+1))
y1 <- c(subset(y, x == i),subset(y, x == (i+1)))
auc(x1,y1,thresh = 0)
rm(y1,x1,i)
}
The loop should subset two data points from each variable and then run auc over those data points. However, when running the loop, I get this error:
Error in seq.default(x[1], x[2], length.out = dens) : 'to' must be a finite number
When I run the subset and auc code outside of the loop, it works no problem. Can anyone tell me what's going wrong in the loop?
Thanks for updating the question. The for loop itself is not the problem. The error is thrown precisely when i=10, because your x-coordinate and y-coordinate vectors then have different lengths: x1 is c(10,11), but y1 contains only the single value 1, since there is no point at x=11.
Just stop the loop one step early, i.e. run it over 1:9 instead of 1:10.
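A minimal sketch of the shortened loop, under a few assumptions: it indexes the vectors directly instead of using attach() and subset(), it stores each result, and it uses a hand-written trapezoid area as a stand-in for flux::auc() so that it runs without the package. The names n_seg and areas are made up for illustration.

```r
x <- 1:10
y <- c(0:4, 5:1)

n_seg <- length(x) - 1   # 10 points give only 9 segments
areas <- numeric(n_seg)

for (i in seq_len(n_seg)) {
  x1 <- x[i:(i + 1)]
  y1 <- y[i:(i + 1)]
  # Trapezoid area of one segment; stand-in for flux::auc(x1, y1)
  areas[i] <- (x1[2] - x1[1]) * mean(y1)
}

total_area <- sum(areas)
```

Because the loop runs to length(x) - 1, x1 and y1 always have two elements each.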

Error in Gradient Descent Calculation

I tried to write a function to calculate gradient descent for a linear regression model. However, the answers I am getting do not match the answers I get using the normal equation method.
My sample data is:
df <- data.frame(c(1,5,6),c(3,5,6),c(4,6,8))
with c(4,6,8) being the y values.
lm_gradient_descent <- function(df,learning_rate, y_col=length(df),scale=TRUE){
n_features <- length(df) #n_features is the number of features in the data set
#using mean normalization to scale features
if(scale==TRUE){
for (i in 1:(n_features)){
df[,i] <- (df[,i]-mean(df[,i]))/sd(df[,i])
}
}
y_data <- df[,y_col]
df[,y_col] <- NULL
par <- rep(1,n_features)
df <- merge(1,df)
data_mat <- data.matrix(df)
#we need a temp_arr to store each iteration of parameter values so that we can do a
#simultaneous update
temp_arr <- rep(0,n_features)
diff <- 1
while(diff>0.0000001){
for (i in 1:(n_features)){
temp_arr[i] <- par[i]-learning_rate*sum((data_mat%*%par-y_data)*df[,i])/length(y_data)
}
diff <- par[1]-temp_arr[1]
print(diff)
par <- temp_arr
}
return(par)
}
Running this function,
lm_gradient_descent(df,0.0001,,0)
the results I got were
c(0.9165891,0.6115482,0.5652970)
when I use the normal equation method, I get
c(2,1,0).
Hope someone can shed some light on where I went wrong in this function.
You used the stopping criterion
old parameters - new parameters <= 0.0000001
First of all I think there's an abs() missing if you want to use this criterion (though my ignorance of R may be at fault).
But even if you use
abs(old parameters - new parameters) <= 0.0000001
this is not a good stopping criterion: it only tells you that progress has slowed down, not that it's already sufficiently accurate. Try instead simply to iterate for a fixed number of iterations. Unfortunately it's not that easy to give a good, generally applicable stopping criterion for gradient descent here.
It seems that you have not implemented a bias term. In a linear model like this, you always want to have an additional additive constant, i.e., your model should be like
w_0 + w_1*x_1 + ... + w_n*x_n.
Without the w_0 term, you usually won't get a good fit.
I know this is a couple of weeks old at this point, but I'm going to take a stab at it for several reasons, namely:
I'm relatively new to R, so deciphering your code and rewriting it is good practice for me.
I'm working on a different gradient descent problem, so this is all fresh to me.
I need the Stack Overflow points.
As far as I can tell, you never got a working answer.
First, regarding your data structures. You start with a dataframe, rename a column, strip out a vector, then strip out a matrix. It would be a lot easier to just start with an X matrix (capitalized since its component 'features' are referred to as x_i) and a y solution vector.
X <- cbind(c(1,5,6),c(3,5,6))
y <- c(4,6,8)
We can easily see what the desired solutions are, with and without scaling by fitting a linear fit model. (NOTE We only scale X/features and not y/solutions)
> lm(y~X)
Call:
lm(formula = y ~ X)
Coefficients:
(Intercept) X1 X2
-4 -1 3
> lm(y~scale(X))
Call:
lm(formula = y ~ scale(X))
Coefficients:
(Intercept) scale(X)1 scale(X)2
6.000 -2.646 4.583
With regards to your code, one of the beauties of R is that it can perform matrix multiplication which is significantly faster than using loops.
lm_gradient_descent <- function(X, y, learning_rate, scale=TRUE){
  if(scale==TRUE){X <- scale(X)}
  X <- cbind(1, X)
  theta <- rep(0, ncol(X)) #your old temp_arr
  diff <- 1
  old.error <- sum( (X %*% theta - y)^2 ) / (2*length(y))
  while(diff>0.000000001){
    theta <- theta - learning_rate * t(X) %*% (X %*% theta - y) / length(y)
    new.error <- sum( (X %*% theta - y)^2 ) / (2*length(y))
    diff <- abs(old.error - new.error)
    old.error <- new.error
  }
  return(theta)
}
And to show it works...
> lm_gradient_descent(X, y, .01, 0)
[,1]
[1,] -3.9360685
[2,] -0.9851775
[3,] 2.9736566
vs expected of (-4, -1, 3)
For what it's worth, while I agree with @cfh that I would prefer a loop with a defined number of iterations, I'm actually not sure you need the abs function. If diff < 0, then your function is not converging.
Finally, rather than using something like old.error and new.error, I'd suggest using a vector that records all errors. You can then plot that vector to see how quickly your function converges.
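A sketch of that last suggestion, combined with a fixed number of iterations: the update rule is the same as in the function above, but every error is stored in a vector. The function name gd_trace, the argument n_iter and its default are assumptions made for illustration, and scaling is left out to keep it short.

```r
gd_trace <- function(X, y, learning_rate, n_iter = 1000) {
  X <- cbind(1, X)                 # add the intercept column
  theta <- rep(0, ncol(X))
  errors <- numeric(n_iter)        # one error value per iteration
  for (k in seq_len(n_iter)) {
    resid <- X %*% theta - y
    theta <- theta - learning_rate * crossprod(X, resid) / length(y)
    errors[k] <- sum((X %*% theta - y)^2) / (2 * length(y))
  }
  list(theta = drop(theta), errors = errors)
}

X <- cbind(c(1, 5, 6), c(3, 5, 6))
y <- c(4, 6, 8)
fit <- gd_trace(X, y, learning_rate = 0.01, n_iter = 5000)
```

Calling plot(fit$errors, type = "l") then shows how quickly the error falls, and an error curve that rises instead would point to a learning rate that is too large.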

A function for calculating the eigenvalues of a matrix in R

I want to write a function like eigen() to calculate the eigenvalues and eigenvectors of an arbitrary matrix. I wrote the following code to calculate the eigenvalues, and I need a function or method to solve the resulting equation.
eig <- function(x){
if(nrow(x)!=ncol(x)) stop("dimension error")
ff <- function(lambda){
for(i in 1:nrow(x)) x[i,i] <- x[i,i] - lambda
}
det(x)
}
I need to solve det(x)=0, which is a polynomial equation in lambda, to find the values of lambda. Is there any way to do this?
Here is one solution using uniroot.all:
library(rootSolve)
myeig <- function(mat){
  myeig1 <- function(lambda) {
    y = mat
    diag(y) = diag(mat) - lambda
    return(det(y))
  }
  myeig2 <- function(lambda){
    sapply(lambda, myeig1)
  }
  uniroot.all(myeig2, c(-10, 10))
}
R > x <- matrix(rnorm(9), 3)
R > eigen(x)$values
[1] -1.77461906 -1.21589769 -0.01010515
R > myeig(x)
[1] -1.77462211 -1.21589767 -0.01009019
Computing the determinant is a bad idea because it is not numerically stable. You can easily get Inf etc. even for a moderately big matrix. I suggest reading the following answers (do read them, otherwise you will have no idea what my code is doing):
Are eigenvectors returned by R function eigen() wrong?
eigenvectors when A-lx is singular with no solution
then use either of the following
NullSpace(A - diag(lambda, nrow(A)))
nullspace(A - diag(lambda, nrow(A)))
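For a known eigenvalue lambda, an eigenvector spans the null space of A - lambda*I. The following is a base-R sketch of that idea, not the NullSpace()/nullspace() helpers themselves: it assumes lambda is a simple (non-repeated) eigenvalue and takes the right singular vector that belongs to the smallest singular value. The matrix A and the names s, v and residual are made up for illustration.

```r
A <- rbind(c(3, 1, 1),
           c(1, 3, 1),
           c(1, 1, 3))
lambda <- 5                          # a known, simple eigenvalue of A

# A - lambda * I is singular; svd() sorts singular values in decreasing order
s <- svd(A - diag(lambda, nrow(A)))

# The last right singular vector spans the null space (unit length, sign arbitrary)
v <- s$v[, ncol(s$v)]

# Check the eigen equation A v = lambda v
residual <- max(abs(A %*% v - lambda * v))
```

For a repeated eigenvalue the null space has more than one dimension, so more than one column of s$v would be needed.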
The solution from @liuminzhao won't work if an eigenvalue is repeated twice. The function will fail to find that root because the characteristic polynomial does not change sign there (it touches zero without crossing it), and a sign change is what rootSolve::uniroot.all() looks for. So you would need another way to locate it, such as searching for a local extremum (e.g. with optim()). Moreover, it will fail to determine the multiplicity of repeated eigenvalues.
A better way is to find the characteristic polynomial, which is easily done with pracma::charpoly(), and then use polyroot().
par <- pracma::charpoly(M) # find parameters of the CP of matrix M
par <- par[length(par):1] # reverse order for polyroot()
roots <- Re(polyroot(par)) # keep real part of the polyroot()
pracma::charpoly() is not too complicated in itself; see its source code, starting at the line a1 <- a.
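For readers without pracma installed, here is a base-R sketch of the same route. It computes the characteristic polynomial coefficients with the Faddeev-LeVerrier recursion, which is a swapped-in technique and not a copy of pracma's code, and then passes them to polyroot(). The function name char_poly and the test matrix M are made up for illustration.

```r
# Coefficients of det(lambda * I - M), highest power first
char_poly <- function(M) {
  n <- nrow(M)
  coefs <- numeric(n + 1)
  coefs[1] <- 1
  B <- diag(n)
  for (k in seq_len(n)) {
    AB <- M %*% B
    coefs[k + 1] <- -sum(diag(AB)) / k   # minus the trace, divided by k
    B <- AB + coefs[k + 1] * diag(n)
  }
  coefs
}

M <- rbind(c(3, 1, 1),
           c(1, 3, 1),
           c(1, 1, 3))               # eigenvalues 2, 2 and 5

cp <- char_poly(M)                   # lambda^3 - 9 lambda^2 + 24 lambda - 20
roots <- sort(Re(polyroot(rev(cp)))) # polyroot() wants increasing powers
```

The repeated eigenvalue 2 shows up twice in roots, although polyroot() typically resolves repeated roots less accurately than simple ones.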

Problems with Newton's Method for finding coefficient and Hessian

I am trying to write a function that uses Newton's method (coefficients+(inverse hessian)*gradient) to iteratively find the coefficients for a loglinear model.
I am using the following code:
##reading in the data
dat<-read.csv('hw8.csv')
summary(dat)
# data file containing yi and xi
attach(dat)
##Creating column of x's
x<-cbind(1,xi)
mle<-function(c){
gi<- 1-yi*exp(c[1]+c[2]*xi)
hi<- gi-1
H<- -1*(t(x)%*%hi%*%x)
g<-t(x)%*%gi
c<-c+solve(H)%*%g
return(c)
}
optim(c(0,1),mle,hessian=TRUE)
When I run the code, I get the following error:
Error in t(x) %*% hi %*% x : non-conformable arguments
RMate stopped at line 29
Given that the formula is drawn from Bill Greene's problem set, I don't think it is a formula problem. I think I am doing something wrong in passing my function.
How can I fix this?
Any help with this function would be much appreciated.
As Jonathan said in the comments, you need proper dimensions:
R> X <- matrix(1:4, ncol=2)
R> t(X) %*% X
[,1] [,2]
[1,] 5 11
[2,] 11 25
R>
But you should also use the proper tools, so maybe look at the loglin function in the stats package and/or the loglm function in the MASS package. Both will be installed by default with your R installation.
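To make the dimension problem concrete, here is a sketch on simulated data: with x an n x 2 matrix and hi a plain vector of length n, t(x) %*% hi is already 2 x 1, so multiplying it by x fails. Assuming the intended quantity is the weighted cross-product t(x) %*% diag(hi) %*% x, it can be written either with diag() or by scaling the rows of x. The data and the names H_diag, H_fast and fails are made up for illustration.

```r
set.seed(42)
n <- 5
x <- cbind(1, rnorm(n))   # n x 2 design matrix with an intercept column
hi <- rnorm(n)            # one weight per observation

# The expression from the question: (2 x n) %*% (n x 1) %*% (n x 2) does not conform
fails <- inherits(try(t(x) %*% hi %*% x, silent = TRUE), "try-error")

# Conformable version: put the weights on the diagonal of an n x n matrix
H_diag <- t(x) %*% diag(hi) %*% x

# Same result without building the n x n matrix: hi * x scales row i by hi[i]
H_fast <- crossprod(x, hi * x)
```

Both H_diag and H_fast are 2 x 2, which is the shape solve() needs in the Newton step.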
