Out of memory when using `outer` to solve my big normal equation for least squares estimation

Consider the following example in R:
x1 <- rnorm(100000)
x2 <- rnorm(100000)
g <- cbind(x1, x2, x1^2, x2^2)
gg <- t(g) %*% g
gginv <- solve(gg)
bigmatrix <- outer(x1, x2, "<=")
Gw <- t(g) %*% bigmatrix
beta <- gginv %*% Gw
w1 <- bigmatrix - g %*% beta
If I try to run this on my computer, it throws a memory error (because bigmatrix is too big).
Do you know how I can achieve the same result without running into this problem?

This is a least squares problem with 100,000 responses. Your bigmatrix is the response (matrix), beta is the coefficient (matrix), while w1 is the residual (matrix).
bigmatrix and w1, if formed explicitly, will each cost
(100,000 * 100,000 * 8) / (1024 ^ 3) = 74.5 GB
This is far too large.
As the estimation for each response is independent, there is really no need to form bigmatrix in one go and store it in RAM. We can form it tile by tile and use an iterative procedure: form a tile, use the tile, then discard it. For example, the code below uses tiles of dimension 100,000 x 2,000, each with memory footprint:
(100,000 * 2,000 * 8) / (1024 ^ 3) = 1.5 GB
With such an iterative procedure, memory usage stays effectively under control.
x1 <- rnorm(100000)
x2 <- rnorm(100000)
g <- cbind(x1, x2, x1^2, x2^2)
gg <- crossprod(g) ## don't use `t(g) %*% g`
## we also don't explicitly form `gg` inverse
## initialize `beta` matrix (4 coefficients for each of 100,000 responses)
beta <- matrix(0, 4, 100000)
## we split 100,000 columns into 50 tiles, each with 2000 columns
for (i in 1:50) {
  start <- 2000 * (i - 1) + 1  ## chunk start
  end <- 2000 * i              ## chunk end
  bigmatrix <- outer(x1, x2[start:end], "<=")
  Gw <- crossprod(g, bigmatrix)  ## don't use `t(g) %*% bigmatrix`
  beta[, start:end] <- solve(gg, Gw)
}
Note: don't try to compute the residual matrix w1, as it would also cost 74.5 GB. If you need the residual matrix in later work, you should still break it into tiles and work on them one by one (see the sketch below).
You don't need to worry about the loop here. The computation inside each iteration is costly enough to amortize the looping overhead.
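For instance, here is a minimal sketch (my illustration, following the same tiling idea; not part of the original answer) of how residual tiles could be formed and consumed one at a time, reusing beta from the loop above:
for (i in 1:50) {
  start <- 2000 * (i - 1) + 1
  end <- 2000 * i
  bigmatrix <- outer(x1, x2[start:end], "<=")
  w1.tile <- bigmatrix - g %*% beta[, start:end]  ## 100,000 x 2,000 residual tile
  ## ... use `w1.tile` here (e.g. accumulate column sums of squares), then let it go
}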

Related

Estimating an OLS model in R with million observations and thousands of variables

I am trying to estimate a big OLS regression with ~1 million observations and ~50,000 variables using biglm.
I am planning to run each estimation using chunks of approximately 100 observations each. I tested this strategy with a small sample and it worked fine.
However, with the real data I am getting an "Error: protect(): protection stack overflow" when trying to define the formula for the biglm function.
I've already tried:
starting R with --max-ppsize=50000
setting options(expressions = 50000)
but the error persists.
I am working on Windows and using RStudio.
# create the sample data frame (In my true case, I simply select 100 lines from the original data that contains ~1,000,000 lines)
DF <- data.frame(matrix(nrow=100,ncol=50000))
DF[,] <- rnorm(100*50000)
colnames(DF) <- c("y", paste0("x", seq(1:49999)))
# get names of covariates
my_xvars <- colnames(DF)[2:( ncol(DF) )]
# define the formula to be used in biglm
# HERE IS WHERE I GET THE ERROR :
my_f <- as.formula(paste("y~", paste(my_xvars, collapse = " + ")))
EDIT 1:
The ultimate goal of my exercise is to estimate the average effect of all 50,000 variables. Therefore, simplifying the model by selecting fewer variables is not the solution I am looking for right now.
The first bottleneck (I can't guarantee there won't be others) is in the construction of the formula. R can't construct a formula that long from text (the details are too ugly to explore right now). Below I show a hacked version of the biglm code that can take the model matrix X and the response variable y directly, rather than using a formula to build them.
However, the next bottleneck is that the internal function biglm:::bigqr.init(), which gets called inside biglm, tries to allocate a numeric vector of size choose(nc, 2) = nc*(nc-1)/2, where nc is the number of columns. When I try this with 50,000 columns I get
Error: cannot allocate vector of size 9.3 Gb
(2.3 GB are required when nc is 25,000). The code below runs on my laptop when nc <- 10000.
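For reference, the quoted allocation sizes can be reproduced directly from that formula:
nc <- 50000; choose(nc, 2) * 8 / 2^30   ## ~9.3 GB of doubles
nc <- 25000; choose(nc, 2) * 8 / 2^30   ## ~2.3 GB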
I have a few caveats about this approach:
you won't be able to handle a problem with 50,000 columns unless you have at least 10 GB of memory, because of the issue described above.
the biglm:::update.biglm method will have to be modified in a parallel way (this shouldn't be too hard; see the sketch after this list)
I have no idea whether the p >> n issue (which applies at the level of fitting the initial chunk) will bite you. When running my example below (with 10 rows and 10,000 columns), all but 10 of the parameters are NA. I don't know whether these NA values will contaminate the results so that successive updating fails. If so, I don't know whether there's a way to work around the problem, or whether it's fundamental (so that you would need nr > nc for at least the initial fit). (It would be straightforward to do some small experiments to see if there is a problem, but I've already spent too long on this ...)
don't forget that with this approach you have to explicitly add an intercept column to the model matrix (e.g. X <- cbind(1, X)) if you want one.
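Here is a hypothetical sketch (mine, untested; my_update is a made-up name) of what such a chunk-wise update could look like, reusing the same internal QR update that my_biglm below calls:
my_update <- function(obj, X.new, y.new, w.new = NULL) {
  ## push one more chunk of rows through biglm's incremental QR
  ## (assumption: passing w.new = NULL mirrors the unweighted call in my_biglm)
  obj$qr <- biglm:::update.bigqr(obj$qr, X.new, y.new, w.new)
  obj$n <- obj$n + NROW(X.new)
  obj$df.resid <- obj$n - length(obj$qr$D)
  obj
}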
Example (first save the code at the bottom as my_biglm.R):
nr <- 10
nc <- 10000
DF <- data.frame(matrix(rnorm(nr*nc),nrow=nr))
respvars <- paste0("x", seq(nc-1))
names(DF) <- c("y", respvars)
# illustrate formula problem: fails somewhere in 15000 < nc < 20000
try(reformulate(respvars,response="y"))
source("my_biglm.R")
rr <- my_biglm(y=DF[,1],X=as.matrix(DF[,-1]))
my_biglm <- function (formula, data, weights = NULL, sandwich = FALSE,
                      y = NULL, X = NULL, off = 0) {
    if (!is.null(weights)) {
        if (!inherits(weights, "formula"))
            stop("`weights' must be a formula")
        w <- model.frame(weights, data)[[1]]
    } else w <- NULL
    if (is.null(X)) {
        tt <- terms(formula)
        mf <- model.frame(tt, data)
        if (is.null(off <- model.offset(mf)))
            off <- 0
        mm <- model.matrix(tt, mf)
        y <- model.response(mf) - off
    } else {
        ## model matrix specified directly
        if (is.null(y)) stop("both y and X must be specified")
        mm <- X
        tt <- NULL
    }
    ## these are internal (unexported) biglm functions, hence the `:::` prefix
    qr <- biglm:::bigqr.init(NCOL(mm))
    qr <- biglm:::update.bigqr(qr, mm, y, w)
    rval <- list(call = sys.call(), qr = qr, assign = attr(mm, "assign"),
                 terms = tt, n = NROW(mm), names = colnames(mm),
                 weights = weights)
    if (sandwich) {
        p <- ncol(mm)
        n <- nrow(mm)
        xyqr <- biglm:::bigqr.init(p * (p + 1))
        xx <- matrix(nrow = n, ncol = p * (p + 1))
        xx[, 1:p] <- mm * y
        for (i in 1:p) xx[, p * i + (1:p)] <- mm * mm[, i]
        xyqr <- biglm:::update.bigqr(xyqr, xx, rep(0, n), w * w)
        rval$sandwich <- list(xy = xyqr)
    }
    rval$df.resid <- rval$n - length(qr$D)
    class(rval) <- "biglm"
    rval
}

R Indexing, Matrix multiplication

I seem to have a misunderstanding about memory usage when using a subset of a matrix in R. I came across it when I tried to program a cross-validation function, but I think the problem is more general. I have cooked up a small example below.
# parameters
n <- 1e6 # the real data are much bigger, but this will do
m <- 50
nfolds <- 10
X <- matrix(rnorm(n*m,0,1),nrow=n,ncol=m)
y <- rnorm(n,0,1)
mse <- rep(0,nfolds)
foldid <- sample(rep(seq(nfolds), length = n))
# produces big spikes in memory
for (i in 1:nfolds) {
  which <- foldid == i
  xpx <- crossprod(X[!which,])
  xpy <- crossprod(X[!which,], y[!which])
  b <- solve(xpx, xpy)
  mse[i] <- mean((y[which] - X[which,] %*% b)**2)
}
# does not produce spikes in memory usage
for (i in 1:nfolds) {
  xpx <- crossprod(X)
  xpy <- crossprod(X, y)
  b <- solve(xpx, xpy)
  mse[i] <- mean((y - X %*% b)**2)
}
I don't understand why the first loop produces big upward spikes in memory usage, whereas the second loop doesn't, even though a strictly larger matrix is multiplied.
Let's compare the first lines within the loops.
First, the simple crossprod:
xpx <- crossprod(X)
Without subsetting, you work with the matrices X (already in memory, 400 MB) and xpx (small).
Second, with subsetting:
xpx <- crossprod(X[!which,])
Here you work with X, the temporary matrix X[!which,], and xpx. The temporary matrix X[!which,] requires an additional 360 MB of memory.
object.size(X[!which,])
# 360000200 bytes
R has relatively poor memory management, so the temporary matrix may not be discarded for some time.
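If the spikes are still a problem, one workaround (my suggestion, not part of the answer above) is to form the full cross products once and subtract each fold's much smaller contribution, using the identity crossprod(X[!which,]) = crossprod(X) - crossprod(X[which,]):
xpx.full <- crossprod(X)          # m x m, computed once outside the loop
xpy.full <- crossprod(X, y)
for (i in 1:nfolds) {
  which <- foldid == i
  Xi <- X[which, , drop = FALSE]  # held-out fold: only ~n/nfolds rows are copied
  yi <- y[which]
  b <- solve(xpx.full - crossprod(Xi), xpy.full - crossprod(Xi, yi))
  mse[i] <- mean((yi - Xi %*% b)^2)
}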

Matrix computation with for loop

I am a newcomer to R, having migrated from GAUSS because of license verification issues.
I want to speed up the following code, which creates an n×k matrix A. Given the n×1 vector x and the k-dimensional parameter vectors mu and sig, A is created as A[i,j] = dnorm(x[i], mu[j], sig[j]). The following code works fine for small sizes (n = 40, k = 4), but slows down significantly when n is around 10^6 and k is about n^(1/3).
I am doing a simulation experiment to verify bootstrap validity, so I need to recompute the matrix A (#simulations × #bootstraps) times, and it becomes quite time-consuming as I want to experiment with many different values of n and k. I vectorized the code as much as I could (thanks to the vector arguments of dnorm), but can I ask for more speed-up?
Thanks in advance for any help.
x = rnorm(40)
mu = c(-1,0,4,5)
sig = c(2^2,0.5^2,2^2,3^2)
n = length(x)
k = length(mu)
A = matrix(NA,n,k)
for (j in 1:k) {
  A[,j] = dnorm(x, mu[j], sig[j])
}
Your method can be put into a function like this
A.fill <- function(x, mu, sig) {
  k <- length(mu)
  n <- length(x)
  A <- matrix(NA, n, k)
  for (j in 1:k) A[,j] <- dnorm(x, mu[j], sig[j])
  A
}
and it's clear that you are filling the matrix A column by column.
R stores the entries of a matrix column-wise (just like Fortran).
This means that the matrix can be filled with a single call of dnorm, using suitable repetitions of x, mu, and sig. The resulting vector z holds the columns of the desired matrix stacked on top of each other, and the matrix to be returned can then be formed from that vector just by specifying the number of rows and columns. See the following function:
B.fill <- function(x, mu, sig) {
  k <- length(mu)
  n <- length(x)
  z <- dnorm(rep(x, times = k), rep(mu, each = n), rep(sig, each = n))
  B <- matrix(z, nrow = n, ncol = k)
  B
}
Let's make an example with your data and test this as follows:
N <- 40
set.seed(11)
x <- rnorm(N)
mu <- c(-1,0,4,5)
sig <- c(2^2,0.5^2,2^2,3^2)
A <- A.fill(x,mu,sig)
B <- B.fill(x,mu,sig)
all.equal(A,B)
# [1] TRUE
I'm assuming that n is an integer multiple of k.
Addition
As noted in the comments, B.fill is quite slow for large values of n.
The reason lies in the construct rep(..., each = ...).
So, is there a way to speed up A.fill?
I tested this function:
C.fill <- function(x, mu, sig) {
  k <- length(mu)
  n <- length(x)
  sapply(1:k, function(j) dnorm(x, mu[j], sig[j]), simplify = TRUE)
}
This function is about 20% faster than A.fill.
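A closely related variant (my addition, not from the original answer) uses vapply, which pins down the result's type and shape instead of relying on sapply's simplification:
D.fill <- function(x, mu, sig) {
  ## one column of dnorm values per (mu, sig) pair; the result is an n x k matrix
  vapply(seq_along(mu), function(j) dnorm(x, mu[j], sig[j]), numeric(length(x)))
}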

R add series of vectors to matrix and operate function with each addition

I have the following conundrum. The piece of code below takes a vector from a pool of vectors, rbinds it to a matrix, applies a function to the new matrix, and returns a scalar result.
In2 <- diag(nXtr+1)
mu <- array(1,c(dim(Xcal)[1],1))
Y.hat.calib <- array(0,c(nC,1))
alpha <- array(0,c(nC,1))
P = c()
for (i in 1:dim(Xcal)[1]) {
  Xtr2 <- rbind(Xtr, Xcal[i,])
  K2 <- (Xtr2 %*% t(Xtr2) + 1)^2
  rowCnt <- dim(Xtr2)[1]
  mu[i] <- sqrt(1 + t(c(rep(1, rowCnt - 1), 0)) %*% solve(K2 + a*In2) %*% K2 %*% c(rep(0, rowCnt - 1), 1))
  #---------------------------------------------------------------------
  Y.hat.calib[i] <- kCal[,i] %*% solve(K + a*In) %*% Ytr
  alpha[i] <- abs(Y.hat.calib[i] - Ycal[i]) / mu[i]
  P <- c(P, alpha[i])
  #---------------------------------------------------------------------
}
I have pre-allocated where needed, but really need to get rid of the loop as it's too time-consuming. I have played around with various ideas, but cannot come up with a way to do this.
Any help would be gratefully received, as always. If there is anything I have missed, please let me know.
Getting rid of for loops won't automatically make things faster. The biggest change you can make to your code is avoiding calls to solve whenever possible; solve is very computationally intensive.
I didn't try to make sure that this is bug-free, because you haven't provided sample data. But you can follow the general idea: don't call solve in every iteration, logically separate the calculation of mu from alpha, and replace column-by-column operations with matrix multiplication where you can.
# Your mu boils down to this:
get.mu <- function(Xcal.i) {
  Xtr2 <- rbind(Xtr, Xcal.i)
  K2 <- (Xtr2 %*% t(Xtr2) + 1)^2
  rowCnt <- nrow(Xtr2)
  first.solve <- solve(K2 + a.times.In2) %*% K2
  left  <- c(rep(1, rowCnt - 1), 0)
  right <- c(rep(0, rowCnt - 1), 1)
  sqrt(1 + t(left) %*% first.solve %*% right)  # mu
}
In2 <- diag(nXtr + 1)
# Constants that you recalculated in every iteration.
a.times.In2 <- a * In2
second.solve <- solve(K + a*In) %*% Ytr
# Y.hat.calib can be fully calculated in one matrix multiplication
# (kCal has one column per calibration point, as in kCal[, i] above).
Y.hat.calib <- crossprod(kCal, second.solve)
# Which means that the difference is a constant vector:
Y.diff <- abs(Y.hat.calib - Ycal)
# So mu and alpha can be calculated like:
mu <- apply(Xcal, 1, get.mu)   # one calibration row at a time
alpha <- as.vector(Y.diff) / mu

R optimize not giving the finite minimum but Inf when the search interval is wider

I have a problem with optimize().
When I limit the search to a small interval around zero, e.g. (-1, 1), the optimize algorithm gives a finite minimum with a finite objective function value.
But when I widen the interval to (-10, 10), the minimum lands on the boundary of the interval and the objective is Inf, which is really puzzling to me.
How can this happen, and how can I fix it? Thanks a lot in advance.
The following is my code.
set.seed(123)
n <- 120
c <- rnorm(n,mean=1,sd=.3);
eps <- rnorm(n,mean=0,sd=5)
tet <- 32
r <- eps * c^tet
x <- matrix(c(c,r), ncol=2)
g <- function(tet, x) {
  matrix((x[,1]^(-tet)) * x[,2], ncol = 1)
}
theta <- 37
g_t <- g(theta, x)
f.tau <- function(tau) {
  exp.tau.g <- exp(g_t %*% tau)
  g.exp <- NULL; i <- 1:n
  g.exp <- matrix(exp.tau.g[i,] * g_t[i,], ncol = 1)
  sum.g.exp <- apply(g.exp, 2, sum)
  v <- t(sum.g.exp) %*% sum.g.exp
  return(v)
}
band.tau <- 1;
f <- optimize(f.tau, c(-band.tau, band.tau), tol=1e-20)
print("interval=(-1, 1)"); print(f);
band.tau <- 10;
f <- optimize(f.tau, c(-band.tau, band.tau), tol=1e-20)
print("interval=(-10, 10)"); print(f);
The problem is that your function f.tau(x) is not well behaved. You can see that here:
vect.f <- Vectorize(f.tau)
z1 <- seq(-1,1,by=0.01)
z10 <- seq(-10,10,by=0.01)
par(mfrow=c(2,1), mar=c(2,2,1,1))
plot(z1, log(vect.f(z1)), type="l")
plot(z10,log(vect.f(z10)),type="l")
Note that these are plots of log(f.tau). So there are two problems: f.tau(...) has an extremely large slope on either side of the minimum, and f.tau is Inf for x < -0.6 and x > 1.0, where Inf means that f.tau(...) is greater than the largest number that can be represented on this system. When you set the range to (-1, 1), your starting point is close enough to the minimum that optimize(...) manages to converge. When you set the limits to (-10, 10), the starting point is too far away. There are examples in the documentation which show a similar problem with functions that are not nearly as ill-behaved as f.tau.
EDIT (Response to OP's comment)
The main problem is that you are trying to optimize a function which has computational infinities in the interval of interest. Here's a way around that.
band.tau <- 10
z <- seq(-band.tau,band.tau,length=1000)
vect.f <- Vectorize(f.tau)
interval <- range(z[is.finite(vect.f(z))])
f <- optimize(f.tau, interval, tol=1e-20)
f
# $minimum
# [1] 0.001615433
#
# $objective
# [,1]
# [1,] 7.157212e-12
This evaluates f.tau(x) at 1000 equally spaced points on (-band.tau, +band.tau), identifies all the values of x where f.tau is finite, and uses that range as the search interval in optimize(...). This works in your case because f.tau(x) does not (appear to...) have asymptotes.
