Matrix computation with for loop - r

I am a newcomer to R, having migrated from GAUSS because of its license verification issues.
I want to speed up the following code, which creates an n×k matrix A. Given an n×1 vector x and k-dimensional parameter vectors mu and sig, A is defined by A[i,j] = dnorm(x[i], mu[j], sig[j]). The code below works fine for small sizes such as n = 40 and k = 4, but slows down significantly when n is around 10^6 and k is roughly n^{1/3}.
I am running a simulation experiment to verify bootstrap validity, so I need to recompute the matrix A (number of simulations × number of bootstrap replications) times, and this becomes rather time-consuming because I want to experiment with many different values of n and k. I have vectorized the code as much as I could (thanks to the vector arguments of dnorm), but is there any way to speed it up further?
Thanks in advance for any help.
x = rnorm(40)
mu = c(-1,0,4,5)
sig = c(2^2,0.5^2,2^2,3^2)
n = length(x)
k = length(mu)
A = matrix(NA,n,k)
for(j in 1:k){
A[,j]=dnorm(x,mu[j],sig[j])
}

Your method can be put into a function like this
A.fill <- function(x,mu,sig) {
k <- length(mu)
n <- length(x)
A <- matrix(NA,n,k)
for(j in 1:k) A[,j] <- dnorm(x,mu[j],sig[j])
A
}
and it's clear that you are filling the matrix A column by column.
R stores the entries of a matrix columnwise (just like Fortran).
This means that the matrix can be filled with a single call of dnorm, using suitable repetitions of x, mu, and sig. The resulting vector z has the columns of the desired matrix stacked on top of each other, and the matrix to be returned can then be formed from that vector simply by specifying the number of rows and columns. See the following function:
B.fill <- function(x,mu,sig) {
k <- length(mu)
n <- length(x)
z <- dnorm(rep(x,times=k),rep(mu,each=n),rep(sig,each=n))
B <- matrix(z,nrow=n,ncol=k)
B
}
Let's make an example with your data and test this as follows:
N <- 40
set.seed(11)
x <- rnorm(N)
mu <- c(-1,0,4,5)
sig <- c(2^2,0.5^2,2^2,3^2)
A <- A.fill(x,mu,sig)
B <- B.fill(x,mu,sig)
all.equal(A,B)
# [1] TRUE
I'm assuming that n is an integer multiple of k.
Addition
As noted in the comments, B.fill is quite slow for large values of n.
The reason lies in the construct rep(..., each=...).
So, is there a way to speed up A.fill?
I tested this function:
C.fill <- function(x,mu,sig) {
k <- length(mu)
n <- length(x)
sapply(1:k,function(j) dnorm(x,mu[j],sig[j]), simplify=TRUE)
}
This function is about 20% faster than A.fill.
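For reference, the three fillers can be compared directly on a larger problem with the microbenchmark package; this is just a timing sketch (the package choice and the problem size are mine, not from the post):
library(microbenchmark)
set.seed(11)
x <- rnorm(1e5)
mu <- c(-1,0,4,5)
sig <- c(2^2,0.5^2,2^2,3^2)
# time each filler 20 times on the same inputs
microbenchmark(A = A.fill(x,mu,sig),
               B = B.fill(x,mu,sig),
               C = C.fill(x,mu,sig),
               times = 20)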

Related

Faster alternative to R car::Anova for sum of square crossproduct matrix calculation for subsets of predictors

I need to compute the sum of squares crossproduct matrix (indeed the trace of this matrix) in a multivariate linear model, with Y (n x q) and X (n x p). Standard R code for doing that is:
require(MASS)
require(car)
# Example data
q <- 10
n <- 1000
p <- 10
Y <- mvrnorm(n, mu = rep(0, q), Sigma = diag(q))
X <- as.data.frame(mvrnorm(n, mu = rnorm(p), Sigma = diag(p)))
# Fit lm
fit <- lm( Y ~ ., data = X )
# Type I sums of squares
summary(manova(fit))$SS
# Type III sums of squares
type = 3 # could be also 2 (II)
car::Anova(fit, type = type)$SSP
This has to be done thousands of times, and unfortunately it gets slow when the number of predictors is relatively large. Since I am often interested only in a subset of s predictors, I tried to re-implement this calculation. Although my implementation, directly translating the linear algebra for s = 1 (below), is faster for small sample sizes (n),
# Hat matrix (X here stands for the actual design matrix)
H <- tcrossprod(tcrossprod(X, solve(crossprod(X))), X)
# Remove predictor of interest (e.g. 2)
X.r <- X[, -2]
H1 <- tcrossprod(tcrossprod(X.r, solve(crossprod(X.r))), X.r)
# Compute e.g. type III sum of squares
SS <- crossprod(Y, H - H1) %*% Y
car still goes faster for large n.
I already tried an Rcpp implementation without much success, since these matrix products in R already use very efficient code.
Any hint on how to do this faster?
UPDATE
After reading the answers, I tried the solution proposed in this post, which relies on QR/SVD/Cholesky factorization for the hat matrix calculation. However, it seems that car::Anova is still faster at computing all p = 30 matrices than I am at computing just one (s = 1), e.g. for n = 5000, q = 10:
Unit: milliseconds
expr min lq mean median uq max neval
ME 1137.5692 1202.9888 1257.8979 1251.6834 1318.9282 1398.9343 10
QR 1005.9082 1031.9911 1084.5594 1037.5659 1095.7449 1364.9508 10
SVD 1026.8815 1065.4629 1152.6631 1087.9585 1241.4977 1446.8318 10
Chol 969.9089 1056.3093 1115.9608 1102.1169 1210.7782 1267.1274 10
CAR 205.1665 211.8523 218.6195 214.6761 222.0973 242.4617 10
UPDATE 2
The best solution for now was to go over the car::Anova code (i.e. functions car:::Anova.III.mlm and subsequently car:::linearHypothesis.mlm) and re-implement them to account for a subset of predictors, instead of all of them.
The relevant code by car is as follows (I skipped checks, and simplified a bit):
B <- coef(fit) # Model coefficients
M <- model.matrix(fit) # Model matrix M
V <- solve(crossprod(M)) # inverse of M'M
p <- ncol(M) # Number of predictors in M
I.p <- diag(p) # Identity (p x p)
terms <- labels(terms(fit)) # terms (add intercept)
terms <- c("(Intercept)", terms)
n.terms <- length(terms)
assign <- fit$assign # mapping of terms to the p columns of M
SSP <- as.list(rep(0, n.terms)) # Initialize empty list for sums of squares cross-product matrices
names(SSP) <- terms
for (term in 1:n.terms){
subs <- which(assign == term - 1)
L <- I.p[subs, , drop = FALSE]
SSP[[term]] <- t(L %*% B) %*% solve(L %*% V %*% t(L)) %*% (L %*% B)
}
Then it is just a matter of selecting the subset of terms.
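For illustration, here is a minimal sketch of that subsetting, reusing the objects built above; the term labels ("x1", "x2") are hypothetical and stand for whatever subset you actually need:
subset.terms <- c("x1", "x2")  # hypothetical subset of terms of interest
idx <- match(subset.terms, terms)
SSP.sub <- setNames(as.list(rep(0, length(idx))), subset.terms)
for (m in seq_along(idx)) {
  subs <- which(assign == idx[m] - 1)  # columns of M belonging to this term
  L <- I.p[subs, , drop = FALSE]
  SSP.sub[[m]] <- t(L %*% B) %*% solve(L %*% V %*% t(L)) %*% (L %*% B)
}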
This line and the similar one below it for H1 could probably be improved:
H <- tcrossprod(tcrossprod(X, solve(crossprod(X))), X)
The general idea is that you should rarely use solve(Y) %*% Z, because it is the same as solve(Y, Z) but slower. I haven't fully expanded your tcrossprod calls to see what the best equivalent formulation of the expressions for H and H1 would be.
You could also look at this question https://stats.stackexchange.com/questions/139969/speeding-up-hat-matrices-like-xxx-1x-projection-matrices-and-other-as for a description of doing it via QR decomposition.
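To make that concrete, here is a minimal sketch of both suggestions; it assumes X is the full numeric design matrix (intercept column included) and Y is the n x q response matrix, and it never forms the n x n hat matrices, computing H %*% Y and H1 %*% Y directly instead:
SS.partial <- function(Y, X, drop = 2) {  # drop = 2 mirrors the example above
  # H %*% Y without forming H: fitted values from the full design
  # (qr.fitted(qr(X), Y) would be the QR-based alternative)
  fit.full <- X %*% solve(crossprod(X), crossprod(X, Y))
  # the same for the reduced design with the predictor of interest removed
  Xr <- X[, -drop, drop = FALSE]
  fit.reduced <- Xr %*% solve(crossprod(Xr), crossprod(Xr, Y))
  crossprod(Y, fit.full - fit.reduced)  # Y' (H - H1) Y, a q x q matrix
}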

Estimating an OLS model in R with million observations and thousands of variables

I am trying to estimate a big OLS regression with ~1 million observations and ~50,000 variables using biglm.
I am planning to run each estimation using chunks of approximately 100 observations each. I tested this strategy with a small sample and it worked fine.
However, with the real data I am getting an "Error: protect(): protection stack overflow" when trying to define the formula for the biglm function.
I've already tried:
starting R with --max-ppsize=50000
setting options(expressions = 50000)
but the error persists
I am working on Windows and using RStudio.
# create the sample data frame (In my true case, I simply select 100 lines from the original data that contains ~1,000,000 lines)
DF <- data.frame(matrix(nrow=100,ncol=50000))
DF[,] <- rnorm(100*50000)
colnames(DF) <- c("y", paste0("x", seq(1:49999)))
# get names of covariates
my_xvars <- colnames(DF)[2:( ncol(DF) )]
# define the formula to be used in biglm
# HERE IS WHERE I GET THE ERROR :
my_f <- as.formula(paste("y~", paste(my_xvars, collapse = " + ")))
EDIT 1:
The ultimate goal of my exercise is to estimate the average effect of all 50,000 variables. Therefore, simplifying the model selecting fewer variables is not the solution I am looking for now.
The first bottleneck (I can't guarantee there won't be others) is in the construction of the formula. R can't construct a formula that long from text (the details are too ugly to explore right now). Below I show a hacked version of the biglm code that can take the model matrix X and the response variable y directly, rather than using a formula to build them. However, the next bottleneck is that the internal function biglm:::bigqr.init(), which gets called inside biglm, tries to allocate a numeric vector of size choose(nc,2) = nc*(nc-1)/2 (where nc is the number of columns). When I try with 50000 columns I get
Error: cannot allocate vector of size 9.3 Gb
(2.3Gb are required when nc is 25000). The code below runs on my laptop when nc <- 10000.
I have a few caveats about this approach:
you won't be able to handle a problem with 50000 columns unless you have at least 10G of memory, because of the issue described above.
the biglm:::update.biglm will have to be modified in a parallel way (this shouldn't be too hard)
I have no idea if the p>>n issue (which applies at the level of fitting the initial chunk) will bite you. When running my example below (with 10 rows, 10000 columns), all but 10 of the parameters are NA. I don't know if these NA values will contaminate the results so that successive updating fails. If so, I don't know if there's a way to work around the problem, or if it's fundamental (so that you would need nr>nc for at least the initial fit). (It would be straightforward to do some small experiments to see if there is a problem, but I've already spent too long on this ...)
don't forget that with this approach you have to explicitly add an intercept column to the model matrix (e.g. X <- cbind(1,X)) if you want one.
Example (first save the code at the bottom as my_biglm.R):
nr <- 10
nc <- 10000
DF <- data.frame(matrix(rnorm(nr*nc),nrow=nr))
respvars <- paste0("x", seq(nc-1))
names(DF) <- c("y", respvars)
# illustrate formula problem: fails somewhere in 15000 < nc < 20000
try(reformulate(respvars,response="y"))
source("my_biglm.R")
rr <- my_biglm(y=DF[,1],X=as.matrix(DF[,-1]))
my_biglm <- function (formula, data, weights = NULL, sandwich = FALSE,
y=NULL, X=NULL, off=0) {
if (!is.null(weights)) {
if (!inherits(weights, "formula"))
stop("`weights' must be a formula")
w <- model.frame(weights, data)[[1]]
} else w <- NULL
if (is.null(X)) {
tt <- terms(formula)
mf <- model.frame(tt, data)
if (is.null(off <- model.offset(mf)))
off <- 0
mm <- model.matrix(tt, mf)
y <- model.response(mf) - off
} else {
## model matrix specified directly
if (is.null(y)) stop("both y and X must be specified")
mm <- X
tt <- NULL
}
qr <- biglm:::bigqr.init(NCOL(mm))
qr <- biglm:::update.bigqr(qr, mm, y, w)
rval <- list(call = sys.call(), qr = qr, assign = attr(mm,
"assign"), terms = tt, n = NROW(mm), names = colnames(mm),
weights = weights)
if (sandwich) {
p <- ncol(mm)
n <- nrow(mm)
xyqr <- biglm:::bigqr.init(p * (p + 1))
xx <- matrix(nrow = n, ncol = p * (p + 1))
xx[, 1:p] <- mm * y
for (i in 1:p) xx[, p * i + (1:p)] <- mm * mm[, i]
xyqr <- update(xyqr, xx, rep(0, n), w * w)
rval$sandwich <- list(xy = xyqr)
}
rval$df.resid <- rval$n - length(qr$D)
class(rval) <- "biglm"
rval
}
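A quick sanity check on a small problem (the object names and sizes below are made up) compares the direct-matrix interface against lm(); note the explicit intercept column:
set.seed(1)
ns <- 50; ps <- 5
Xs <- cbind(1, matrix(rnorm(ns*ps), ns, ps))  # explicit intercept column
ys <- rnorm(ns)
small <- my_biglm(y = ys, X = Xs)
all.equal(unname(coef(small)), unname(coef(lm(ys ~ Xs[,-1]))))
# should be TRUE (up to numerical tolerance)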

How can I simulate m random samples of size n from a given distribution with R?

I know how to generate a random sample of size n from a standard statistical distribution, say exponential. But if I want to generate m such random samples of size n (i.e. m vectors of dimension n) how can I do it?
To create an n by m matrix containing m samples of size n, you can use:
x <- replicate(m, rnorm(n, ...))
Obviously, substitute rnorm with another distribution's generator if desired. If you then want to store these samples in separate individual vectors, you can use
v <- x[ , i]
This puts the ith column of x into v, which corresponds to the ith sample. It may be easier/quicker to just use a simple for loop altogether though:
for(i in 1:m){
name <- paste("V", i, sep = "")
assign(name, rnorm(n, ...))
}
This generates a random sample at each iteration, and for stage i, names the sample Vi. By the end of it you'll have m random samples named V1, V2, ..., Vm.
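For instance, a minimal sketch for the exponential case mentioned in the question (the sizes and the rate parameter are illustrative):
set.seed(1)
n <- 100  # sample size
m <- 5    # number of samples
samples <- replicate(m, rexp(n, rate = 1))  # n x m matrix, one sample per column
dim(samples)  # 100 5
first.sample <- samples[, 1]  # the first sample as a plain vector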

Alternative way to loop for faster computing

Let's say I have an n*p data frame.
I have computed a list of n matrices of dimension p*p (named listMat in the R script below), in which each matrix is the distance matrix between the p variables for each of the n respondents.
I want to compute an n*n matrix called normMat, with each element corresponding to the norm of the difference between a pair of distance matrices. For example, normMat[1,2] will be the norm of a matrix diffMat, where diffMat is the difference between the 1st and the 2nd distance matrix in the list listMat.
I wrote the following script, which works fine, but I'm wondering if there is a more efficient way to write it, avoiding the loops (using for example lapply, etc.) so that the script runs faster.
# example of n = 3 distance matrices between p = 5 variables
x <- abs(matrix(rnorm(1:25),5,5))
y <- abs(matrix(rnorm(1:25),5,5))
z <- abs(matrix(rnorm(1:25),5,5))
listMat <- list(x, y, z)
n <- length(listMat)
normMat <- matrix(NA,n,n)
for (numRow in 1:n){
for (numCol in 1:n){
diffMat <- listMat[[numRow]] - listMat[[numCol]]
normMat[numRow, numCol] <- norm(diffMat, type="F")
}
}
Thanks for your help.
Try:
normMat <- function(x, y) {
norm(x-y, type="F")
}
sapply(listMat, function(x) sapply(listMat, function(y) normMat(x,y)))
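As a quick check that the nested sapply reproduces the double loop from the question, using its example objects:
n <- length(listMat)  # number of matrices
normMat.loop <- matrix(NA, n, n)
for (numRow in 1:n) {
  for (numCol in 1:n) {
    normMat.loop[numRow, numCol] <- norm(listMat[[numRow]] - listMat[[numCol]], type = "F")
  }
}
normMat.sapply <- sapply(listMat, function(x) sapply(listMat, function(y) normMat(x, y)))
all.equal(normMat.loop, normMat.sapply)
# should be TRUE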

compute samples variance without loops

Here is what I want to do:
I have a time series data frame with, let us say, 100 time series of length 600, each in one column of the data frame.
I want to pick 10 of the time series at random and assign them random weights that sum to one. Using those, I want to compute the variance of the sum of the 10 weighted time series (i.e. of the convex combination).
The df is in the form
v1,v2,v3,.....v100
1,5,6,.......9
2,4,6,.......10
3,5,8,.......6
2,2,8,.......2
etc
I can compute it inside a loop, but R is vector-oriented and a loop is not efficient.
ntrials = 10000
ts.sd = NULL
# weights and samples are assumed to be matrices with one column per trial:
# weights[, x] holds the 10 weights, samples[, x] the 10 sampled column indices
for (x in 1:ntrials)
{
temp = t(weights[,x]) %*% cov(df[, samples[, x]]) %*% weights[, x]
ts.sd = cbind(ts.sd, temp)
}
Not sure what type of "random" you want for your weights... so I'll use uniform draws rescaled so that they sum to one:
x=as.data.frame(matrix(sample(1:20, 100*600, replace=TRUE), ncol=100))
myfun <- function(inc, DF=x) {  # inc is the (unused) iteration index supplied by lapply
w = runif(10)
w = w / sum(w)  # rescale the weights to sum to one
t(w) %*% cov(DF[, sample(seq_along(DF), 10)]) %*% w
}
lapply(1:ntrials, myfun)
However, this isn't really avoiding loops per se, since lapply is just an efficient looping construct. That said, for loops in R aren't inherently bad or inefficient. Growing a data structure, like you're doing with cbind, however, is.
But in this case, since you're only growing it by appending a single element at a time, it really won't change things much. The "correct" version would be to pre-allocate your vector ts.sd using ntrials.
ts.sd = vector(mode='numeric', length=ntrials)
Then in your loop, assign into it by index:
for (x in 1:ntrials)
{
temp = t(weights[,x]) %*% cov(df[, samples[, x]]) %*% weights[, x]
ts.sd[x] = temp
}
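If you would rather skip the manual pre-allocation entirely, sapply collects the results into a numeric vector directly; this sketch assumes, as in the question, that weights and samples are matrices with one column per trial:
ts.sd <- sapply(1:ntrials, function(x) {
  w <- weights[, x]
  drop(t(w) %*% cov(df[, samples[, x]]) %*% w)  # scalar variance for trial x
})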
