Faster alternative to R car::Anova for sum of square crossproduct matrix calculation for subsets of predictors - r

I need to compute the sum of squares and crossproducts (SSCP) matrix (in fact, only its trace) in a multivariate linear model, with Y (n x q) and X (n x p). The standard R code for doing that is:
require(MASS)
require(car)
# Example data
q <- 10
n <- 1000
p <- 10
Y <- mvrnorm(n, mu = rep(0, q), Sigma = diag(q))
X <- as.data.frame(mvrnorm(n, mu = rnorm(p), Sigma = diag(p)))
# Fit lm
fit <- lm( Y ~ ., data = X )
# Type I sums of squares
summary(manova(fit))$SS
# Type III sums of squares
type <- 3 # could also be 2 (type II)
car::Anova(fit, type = type)$SSP
This has to be done thousands of times, and unfortunately it gets slow when the number of predictors is relatively large. As I am often interested only in a subset of s predictors, I tried to re-implement this calculation myself. Although my implementation, which directly translates the linear algebra for s = 1 (below), is faster for small sample sizes (n),
# Hat matrix (X here stands for the actual design matrix)
H <- tcrossprod(tcrossprod(X, solve(crossprod(X))), X)
# Remove predictor of interest (e.g. 2)
X.r <- X[, -2]
H1 <- tcrossprod(tcrossprod(X.r, solve(crossprod(X.r))), X.r)
# Compute e.g. type III sum of squares
SS <- crossprod(Y, H - H1) %*% Y
car still goes faster for large n.
I already tried an Rcpp implementation without much success, as these matrix products in R already use very efficient code.
Any hint on how to do this faster?
UPDATE
After reading the answers, I tried the solution proposed in this post, which relies on QR/SVD/Cholesky factorization for the hat matrix calculation. However, it seems that car::Anova is still faster at computing all p = 30 matrices than my code computing just one (s = 1)! For e.g. n = 5000, q = 10:
Unit: milliseconds
 expr       min        lq      mean    median        uq       max neval
   ME 1137.5692 1202.9888 1257.8979 1251.6834 1318.9282 1398.9343    10
   QR 1005.9082 1031.9911 1084.5594 1037.5659 1095.7449 1364.9508    10
  SVD 1026.8815 1065.4629 1152.6631 1087.9585 1241.4977 1446.8318    10
 Chol  969.9089 1056.3093 1115.9608 1102.1169 1210.7782 1267.1274    10
  CAR  205.1665  211.8523  218.6195  214.6761  222.0973  242.4617    10
UPDATE 2
The best solution for now was to go over the car::Anova code (i.e. functions car:::Anova.III.mlm and subsequently car:::linearHypothesis.mlm) and re-implement them to account for a subset of predictors, instead of all of them.
The relevant code by car is as follows (I skipped checks, and simplified a bit):
B <- coef(fit)                  # model coefficients
M <- model.matrix(fit)          # model (design) matrix M
V <- solve(crossprod(M))        # (M'M)^-1
p <- ncol(M)                    # number of columns of M (including the intercept)
I.p <- diag(p)                  # identity matrix (p x p)
terms <- labels(terms(fit))     # model terms (the intercept is added next)
terms <- c("(Intercept)", terms)
n.terms <- length(terms)
assign <- fit$assign            # mapping between terms and the p columns of M
SSP <- as.list(rep(0, n.terms)) # initialize empty list for the sums of squares cross-product matrices
names(SSP) <- terms
for (term in 1:n.terms){
  subs <- which(assign == term - 1)
  L <- I.p[subs, , drop = FALSE]
  SSP[[term]] <- t(L %*% B) %*% solve(L %*% V %*% t(L)) %*% (L %*% B)
}
Then it is just a matter of selecting the subset of terms.
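For instance, restricting the loop above to a subset of terms (and taking traces directly) could look like the following sketch; keep is a hypothetical vector of term labels, and B, V, I.p, assign and terms are the objects defined in the snippet above:
# Sketch only: SSCP matrices (and their traces) for a subset of terms
keep <- c("V1", "V2")   # hypothetical subset of term labels
idx  <- match(keep, terms)
SSP.sub <- lapply(idx, function(term) {
  subs <- which(assign == term - 1)
  L <- I.p[subs, , drop = FALSE]
  t(L %*% B) %*% solve(L %*% V %*% t(L)) %*% (L %*% B)
})
names(SSP.sub) <- keep
traces <- vapply(SSP.sub, function(S) sum(diag(S)), numeric(1))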

This line and the similar one below it for H1 could probably be improved:
H <- tcrossprod(tcrossprod(X, solve(crossprod(X))), X)
The general idea is that you should rarely use solve(Y) %*% Z, because it is the same as solve(Y, Z) but slower. I haven't fully expanded your tcrossprod calls to see what the best equivalent formulation of the expressions for H and H1 would be.
You could also look at this question https://stats.stackexchange.com/questions/139969/speeding-up-hat-matrices-like-xxx-1x-projection-matrices-and-other-as for a description of doing it via QR decomposition.
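To make that concrete, here is a minimal sketch (untested, with X standing for the full-rank design matrix as in the question) of both alternatives: replacing the explicit inverse with solve(A, B), and going through the QR decomposition described in the linked question:
# Avoid the explicit inverse: X (X'X)^-1 X'  ==  X %*% solve(crossprod(X), t(X))
H_solve <- X %*% solve(crossprod(X), t(X))
# Via the thin QR decomposition: if X = QR, then H = Q %*% t(Q)
Q <- qr.Q(qr(X))
H_qr <- tcrossprod(Q)
# Either H can then be plugged into Y' (H - H1) Y exactly as before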

Related

compact/efficient replacement for diag(X V X^T)?

When making predictions for a linear statistical model we usually have a model matrix X of predictors corresponding to the points at which we want to make predictions; a vector of coefficients beta; and a variance-covariance matrix V. Computing the predictions is just X %*% beta. The most straightforward way to compute the variances of the predictions is
diag(X %*% V %*% t(X))
or slightly more efficiently
diag(X %*% tcrossprod(V,X))
However, this is very inefficient, because it constructs an n*n matrix when all we really want is the diagonal. I know I could write some Rcpp-loopy thing that would compute just the diagonal terms, but I'm wondering if there is an existing linear algebra trick in R that will nicely do what I want ... (if someone wants to write the Rcpp-loopy thing for me as an answer I wouldn't object, but I'd prefer a pure-R solution)
FWIW predict.lm seems to do something clever by multiplying X by the inverse of the R component of the QR-decomposition of the lm; I'm not sure that's always going to be available, but it might be a good starting point (see here)
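For what it's worth, here is a rough sketch of that R-factor trick, assuming a full-rank lm fit where no column pivoting occurred, and using mtcars purely as a stand-in example:
fit <- lm(mpg ~ wt + hp, data = mtcars)
X0  <- model.matrix(fit)                        # points at which to predict (here: the training data)
R   <- qr.R(fit$qr)                             # R factor of the QR decomposition stored in the fit
W   <- t(backsolve(R, t(X0), transpose = TRUE)) # W = X0 %*% solve(R)
v   <- sigma(fit)^2 * rowSums(W^2)              # diag(X0 %*% V %*% t(X0)) with V = sigma^2 (X'X)^{-1}
# Check against the standard errors from predict.lm
all.equal(unname(sqrt(v)), unname(predict(fit, se.fit = TRUE)$se.fit))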
Along the lines of this Octave/Matlab question, for two matrices A and B, we can use the fact that the nth diagonal entry of AB is the dot product of the nth row of A with the nth column of B. We can naively extend that to the case of three matrices, ABC. I have not considered how to optimize for the case where C = A^T, but aside from that, this code looks like a promising speedup:
start_time <- Sys.time()
A <- matrix(1:1000000, nrow = 1000, ncol = 1000)
B <- matrix(1000000:1, nrow = 1000, ncol = 1000)
# Try one of these two
res <- diag(A %*% B %*% t(A))      # ~0.47s
res <- rowSums(A * t(B %*% t(A)))  # ~0.27s
end_time <- Sys.time()
print(end_time - start_time)
Using tcrossprod did not appear to accelerate the results when I ran this code. However, just using the row-sum-dot-product approach appears to be a lot more efficient already, at least on this silly example, which suggests (though I'm not sure) that rowSums is not computing the full intermediate matrices before returning the diagonal entries, as I'd expect happens with diag.
I am not quite sure how efficient this is, but the idea is: find U such that V = U %*% t(U) (this is possible since V is a covariance matrix), then
XU = X %*% U
result = apply(XU, 1, function(x) sum(x^2))
This works because the row-wise sums of squares of XU are the diagonal of XU %*% t(XU) = X %*% V %*% t(X).
Demo
V <- cov(iris[, -5])
X <- as.matrix(iris[1:5, -5])
Using SVD
svd_v <- svd(V)
U <- svd_v$u %*% diag(sqrt(svd_v$d))
XU = X %*% U
apply(XU, 1, function(x) sum(x^2))
# 1 2 3 4 5
#41.35342 39.36286 35.42369 38.25584 40.30839
Another approach (this also isn't necessarily going to be faster than @davewy's):
U <- t(chol(V))  # chol(V) is upper-triangular R with V = t(R) %*% R, so transpose it to get V = U %*% t(U)
XU = (X %*% U)^2
rowSums(XU)
I recently found emulator::quad.diag(), which is just
colSums(crossprod(M, Conj(x)) * x)
This is slightly better than @davewy's solution (although the overall differences are less than I thought they would be anyway).
library(microbenchmark)
library(emulator)
microbenchmark(full   = diag(A %*% B %*% t(A)),
               davewy = rowSums(A * t(B %*% t(A))),
               emu    = quad.diag(A, B))
Unit: milliseconds
   expr      min       lq     mean   median       uq      max neval cld
   full 32.76241 35.49665 39.51683 37.63958 41.46561 57.41370   100   c
 davewy 22.74787 25.06874 28.42179 26.97330 29.68895 45.38188   100  b
    emu 17.68390 20.21322 23.59981 22.09324 24.80734 43.60953   100 a

Fast nonnegative quantile and Huber regression in R

I am looking for a fast way to do nonnegative quantile and Huber regression in R (i.e. with the constraint that all coefficients are >0). I tried using the CVXR package for quantile & Huber regression and the quantreg package for quantile regression, but CVXR is very slow and quantreg seems buggy when I use nonnegativity constraints. Does anybody know of a good and fast solution in R, e.g. using the Rcplex package or R gurobi API, thereby using the faster CPLEX or gurobi optimizers?
Note that I need to run a problem of the size below about 80 000 times, where I only need to update the y vector in each iteration but still use the same predictor matrix X. In that sense, it feels inefficient that in CVXR I now have to do obj <- sum(quant_loss(y - X %*% beta, tau=0.01)); prob <- Problem(Minimize(obj), constraints = list(beta >= 0)) within each iteration, when the problem is in fact staying the same and all I want to update is y. Any thoughts on how to do all this better/faster?
Minimal example:
## Generate problem data
n <- 7 # n predictor vars
m <- 518 # n cases
set.seed(1289)
beta_true <- 5 * matrix(stats::rnorm(n), nrow = n)+20
X <- matrix(stats::rnorm(m * n), nrow = m, ncol = n)
y_true <- X %*% beta_true
eps <- matrix(stats::rnorm(m), nrow = m)
y <- y_true + eps
Nonnegative quantile regression using CVXR:
## Solve nonnegative quantile regression problem using CVX
require(CVXR)
beta <- Variable(n)
quant_loss <- function(u, tau) { 0.5*abs(u) + (tau - 0.5)*u }
obj <- sum(quant_loss(y - X %*% beta, tau=0.01))
prob <- Problem(Minimize(obj), constraints = list(beta >= 0))
system.time(beta_cvx <- pmax(solve(prob, solver="SCS")$getValue(beta), 0)) # estimated coefficients; note that they can occasionally go slightly negative, so I had to clip them at 0
# 0.47s
cor(beta_true,beta_cvx) # correlation=0.99985, OK but very slow
Syntax for nonnegative Huber regression is the same but would use
M <- 1 ## Huber threshold
obj <- sum(CVXR::huber(y - X %*% beta, M))
Nonnegative quantile regression using the quantreg package:
### Solve nonnegative quantile regression problem using quantreg package with method="fnc"
require(quantreg)
R <- rbind(diag(n),-diag(n))
r <- c(rep(0,n),-rep(1E10,n)) # specify bounds of coefficients, I want them to be nonnegative, and 1E10 should ideally be Inf
system.time(beta_rq <- coef(rq(y~0+X, R=R, r=r, tau=0.5, method="fnc"))) # estimated coefficients
# 0.12s
cor(beta_true,beta_rq) # correlation=-0.477, no good, and even worse with tau=0.01...
To speed up CVXR, you can get the problem data once in the beginning, then modify it within a loop and pass it directly to the solver's R interface. The code for this is
prob_data <- get_problem_data(prob, solver = "SCS")
Then, parse out the arguments and pass them to scs from the scs library. (See Solver.solve in solver.R). You'll have to dig into the details of the canonicalization, but I expect if you're just changing y at each iteration, it should be a straightforward modification.
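A minimal sketch of that workflow (untested; the structure of prob_data depends on your CVXR version, so inspect it with str() before relying on any element names, and the scs() call below is only a placeholder, not a worked-out canonicalization):
library(CVXR)
library(scs)
# Canonicalize once, outside the loop
prob_data <- get_problem_data(prob, solver = "SCS")
str(prob_data, max.level = 1)   # locate the canonical data (A, b, c, cone dimensions, ...)
# Inside the 80 000-iteration loop, update only the parts that depend on y and
# call the solver directly, e.g. something like
#   sol <- scs::scs(A = ..., b = ..., obj = ..., cone = ...)
# instead of rebuilding and re-canonicalizing the CVXR Problem every time.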

In R, how does one extract the hat/projection/influence matrix or values from an `nls` model object?

For lm or glm type objects, or even lmer type objects, you can extract the hat values from the model by using the R function hatvalues(). However, this doesn't work with nls objects, apparently. I have Googled every which way, but I can't find a way to get these values. Does nls simply not create a hat matrix, or are the hat values produced from a non-linear least squares model just not reliable somehow?
Reproducible example:
xs = rep(1:10, times = 10)
ys = 3 + 2*exp(-0.5*xs)
for (i in 1:100) {
  xs[i] = rnorm(1, xs[i], 2)
}
df1 = data.frame(xs, ys)
nls1 = nls(ys ~ a + b*exp(d*xs), data=df1, start=c(a=3, b=2, d=-0.5))
There's a nice article (On the Outlier Detection in Nonlinear Regression) where the hat matrix is approximated by the gradient matrix computed at the estimated parameter values.
In your case:
# gradient of the model function at the current parameter values
V <- nls1$m$gradient()
# tangent plane leverage matrix (it plays a similar role as the Hat matrix)
H <- V %*% solve(t(V) %*% V) %*% t(V)
# 'hat' values for nls
nls1.hat_values <- diag(H)
And if you follow this article you can calculate H a little bit faster:
Q1 <- qr.Q(qr(V)) # V is the same matrix as above
H <- Q1 %*% t(Q1)
Since H can be quite big, if you only want the hat values you can skip the matrix multiplication altogether; we only need the diagonal of the H matrix.
###
#' Approximation of hat values for nls.
#'
#' @param model An 'nls' object
#' @param ... Additional parameters (ignored)
#' @return Vector of approximated hat values
###
hatvalues.nls <- function(model, ...) {
  stopifnot(is(model, 'nls'))
  list(...) # ignore additional parameters
  V <- model$m$gradient()
  Q1 <- qr.Q(qr(V))
  rowSums(Q1*Q1)
}
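Since hatvalues() is an S3 generic in the stats package, the method above is picked up automatically; a quick consistency check against the explicit computation from earlier (assuming the objects defined above) could be:
h <- hatvalues(nls1)           # dispatches to hatvalues.nls()
all.equal(h, nls1.hat_values)  # should agree up to numerical error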

Matrix computation with for loop

I am a newcomer to R; I migrated from GAUSS because of license verification issues.
I want to speed up the following code, which creates an n×k matrix A. Given the n×1 vector x and vectors of parameters mu and sig (both of them k-dimensional), A is created as A[i,j] = dnorm(x[i], mu[j], sig[j]). The following code works fine for small sizes such as n = 40, k = 4, but slows down significantly when n is around 10^6 and k is about the same size as n^{1/3}.
I am doing a simulation experiment to verify bootstrap validity, so I need to repeatedly compute the matrix A #ofsimulations × #bootstrap times, and it becomes a little time consuming as I want to experiment with many different values of n and k. I vectorized the code as much as I could (thanks to the vector arguments of dnorm), but can I ask for more speed up?
Preemptive thanks for any help.
x = rnorm(40)
mu = c(-1,0,4,5)
sig = c(2^2,0.5^2,2^2,3^2)
n = length(x)
k = length(mu)
A = matrix(NA,n,k)
for(j in 1:k){
  A[,j] = dnorm(x, mu[j], sig[j])
}
Your method can be put into a function like this
A.fill <- function(x, mu, sig) {
  k <- length(mu)
  n <- length(x)
  A <- matrix(NA, n, k)
  for(j in 1:k) A[,j] <- dnorm(x, mu[j], sig[j])
  A
}
and it's clear that you are filling the matrix A column by column.
R stores the entries of a matrix columnwise (just like Fortran).
This means that the matrix can be filled with a single call of dnorm using suitable repetitions of x, mu, and sig. The vector z below will have the columns of the desired matrix stacked, and then the matrix to be returned can be formed from that vector just by specifying the number of rows and columns. See the following function:
B.fill <- function(x, mu, sig) {
  k <- length(mu)
  n <- length(x)
  z <- dnorm(rep(x, times = k), rep(mu, each = n), rep(sig, each = n))
  B <- matrix(z, nrow = n, ncol = k)
  B
}
Let's make an example with your data and test this as follows:
N <- 40
set.seed(11)
x <- rnorm(N)
mu <- c(-1,0,4,5)
sig <- c(2^2,0.5^2,2^2,3^2)
A <- A.fill(x,mu,sig)
B <- B.fill(x,mu,sig)
all.equal(A,B)
# [1] TRUE
I'm assuming that n is an integer multiple of k.
Addition
As noted in the comments, B.fill is quite slow for large values of n. The reason lies in the construct rep(..., each = ...). So is there a way to speed up A.fill?
I tested this function:
C.fill <- function(x, mu, sig) {
  k <- length(mu)
  n <- length(x)
  sapply(1:k, function(j) dnorm(x, mu[j], sig[j]), simplify = TRUE)
}
This function is about 20% faster than A.fill.
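A closely related variant, not benchmarked here, is to use vapply() instead of sapply(): it fills the matrix column by column in the same way but fixes the result type and dimensions up front:
D.fill <- function(x, mu, sig) {
  n <- length(x)
  # vapply() binds the k calls to dnorm() as the columns of an n x k matrix
  vapply(seq_along(mu), function(j) dnorm(x, mu[j], sig[j]), numeric(n))
}
all.equal(A.fill(x, mu, sig), D.fill(x, mu, sig))  # should be TRUE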

PCA Feature selection using R

I am a biologist. The output of my experiment contains a large number of features, stored as a table with 563 rows and 8603 columns; the columns are the features, and 8603 of them is quite a lot.
So when I tried to do a PCA analysis in R, it gives "out of memory" errors.
I have also tried doing princomp in pieces, but it does not seem to work for our approach.
I tried using the script given in this link:
http://www.r-bloggers.com/introduction-to-feature-selection-for-bioinformaticians-using-r-correlation-matrix-filters-pca-backward-selection/
But still it does not work :(
I am trying to use the following code
bumpus <- read.table("http://www.ndsu.nodak.edu/ndsu/doetkott/introsas/rawdata/bumpus.html",
skip=20, nrows=49,
col.names=c("id","total","alar","head","humerus","sternum"))
boxplot(bumpus, main="Boxplot of Bumpus' data") ## in this step it is showing the ERROR
# we first standardize the data:
bumpus.scaled <- data.frame( apply(bumpus,2,scale) )
boxplot(bumpus.scaled, main="Boxplot of standardized Bumpus' data")
pca.res <- prcomp(bumpus.scaled, retx=TRUE)
pca.res
# note:
# PC.1 is some kind of average of all the measurements
# => measure of size of the bird
# PC.2 has a negative weight for 'sternum'
# and positive weights for 'alar', 'head' and 'humerus'
# => measure of shape of the bird
# first two principal components:
pca.res$x[,1:2]
plot(pca.res$x[,1:2], pch="", main="PC.1 and PC.2 for Bumpus' data (blue=survived, red=died)")
text(pca.res$x[,1:2], labels=c(1:49), col=c(rep("blue",21),rep("red",28)))
abline(v=0, lty=2)
abline(h=0, lty=2)
# compare to segment plot:
windows()
palette(rainbow(12, s = 0.6, v = 0.75))
stars(bumpus, labels=c(1:49), nrow=6, key.loc=c(20,-1),
main="Segment plot of Bumpus' data", draw.segment=TRUE)
# compare to biplot:
windows()
biplot(pca.res, scale=0)
# what do the arrows mean?
# consider the arrow for sternum:
abline(0, pca.res$rotation[5,2]/pca.res$rotation[5,1])
# consider the arrow for head:
abline(0, pca.res$rotation[3,2]/pca.res$rotation[3,1])
But the second line
boxplot(bumpus, main="Boxplot of Bumpus' data") ## shows an error
The error is
Error: cannot allocate vector of size 1.4 Mb
In addition: There were 27 warnings (use warnings() to see them)
Please help!
In cases where the number of features is either huge or exceeds the number of observations, it is well advised to calculate the principal components based on the transposed dataset. This is especially true in your case because the default implies calculation of an 8603 x 8603 covariance matrix which by itself already consumes about 500 MB of memory (oh well, this isn't too much, but hey...).
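(For reference, the memory footprint of such a dense matrix of doubles is roughly:)
8603^2 * 8 / 1e6   # ~592 MB for a dense 8603 x 8603 matrix of doubles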
Assuming that the rows of your matrix X correspond to observations and the columns correspond to features, center your data and then perform PCA on the transpose of the centered X. There won't be more eigenpairs than the number of observations anyway. Finally, multiply each resulting eigenvector by X^T. You do not need to do the latter for the eigenvalues (see way below for a detailed explanation).
What you want
This code demonstrates the implementation of PCA on the transposed dataset and compares the results of prcomp and the "transposed PCA":
pca.reduced <- function(X, center=TRUE, retX=TRUE) {
  # Note that the data must first be centered on the *original* dimensions
  # because the centering of the 'transposed covariance' is meaningless for
  # the dataset. This is also why Sigma must be computed dependent on N
  # instead of simply using cov().
  if (center) {
    mu <- colMeans(X)
    X <- sweep(X, 2, mu, `-`)
  }
  # From now on we're looking at the transpose of X:
  Xt <- t(X)
  aux <- svd(Xt)
  V <- Xt %*% aux$v
  # Normalize the columns of V.
  V <- apply(V, 2, function(x) x / sqrt(sum(x^2)))
  # Done.
  list(X = if (retX) X %*% V else NULL,
       V = V,
       sd = aux$d / sqrt(nrow(X)-1),
       mean = if (center) mu else NULL)
}
# Example data (low-dimensional, but sufficient for this example):
X <- cbind(rnorm(1000), rnorm(1000) * 5, rnorm(1000) * 3)
original <- prcomp(X, scale=FALSE)
transposed <- pca.reduced(X)
# See what happens:
> print(original$sdev)
[1] 4.6468136 2.9240382 0.9681769
> print(transposed$sd)
[1] 4.6468136 2.9240382 0.9681769
>
> print(original$rotation)
PC1 PC2 PC3
[1,] -0.0055505001 0.0067322416 0.999961934
[2,] -0.9999845292 -0.0004024287 -0.005547916
[3,] 0.0003650635 -0.9999772572 0.006734371
> print(transposed$V)
[,1] [,2] [,3]
[1,] 0.0055505001 0.0067322416 -0.999961934
[2,] 0.9999845292 -0.0004024287 0.005547916
[3,] -0.0003650635 -0.9999772572 -0.006734371
Details
To see why it is possible to work on the transposed matrix consider the
following:
The general form of the eigenvalue equation is
A x = λ x (1)
Without loss of generality, let M be a centered "copy" of your original
dataset X. Substitution of M^T M for A yields
M^T M x = λ x (2)
Multiplication of this equation by M yields
M M^T M x = λ M x (3)
Consequent substitution of y = M x yields
M M^T y = λ y (4)
One can already see that y corresponds to an eigenvector of the "covariance" matrix of the transposed dataset (note that M M^T is not really a covariance matrix, as the dataset X was centered along its columns and not its rows; also, scaling must be done by means of the number of samples (rows of M) and not the number of features (columns of M, resp. rows of M^T)).
It can also be seen that the eigenvalues are the same for M M^T and M^T M.
Finally, one last multiplication by M^T results in
(M^T M) M^T y = λ M^T y (5)
where M^T M is the original covariance matrix.
From equation (5) it follows that M^T y is an eigenvector of M^T M with
eigenvalue λ.
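A small numeric sanity check of that relationship (just a sketch with a random centered matrix; not part of the original answer):
set.seed(1)
M <- scale(matrix(rnorm(20 * 5), nrow = 20), center = TRUE, scale = FALSE)
e_MMt <- eigen(M %*% t(M))   # eigen-decomposition of M M^T (n x n)
e_MtM <- eigen(t(M) %*% M)   # eigen-decomposition of M^T M (p x p)
# The nonzero eigenvalues should agree up to numerical error
all.equal(e_MMt$values[1:5], e_MtM$values[1:5])
# And t(M) %*% y is, up to sign and scaling, an eigenvector of M^T M
y <- e_MMt$vectors[, 1]
v <- t(M) %*% y
v <- v / sqrt(sum(v^2))
all.equal(abs(drop(v)), abs(e_MtM$vectors[, 1]))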
