Can someone explain to me why the calculations become so much slower when I add arma::mat P(X * arma::inv(X.t() * X) * X.t()); to my code? The mean grew by a factor of 164 last time I benchmarked the code.
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
using namespace Rcpp;

// [[Rcpp::export]]
List test1(DataFrame data, Language formula, String y_name) {
    Function model_matrix("model.matrix");
    NumericMatrix x_rcpp = model_matrix(formula, data);
    NumericVector y_rcpp = data[y_name];
    arma::mat X(x_rcpp.begin(), x_rcpp.nrow(), x_rcpp.ncol());
    arma::colvec Y(y_rcpp.begin(), y_rcpp.size());
    arma::colvec coef = inv(X.t() * X) * X.t() * Y;
    arma::colvec resid = Y - X * coef;
    arma::colvec fitted = X * coef;
    DataFrame data_res = DataFrame::create(_["Resid"] = resid,
                                           _["Fitted"] = fitted);
    return List::create(_["Results"] = coef,
                        _["Data"] = data_res);
}

// [[Rcpp::export]]
List test2(DataFrame data, Language formula, String y_name) {
    Function model_matrix("model.matrix");
    NumericMatrix x_rcpp = model_matrix(formula, data);
    NumericVector y_rcpp = data[y_name];
    arma::mat X(x_rcpp.begin(), x_rcpp.nrow(), x_rcpp.ncol());
    arma::colvec Y(y_rcpp.begin(), y_rcpp.size());
    arma::colvec coef = inv(X.t() * X) * X.t() * Y;
    arma::colvec resid = Y - X * coef;
    arma::colvec fitted = X * coef;
    arma::mat P(X * arma::inv(X.t() * X) * X.t());
    DataFrame data_res = DataFrame::create(_["Resid"] = resid,
                                           _["Fitted"] = fitted);
    return List::create(_["Results"] = coef,
                        _["Data"] = data_res);
}
/*** R
data <- data.frame(Y = rnorm(10000), X1 = rnorm(10000), X2 = rnorm(10000), X3 = rnorm(10000))
microbenchmark::microbenchmark(test1(data, Y~X1+X2+X3, "Y"),
test2(data, Y~X1+X2+X3, "Y"), times = 10)
*/
Best regards,
Jakob
What you are doing is awfully close to fastLm() which I revised many times over the years. From that we can draw a few conclusions:
Don't compute X (X' X)^(-1) X' directly. Use solve().
Don't ever work off a formula object. Use a matrix and vector for X and y.
Here is a benchmark example illustrating how parsing the formula destroys all gains from the matrix algebra.
As an aside, R itself has pivoted operations for rank-deficient matrices. That helps with deformed matrices; in many "normal" cases you should be ok.
Great question. I'm not entirely sure what causes the slowdown beyond a few notes that I've made. So, be warned.
Consider the n being used here is 10000 with the p being 3.
Let's look at the operations requested. We'll start with the coef or beta_hat operation:
Beta_[p x 1] = (X^T_[p x n] * X_[n x p])^(-1) * X^T_[p x n] * Y_[n x 1]
Looking at the P or projection / hat matrix:
P_[n x n] = X_[n x p] * (X^T_[p x n] * X_[n x p])^(-1) * X^T_[p x n]
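So the P matrix here (n x n, i.e. 10000 x 10000) is substantially larger than the coefficient vector (p x 1). Forming it means multiplying an n x p matrix by a p x n matrix, which with the naive schoolbook multiplication takes on the order of n^2 * p operations and produces 10^8 doubles (about 800 MB). So, potentially, this can explain the large increase in time.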
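To put rough numbers on that, here is a small back-of-the-envelope sketch in plain C++. The helper names hat_matrix_bytes and matmul_flops are made up for this illustration, and the flop count assumes the schoolbook algorithm with one multiply and one add per term.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed to store a dense n x n matrix of doubles (8 bytes each).
std::uint64_t hat_matrix_bytes(std::uint64_t n) {
    assert(n < (1ULL << 30));  // keep n * n * 8 inside 64 bits
    return n * n * 8;
}

// Approximate flop count of a schoolbook (r x k) * (k x c) product:
// each of the r * c outputs needs k multiplies and k adds.
std::uint64_t matmul_flops(std::uint64_t r, std::uint64_t k, std::uint64_t c) {
    return 2 * r * k * c;
}
```

With n = 10000 and p = 3, hat_matrix_bytes(10000) gives 800000000 bytes for P alone, and matmul_flops(10000, 3, 10000) gives about 6 * 10^8 operations for the final product, compared with matmul_flops(3, 10000, 3) = 180000 for X^T X.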
Outside of that, there are repetitive calculations involving
(X^T_[p x n] * X_[n x p])^(-1) * X^T_[p x n]
within test2, so the same quantity is computed twice. The inverse is typically the most expensive part of that expression.
Also, regarding the use of inv the API entry indicates that:
if matrix A is known to be symmetric positive definite, using inv_sympd() is faster
if matrix A is known to be diagonal, use inv( diagmat(A) )
to solve a system of linear equations, such as Z = inv(X)*Y, using solve() is faster and more accurate
The third point is of particular interest in this case, as it gives a more optimized routine: inv(X.t() * X) * X.t() => solve(X.t() * X, X.t())
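To illustrate what "solve instead of invert" means, here is a minimal sketch in plain C++ with no Armadillo dependency. It forms only the p x p matrix X'X and the p x 1 vector X'y, then solves the system by Gaussian elimination with partial pivoting. The names Matrix, solve_linear and ols_coef, and the row-major std::vector layout, are assumptions made for this example; in Armadillo the corresponding call would be arma::solve(X.t() * X, X.t() * Y).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // row-major, illustration only

// Solve A b = c by Gaussian elimination with partial pivoting.
// A is p x p and c has length p; both are copied and overwritten locally.
std::vector<double> solve_linear(Matrix A, std::vector<double> c) {
    const std::size_t p = A.size();
    assert(c.size() == p);
    for (std::size_t col = 0; col < p; ++col) {
        // Pick the row with the largest pivot to limit round-off.
        std::size_t piv = col;
        for (std::size_t r = col + 1; r < p; ++r)
            if (std::fabs(A[r][col]) > std::fabs(A[piv][col])) piv = r;
        std::swap(A[col], A[piv]);
        std::swap(c[col], c[piv]);
        // Eliminate the entries below the pivot.
        for (std::size_t r = col + 1; r < p; ++r) {
            const double m = A[r][col] / A[col][col];
            for (std::size_t k = col; k < p; ++k) A[r][k] -= m * A[col][k];
            c[r] -= m * c[col];
        }
    }
    // Back substitution on the upper-triangular system.
    std::vector<double> b(p);
    for (std::size_t i = p; i-- > 0;) {
        double s = c[i];
        for (std::size_t k = i + 1; k < p; ++k) s -= A[i][k] * b[k];
        b[i] = s / A[i][i];
    }
    return b;
}

// Least-squares coefficients from the normal equations (X'X) b = X'y.
// Only p x p and p x 1 quantities are formed; no inverse, no n x n matrix.
std::vector<double> ols_coef(const Matrix& X, const std::vector<double>& y) {
    const std::size_t n = X.size(), p = X[0].size();
    assert(y.size() == n);
    Matrix XtX(p, std::vector<double>(p, 0.0));
    std::vector<double> Xty(p, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t a = 0; a < p; ++a) {
            Xty[a] += X[i][a] * y[i];
            for (std::size_t b = 0; b < p; ++b) XtX[a][b] += X[i][a] * X[i][b];
        }
    return solve_linear(XtX, Xty);
}
```

For example, with rows (1, 0), (1, 1), (1, 2) in X and y = (1, 3, 5), ols_coef should return an intercept of 1 and a slope of 2. The sketch does no rank checking, so a singular X'X would divide by zero.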
Related
I have a function which takes a vector as input and outputs a scalar and I want to apply this function to a number of observations. The data is structured in a matrix (rows are the number of observations and columns the variables) and the function is:
// [[Rcpp::export]]
double gaussianweight(arma::vec x, arma::mat H) {
    double c = std::pow(2 * arma::datum::pi, -0.5 * x.n_rows);
    double s = std::pow(arma::det(H), -1);
    arma::mat Hinv = arma::inv(H);
    return(c * s * std::exp(-0.5 * arma::dot(Hinv * x, Hinv * x)));
}
I want to apply it to every row vector of an arma::mat X. How would I do that efficiently? With a loop over the rows of X, or are there better solutions? I use R most of the time and have really gotten used to avoiding loops whenever possible. I tried the .each_row() operations but had no luck...
Say I have prices of a stock and I want to find the slope of the regression line in rolling manner with a given window size. How can I get it done in Julia? I want it to be really fast hence don't want to use a for loop.
You should not, in general, be worried about for loops in Julia, as they do not have the overhead of R or Python for loops. Thus, you only need to worry about asymptotic complexity and not the potentially large constant factor introduced by interpreter overhead.
Nevertheless, this operation can be done much more (asymptotically) efficiently with convolutions than with the naïve O(n²) slice-and-regress approach. The DSP.jl package provides convolution functionality. The following is an example with no intercept (it computes the rolling betas); support for an intercept should be possible by modifying the formulas.
using DSP
using LinearAlgebra # provides dot, used in the reference check
# Create some example x (signal) and y (stock prices)
# such that strength of signal goes up over time
const x = randn(100)
const y = (1:100) .* x .+ 100 .* randn(100)
# Create the rolling window
const window = Window.rect(20)
# Compute linear least squares estimate (X^T X)^-1 X^T Y
const xᵗx = conv(x .* x, window)[length(window):end-length(window)+1]
const xᵗy = conv(x .* y, window)[length(window):end-length(window)+1]
const lls = xᵗy ./ xᵗx # desired beta
# Check result against naïve for loop
const βref = [dot(x[i:i+19], y[i:i+19]) / dot(x[i:i+19], x[i:i+19]) for i = 1:81]
@assert isapprox(βref, lls)
Edit to add: To support an intercept, i.e. X = [x 1], so X^T X = [dot(x, x) sum(x); sum(x) w] where w is the window size, the formula for inverse of a 2D matrix can be used to get (X^T X)^-1 = [w -sum(x); -sum(x) dot(x, x)]/(w * dot(x, x) - sum(x)^2). Thus, [β, α] = [w dot(x, y) - sum(x) * sum(y), dot(x, x) * sum(y) - sum(x) * dot(x, y)] / (w * dot(x, x) - sum(x)^2). This can be translated to the following convolution code:
# Compute linear least squares estimate with intercept
const w = length(window)
const xᵗx = conv(x .* x, window)[w:end-w+1]
const xᵗy = conv(x .* y, window)[w:end-w+1]
const 𝟙ᵗx = conv(x, window)[w:end-w+1]
const 𝟙ᵗy = conv(y, window)[w:end-w+1]
const denom = w .* xᵗx - 𝟙ᵗx .^ 2
const α = (xᵗx .* 𝟙ᵗy .- 𝟙ᵗx .* xᵗy) ./ denom
const β = (w .* xᵗy .- 𝟙ᵗx .* 𝟙ᵗy) ./ denom
# Check vs. naive solution
const ref = vcat([([x[i:i+19] ones(20)] \ y[i:i+19])' for i = 1:81]...)
@assert isapprox([β α], ref)
Note that, for weighted least squares with a different window shape, some minor modifications will be needed to disentangle length(window) and sum(window) which are used interchangeably in the code above.
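For a rectangular window, the convolutions above are just moving sums, so the same [β, α] formulas can also be evaluated with running sums that are updated in O(1) per step. Below is a minimal sketch of that idea in plain C++ rather than Julia; the function name rolling_ols is hypothetical, and it assumes an unweighted window of length w.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Rolling regression y ~ alpha + beta * x over windows of length w, using
// running sums of x, y, x*x and x*y so that each step costs O(1).
// Returns one (beta, alpha) pair per window, n - w + 1 pairs in total.
std::vector<std::pair<double, double>>
rolling_ols(const std::vector<double>& x, const std::vector<double>& y,
            std::size_t w) {
    const std::size_t n = x.size();
    assert(y.size() == n && w >= 2 && w <= n);
    std::vector<std::pair<double, double>> out;
    out.reserve(n - w + 1);
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (std::size_t i = 0; i < n; ++i) {
        // Add the newest observation to the window sums.
        sx += x[i]; sy += y[i]; sxx += x[i] * x[i]; sxy += x[i] * y[i];
        if (i >= w) {
            // Drop the observation that just left the window.
            const std::size_t j = i - w;
            sx -= x[j]; sy -= y[j]; sxx -= x[j] * x[j]; sxy -= x[j] * y[j];
        }
        if (i + 1 >= w) {
            // Same closed form as the 2 x 2 inverse given above.
            const double denom = w * sxx - sx * sx;
            const double beta  = (w * sxy - sx * sy) / denom;
            const double alpha = (sxx * sy - sx * sxy) / denom;
            out.emplace_back(beta, alpha);
        }
    }
    return out;
}
```

Two caveats with this approach: the running sums accumulate round-off over long series, and denom is zero whenever x is constant within a window, so real code would need to guard against both.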
Since I don't need an x variable, I created a numeric series. Using the RollingFunctions package, I was able to get rolling regressions with the function below.
using RollingFunctions
function rolling_regression(price, windowsize)
    sum_x = sum(collect(1:windowsize))
    sum_x_squared = sum(collect(1:windowsize).^2)
    sum_xy = rolling(sum, price, windowsize, collect(1:windowsize))
    sum_y = rolling(sum, price, windowsize)
    b = ((windowsize*sum_xy) - (sum_x*sum_y))/(windowsize*sum_x_squared - sum_x^2)
    c = [repeat([missing], windowsize-1); b]
end
I have a latent variable model in which I produce a product term. The product term is the product of two latent variables whose scores are sampled. Currently, my model is sampling the product term. This has drastically increased the number of parameters in my model.
My original model was in non matrix formulation:
vector [N] mueta;
matrix [N ,2] xi ;
mueta = b1[1] +
b1[2]*xi[,1] +
b1[3]*xi[,2] +
b1[4]*(xi[,2].*xi[,1]) ;
I changed it to a matrix formulation where xi[,1] is an N-length vector of 1s (intercept), xi[,2:3] are factor scores, and xi[,4] is an interaction effect.
vector [N] mueta;
xi[,1] = rep_vector(1, N);
xi[,2:3] = zi * diag_pre_multiply(sigmaxi,L1)' ;
xi[,4] = (xi[,2].*xi[,3]);
mueta = xi * b1 ;
The first model does not sample the product of the xi matrix; the second formulation does. Is there a way for me to specify this in Stan so that xi[,4] is not sampled, and is just a generated value from the product of the sampled scores of the 2 factors?
I have to formulate this as an answer because I can't format code in a comment. I'd suggest declaring xi one size bigger and calculating this as
vector[N] mueta;
xi[ , 1] = rep_vector(1, N);
xi[ , 2:3] = zi * diag_pre_multiply(sigmaxi, L1)' ;
xi[ , 4] = xi[ , 2] .* xi[ , 3];
mueta = xi * b1;
If xi[ , 2] and xi[ , 3] are data, then you can also precompute their elementwise product. So this can be:
transformed data {
vector[N] intercept = rep_vector(1, N);
vector[N] xi2_3 = xi[ , 2] .* xi[ , 3];
...
vector[N] mueta
  = append_col(intercept,
               append_col(zi * diag_pre_multiply(sigmaxi, L1)',
                          xi2_3))
    * b1;
It'd be even better to reorganize the predictors so that you have append_col(intercept, xi2_3) defined as a transformed data variable.
It's probably possible to go further and just directly define the elements of mueta (mu_eta?) without first constructing a matrix.
It looks like I solved my own issue. I wanted to post this answer for others who may have a similar problem.
vector [N] mueta;
xi[,1] = rep_vector(1, N);
xi[,2:3] = zi * diag_pre_multiply(sigmaxi,L1)' ;
mueta = (append_col(xi,(xi[,2].*xi[,3])) * b1) ;
Let's say I have a program that calculates the value of the sine wave at time t. The sine wave is of the form sin(f*t + phi). Amplitude is 1.
If I only have one sin term all is fine. I can easily calculate the value at any time t.
But, at runtime, the wave form becomes modified when it combines with other waves. sin(f1 * t + phi1) + sin(f2 * t + phi2) + sin(f3 * t + phi3) + ...
The simplest solution is to have a table with columns for phi and f, iterate over all rows, and sum the results. But to me it feels that once I reach thousands of rows, the computation will become slow.
Is there a different way of doing this? Like combining all the sines into one statement/formula?
If you have a Fourier series (i.e. f_i = i f for some f) you can use the Clenshaw recurrence relation which is significantly faster than computing all the sines (but it might be slightly less accurate).
In your case you can consider the sequence:
f_k = exp( i ( k f t + phi_k) ) , where i is the imaginary unit.
Notice that Im(f_k) = sin( k f t + phi_k ), that is your sequence.
Also
f_k = exp( i ( k f t + phi_k) ) = exp( i k f t ) exp( i phi_k )
Hence you have a_k = exp(i phi_k). You can precompute these values and store them in an array. For simplicity from now on assume a_0 = 0.
Now, exp( i (k + 1) f t) = exp(i k f t) * exp(i f t), so alpha_k = exp(i f t) and beta_k = 0.
You can now apply the recurrence formula, in C++ you can do something like this:
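#include <complex>
#include <vector>
using namespace std;

// Evaluates the sum over k of a[k] * exp(i k f t), where a[k] == exp(i phi_k).
complex<double> clenshaw_fourier(double f, double t, const vector< complex<double> > & a )
{
    const complex<double> i(0.0, 1.0); // imaginary unit
    const complex<double> alpha = exp(f * t * i);
    complex<double> b = 0;
    for (int k = static_cast<int>(a.size()) - 1; k > 0; --k)
        b = a[k] + alpha * b;
    return a[0] + alpha * b;
}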
Assuming that a[k] == exp( i phi_k ).
The real part of the answer is the sum of cos(k f t + phi_k), while the imaginary part is the sum of sin(k f t + phi_k).
As you can see this only uses addition and multiplications, except for exp(f * t * i) that is only computed once.
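As a sanity check on the recurrence, here is a self-contained sketch that builds the coefficients from a list of phases and compares the result against the straightforward loop that calls sin() for every term. The function names make_coeffs, harmonic_sine_sum and harmonic_sine_sum_direct are made up for this example, and it assumes the harmonic case (frequencies f, 2f, 3f, ...) with a_0 = 0, so phi[k-1] is the phase of harmonic k.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Precompute a_k = exp(i * phi_k) once; phi[k-1] belongs to harmonic k.
std::vector<std::complex<double>> make_coeffs(const std::vector<double>& phi) {
    std::vector<std::complex<double>> a;
    a.reserve(phi.size());
    for (double p : phi) a.push_back(std::polar(1.0, p));
    return a;
}

// Sum of sin(k*f*t + phi_k) for k = 1..n using one complex exponential
// and n complex multiply-adds (Horner form of the recurrence).
double harmonic_sine_sum(double f, double t,
                         const std::vector<std::complex<double>>& a) {
    const std::complex<double> alpha = std::polar(1.0, f * t);  // exp(i f t)
    std::complex<double> b = 0.0;
    for (std::size_t k = a.size(); k > 0; --k)
        b = a[k - 1] + alpha * b;
    return (alpha * b).imag();  // a_0 = 0, so the total is alpha * b
}

// Reference implementation: one sin() call per term.
double harmonic_sine_sum_direct(double f, double t,
                                const std::vector<double>& phi) {
    double s = 0.0;
    for (std::size_t k = 1; k <= phi.size(); ++k)
        s += std::sin(k * f * t + phi[k - 1]);
    return s;
}
```

The coefficients only depend on the phases, so make_coeffs would be called once and the result reused for every t. For a few thousand terms the two functions should agree to within rounding error, though the recurrence can lose some accuracy as noted above.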
There are different bases (plural of basis) that can be advantageous (i.e. compact) for representing different waveforms. The most common and well-known one is that which you mention, called the Fourier basis usually. Daubechies wavelets for example are a relatively recent addition that cope with more discontinuous waveforms much better than a Fourier basis does. But this is really a math topic and probably if you post on Math Overflow you will get better answers.
I have an operation inside a tight loop in R that I need to optimize. It's updating the weights inside an IRLS algorithm by calculating the Schur product of a vector and a matrix. That is, it multiplies each element in the matrix by the corresponding row value in the vector, producing a result of the same dimensions as the matrix. In overly simplified schematic form, it looks like this:
reweight = function(iter, w, Q) {
for (i in 1:iter) {
wT = w * Q
}
}
In normal R code, a new matrix of dim() [rows,cols] is created on each iteration:
cols = 1000
rows = 1000000
w = runif(rows)
Q = matrix(1.0, rows, cols)
Rprofmem()
reweight(5, w, Q)
Rprofmem(NULL)
nate@ubuntu:~/R$ less Rprofmem.out
8000000040 :"reweight"
8000000040 :"reweight"
8000000040 :"reweight"
8000000040 :"reweight"
8000000040 :"reweight"
And if the matrix is large (multiple GB), the cost of the memory allocation exceeds the time spent on the numeric operation:
nate@ubuntu:~/R$ perf record -p `pgrep R` sleep 5 && perf report
49.93% R [kernel.kallsyms] [k] clear_page_c_e
47.67% R libR.so [.] real_binary
0.57% R [kernel.kallsyms] [k] get_page_from_freelist
0.35% R [kernel.kallsyms] [k] clear_huge_page
0.34% R libR.so [.] RunGenCollect
0.20% R [kernel.kallsyms] [k] clear_page
It also consumes a lot of memory:
USER PID VSZ RSS COMMAND
nate 17099 22.5GB 22.5GB /usr/local/lib/R/bin/exec/R --vanilla
If the matrix is smaller (several MB) but the number of iterations is larger, the memory usage is more reasonable, but at the cost of the garbage collector using more time than the numeric calculations:
cols = 100
rows = 10000
w = runif(rows)
Q = matrix(1.0, rows, cols)
reweight(1000, w, Q)
(note that this is a new process starting from scratch)
61.51% R libR.so [.] RunGenCollect
26.40% R libR.so [.] real_binary
7.94% R libR.so [.] SortNodes
2.79% R [kernel.kallsyms] [k] clear_page_c_e
USER PID VSZ RSS COMMAND
nate 17099 191MB 72MB /usr/local/lib/R/bin/exec/R --vanilla
If I write my own function with Rcpp that does the work in place, I can get the memory allocation that I want:
library(Rcpp)
cppFunction('
void weightMatrix(NumericVector w,
NumericMatrix Q,
NumericMatrix wQ) {
size_t numRows = Q.rows();
for (size_t row = 0; row < numRows; row++) {
wQ(row,_) = w(row) * Q(row,_);
}
return;
}
')
reweightCPP = function(iter, w, Q) {
# Initialize workspace to non-NA
wQ = matrix(1.0, nrow(Q), ncol(Q))
for (i in 1:iter) {
weightMatrix(w, Q, wQ)
}
}
cols = 100
rows = 10000
w = runif(rows)
Q = matrix(1.0, rows, cols)
wQ = matrix(NA, rows, cols)
Rprofmem()
reweightCPP(5, w, Q)
Rprofmem(NULL)
nate@ubuntu:~/R$ less Rprofmem.out
8000040 :"matrix" "reweightCPP"
2544 :"<Anonymous>" "weightMatrix" "reweightCPP"
2544 :"<Anonymous>" "weightMatrix" "reweightCPP"
2544 :"<Anonymous>" "weightMatrix" "reweightCPP"
2544 :"<Anonymous>" "weightMatrix" "reweightCPP"
2544 :"<Anonymous>" "weightMatrix" "reweightCPP"
(What's the 2544 bytes of allocation for? It seems to be an Rcpp constant. Is there any way I can avoid it?)
Performance is still suboptimal due to the Rcpp sugar:
76.53% R sourceCpp_82335.so [.] _Z12weightMatrixN4Rcpp6VectorILi14ENS_15PreserveStorageEEENS_6MatrixILi14ES1_EES4_
10.46% R libR.so [.] Rf_getAttrib
9.53% R libR.so [.] getAttrib0
2.06% R libR.so [.] Rf_isMatrix
0.42% R libR.so [.] INTEGER
But I can mostly fix that by resorting to lower level C++:
cppFunction('
void weightMatrix(NumericVector w_,
NumericMatrix Q_,
NumericMatrix wQ_) {
size_t numCols = Q_.ncol();
size_t numRows = Q_.nrow();
double * __restrict__ w = &w_[0];
double * __restrict__ Q = &Q_[0];
double * __restrict__ wQ = &wQ_[0];
for (size_t row = 0; row < numRows; row++) {
size_t colOffset = 0;
for (size_t col = 0; col < numCols; col++) {
wQ[colOffset + row] = w[row] * Q[colOffset + row];
colOffset += numRows;
}
}
return;
}
')
99.18% R sourceCpp_59392.so [.] sourceCpp_48203_weightMatrix
0.06% R libR.so [.] PutRNGstate
0.06% R libR.so [.] do_begin
0.06% R libR.so [.] Rf_eval
That said, I still haven't figured out how to get the compiler to reliably generate efficient assembly without resorting to SIMD intrinsics to force the use of VMULPD. Even with the ugly '__restrict__' attributes, in the form shown here it seems compelled to invert my loop order and do a lot of unnecessary work. But presumably I'll find the magic cross-compiler syntax eventually, or more likely, call out to a Fortran BLAS function.
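For comparison, here is a sketch of the same in-place operation with the loops written column-outer, row-inner. R matrices are stored column-major, so in this order the inner loop walks consecutive memory in both Q and wQ. The function name scale_rows and the raw-pointer interface are assumptions for this illustration, and whether a given compiler actually emits packed multiplies for it depends on the compiler and flags.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// wQ[row, col] = w[row] * Q[row, col] for column-major rows x cols matrices.
// The output buffer is supplied by the caller, so nothing is allocated here.
void scale_rows(const double* w, const double* Q, double* wQ,
                std::size_t rows, std::size_t cols) {
    assert(w != nullptr && Q != nullptr && wQ != nullptr);
    for (std::size_t col = 0; col < cols; ++col) {
        const double* q = Q + col * rows;   // start of this column in Q
        double* out = wQ + col * rows;      // start of this column in wQ
        // Unit-stride inner loop: consecutive rows are adjacent in memory.
        for (std::size_t row = 0; row < rows; ++row)
            out[row] = w[row] * q[row];
    }
}
```

From Rcpp, the three pointers could be taken the same way as in the function above (&w_[0], &Q_[0], &wQ_[0]).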
Which brings me to my questions:
Is there any way that I can get the performance I want without going to all this trouble? Failing that, is there any way that I can at least hide it behind the scenes so that the end user in R can use "wQ = w * Q" and have it magically reuse wQ instead of allocating and throwing away another giant matrix?
The BLAS wrappers in R seem to do a fairly good job for cases where the answer can be written into one of the operands (Q = w * Q), but I haven't found any way to do this when I need a "3rd party" workspace. Is there maybe some reasonable way to define a method for %=% that will convert "wQ = w * Q" to "op_mult(w, Q, wQ)"?
To preempt the question as to whether it matters: yes, I've measured, and it matters. The use case is an ensemble of cross-validated logistic regressions inside a loop handling large arrays of longitudinal data (http://cran.r-project.org/web/packages/ltmle/ltmle.pdf). It will be called millions (if not billions) of times per analysis. A good optimization of this function would help to get the runtime from "impossible" down to "days". A great optimization (or rather the combination of several such optimizations) might get it down to "hours" or even "minutes".
Edit: In the comments, Henrik correctly points out that the example loop has been simplified to the point that it simply repeats the same calculation multiple times. I hoped this would focus the issue, but perhaps it confuses it. In the real version, there will be more steps in the loop such that the 'w' in the 'w * Q' is different each iteration. Below is a poorly tested draft version of the actual functions. This one is a "semi-optimized" logistic regression in straight R based on O'Leary's QR Newton IRLS described by Bryan Lewis.
logistic_irls_qrnewton = function(A, y, maxIter=25, targetSSE=1e-16) {
# warn user below on first weight less than threshold
tinyWeightsFound = FALSE
tiny = sqrt(.Machine$double.eps)
# decompose A to QR (only once, Choleski done in loop)
QR = qr(A) # A[rows=samples, cols=covariates]
Q = qr.Q(QR) # Q[rows, cols] (same dimensions as A)
R = qr.R(QR) # R[cols, cols] (upper right triangular)
# copying now prevents copying each time y is used as argument
y = y + 0; # y[rows]
# first pass is outside loop since initial values are constant
iter = 1
t = (y - 0.5) * 4.0 # t[rows] = (y - m) * initial weight
C = chol(crossprod(Q, Q)) # C[cols, cols]
t = crossprod(Q,t)
s = forwardsolve(t(C), t) # s[cols]
s = backsolve(C, s)
t = Q %*% s
sse = crossprod(s) # sum of squared errors
print(as.vector(sse))
converged = ifelse(sse < targetSSE, 1, 0)
while (converged == 0 && iter < maxIter) {
iter = iter + 1
# only t is required as an input
dim(t) = NULL # matrix to vector to counteract crossprod
e = exp(t)
m = e / (e + 1) # mu = exp(eta) / (1 + exp(eta))
d = m / (e + 1) # mu.eta = exp(eta) / (1 + exp(eta))^2
w = d * d / (m - m^2) # W = mu.eta^2 / variance, variance = mu * (1 - mu)
if(tinyWeightsFound == FALSE && min(w) < tiny) {
print("Tiny weights found")
tinyWeightsFound = TRUE
}
t = crossprod(Q, w * (((y - m) / d) + t))
C = chol(crossprod(Q, w * Q))
n = forwardsolve(t(C), t)
n = backsolve(C, n)
t = Q %*% n
sse = crossprod(n - s) # divergence from previous
s = n # save divergence for difference from next
print(as.vector(sse))
if (sse < targetSSE) converged = iter
}
if (converged == 0) {
print(paste("Failed to converge after", iter, "iterations"))
print(paste("Final SSE was", sse))
} else {
print(paste("Convergence after iteration", iter))
}
coefficients = backsolve(R, crossprod(Q,t))
dim(coefficients) = NULL # return as a vector
coefficients
}
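Since the weights change on every iteration, the per-iteration update of m, d and w is another candidate for an allocation-free helper. Here is a minimal sketch of that step in C++, mirroring the three lines inside the while loop; the function name logistic_weights and the caller-supplied output buffers are assumptions for this example, not part of the draft above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// For each linear predictor t[i], compute the logistic mean m, its derivative
// d and the IRLS weight w, writing into buffers the caller already owns.
void logistic_weights(const std::vector<double>& t, std::vector<double>& m,
                      std::vector<double>& d, std::vector<double>& w) {
    const std::size_t n = t.size();
    assert(m.size() == n && d.size() == n && w.size() == n);
    for (std::size_t i = 0; i < n; ++i) {
        const double e = std::exp(t[i]);
        m[i] = e / (e + 1.0);                       // mu = exp(eta) / (1 + exp(eta))
        d[i] = m[i] / (e + 1.0);                    // mu.eta = exp(eta) / (1 + exp(eta))^2
        w[i] = d[i] * d[i] / (m[i] - m[i] * m[i]);  // mu.eta^2 / (mu * (1 - mu))
    }
}
```

One thing this makes visible: for the logit link, mu.eta is algebraically equal to mu * (1 - mu), so w and d should agree up to rounding and one of the two vectors could probably be dropped. Like the R draft, the sketch does not guard against exp() overflowing for very large t.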