To accelerate my package, which includes a lot of matrix computation, I used Rcpp to rewrite all the code. However, some functions are now even slower than before. I used microbenchmark to analyze this and found that the matrix multiplication in Rcpp is slower.
Why does this happen?
And how can I accelerate my package? Thanks a lot.
The Rcpp code is as follows:
#include <Rcpp.h>
#include <numeric>   // for std::inner_product
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix mmult(const NumericMatrix& a, const NumericMatrix& b) {
  if (a.ncol() != b.nrow()) stop("Incompatible matrix dimensions");
  NumericMatrix out(a.nrow(), b.ncol());
  NumericVector rm1, cm2;
  for (int i = 0; i < a.nrow(); ++i) {
    rm1 = a(i, _);
    for (int j = 0; j < b.ncol(); ++j) {
      cm2 = b(_, j);
      out(i, j) = std::inner_product(rm1.begin(), rm1.end(), cm2.begin(), 0.0);
    }
  }
  return out;
}
The R code is as follows:
library(microbenchmark)

X <- matrix(rnorm(10 * 10, 1), 10, 10)
Y <- matrix(rnorm(10 * 10, 1), 10, 10)
microbenchmark(
  mmult(X, Y),
  X %*% Y)
The result is:
Unit: microseconds
expr min lq mean median uq max neval
mmult(X, Y) 45.720 48.9860 126.79228 50.385 51.785 6368.512 100
X %*% Y 5.599 8.8645 12.85787 9.798 10.730 153.486 100
This is the opposite of the result seen for matrix-vector multiplication, but it is expected. Here R hands all the heavy work to BLAS, which may even run in parallel. By using your naive matrix multiplication you are throwing away all the optimized memory management done in the BLAS library.
Instead of trying to reinvent the low-level stuff like matrix multiplication, you could try to implement larger parts of your code using something like RcppArmadillo, which uses the same BLAS library as R but also (not only!) offers a convenient syntax on top of that.
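For reference, here is a minimal RcppArmadillo sketch of the same function (the name mmult_arma and the exact signature are mine): Armadillo's operator* dispatches to the same BLAS dgemm routine that backs R's %*%, so the heavy lifting stays in optimized code.
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]

// [[Rcpp::export]]
arma::mat mmult_arma(const arma::mat& a, const arma::mat& b) {
  if (a.n_cols != b.n_rows) Rcpp::stop("Incompatible matrix dimensions");
  return a * b;  // delegates the multiplication to BLAS, like R's %*%
}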
I have a slow R function that I am converting to RcppArmadillo. I am very new to Rcpp and RcppArmadillo. I managed to code up something that works, but it seems clunky and not as fast as it probably could be. Mainly, I am curious how to perform matrix-vector element-wise multiplication or division on an intermediate result without needing to assign it to a new object first. For example, here is my RcppArmadillo function, which works:
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
using namespace arma;

// [[Rcpp::export]]
arma::mat TTD(arma::mat X, arma::vec d, arma::vec taubar) {
  arma::vec top = d % sqrt(taubar);
  arma::mat cubed = pow(X, 3) * 4 * datum::pi;
  arma::mat out = repmat(top, 1, X.n_cols) / sqrt(cubed.each_col() % d);
  return out;
}
But can I do something like (and I know this is wrong) pow(X,3).n_cols() % d * 4 * datum::pi? And how can I divide the vector top by the matrix without using repmat or assigning the product of d % sqrt(taubar) to a new object first? I don't want to change any of my inputs in place, because I need them later in R for other functions.
Below is the R code which tests the function:
library(RcppArmadillo)
library(Rcpp)
sourceCpp("TTD.cpp")
d <- c(53.638145, 9.617256, 1.450767)
ttau <- matrix(c(36.405117983, 1.707159588, 0.000010000, 36.406568363, 1.707759972, 0.001141475,
36.408022369, 1.708361856, 0.002275780, 36.409480010, 1.708965246, 0.003412920),
nrow = 3, ncol = 4)
tau_bar <- c(66.48201, 9.90116, 6.08173)
TTD(X = ttau, d = d, taubar = tau_bar)
My actual matrices and vectors will be much larger, thus the need for the speed increase.
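For what it's worth, one way to drop the repmat (a sketch under an invented name, TTD2, untested against the data above) is to build the result in a local matrix and scale it in place with each_col(); the inputs are never modified because every operation happens on the local copy:
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]

// [[Rcpp::export]]
arma::mat TTD2(const arma::mat& X, const arma::vec& d, const arma::vec& taubar) {
  arma::mat out = 4.0 * arma::datum::pi * arma::pow(X, 3);  // 4*pi*X^3
  out.each_col() %= d;                       // times d, down each column
  out = 1.0 / arma::sqrt(out);               // reciprocal of the square root
  out.each_col() %= d % arma::sqrt(taubar);  // times the numerator vector
  return out;
}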
I am trying to write a function r(d, n) in Rcpp. The function returns n random draws from the normal distribution N(0, d). The function should be well defined, so it should return the same draws whenever d and n keep the same values.
This wouldn't be a problem if d were restricted to be an integer, in which case I could set the seed and do the job:
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace arma;

// set seed
// [[Rcpp::export]]
void set_seed(unsigned int seed) {
  Rcpp::Environment base_env("package:base");
  Rcpp::Function set_seed_r = base_env["set.seed"];
  set_seed_r(seed);
}

// function r(d, n)
// [[Rcpp::export]]
vec randdraw(int d, int n) {
  set_seed(d);
  vec out = randn(n);
  return out;
}
But clearly I don't want to restrict d to be an integer; ideally d should be a double. Any thoughts? Thank you!
The issue, I think, is that you are trying to rescale the randn offered by Armadillo, which is restricted to the standard normal N(0,1), so that it matches N(0, d). There are two ways to go about this since it is a standard normal.
Option 1: Using Statistical Properties
The first way involves just multiplying the sample by the square root of d, e.g. sqrt(d)*sample. This is possible due to the random variable properties of variance and expectation giving sqrt(d)*N(0, 1) ~ N(0, sqrt(d)^2) ~ N(0, d).
One of the more important things to note here is that the set_seed() function will work because RcppArmadillo's configuration of Armadillo hooks into R's RNG library, accessing the ::Rf_runif function to generate random values. The only area of concern is that you cannot use arma::arma_rng::set_seed() to set the seed, due to limitations of the R/C++ interaction detailed in Section 6.3 of Writing R Extensions. If you do use it, you are warned with:
When called from R, the RNG seed has to be set at the R level via set.seed()
on the first detected call.
With this being said, here is a short code example where we multiply by sqrt(d).
Code:
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]

// set seed
// [[Rcpp::export]]
void set_seed(double seed) {
  Rcpp::Environment base_env("package:base");
  Rcpp::Function set_seed_r = base_env["set.seed"];
  set_seed_r(std::floor(std::fabs(seed)));
}

// function r(d, n)
// [[Rcpp::export]]
arma::vec randdraw(double d, int n) {
  set_seed(d); // Set a seed for R's RNG library
  // Call Armadillo's RNG procedure that references R's RNG capabilities
  // and change the dispersion slightly.
  arma::vec out = std::sqrt(std::fabs(d)) * arma::randn(n);
  return out;
}
Output:
> randdraw(3.5, 5L)
[,1]
[1,] -0.8671559
[2,] -1.9507540
[3,] 2.9025090
[4,] -1.2953745
[5,] 2.0799176
Note: There is no direct equivalent as the rnorm procedure differs from the arma::randn generation.
Option 2: Rely upon R's RNG Functions
The second, and significantly better, solution is to explicitly rely upon R's RNG functions. Previously, we made implicit use of R's RNG library due to RcppArmadillo's configuration. I tend to prefer this approach, as you have already made the assumption that the code is specific to R by using the set_seed() function (disclaimer: I wrote the post). If you are worried about the restriction of d being an integer, a slight coercion from double to int is possible with std::floor(std::fabs(seed)). Once the values are generated using either Rcpp::r*() or R::r*(), an Armadillo vector is created using an advanced ctor that reuses the existing memory allocation. Note that the third argument of Rcpp::rnorm() is the standard deviation, just as in R's rnorm(); pass std::sqrt(d) instead of d if d is meant to be a variance as in Option 1.
Code:
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]

// set seed
// [[Rcpp::export]]
void set_seed(double seed) {
  Rcpp::Environment base_env("package:base");
  Rcpp::Function set_seed_r = base_env["set.seed"];
  set_seed_r(std::floor(std::fabs(seed)));
}

// function r(d, n)
// [[Rcpp::export]]
arma::vec randdraw(double d, int n) {
  set_seed(d); // Set a seed for R's RNG library
  Rcpp::NumericVector draws = Rcpp::rnorm(n, 0.0, d); // Hook into R's library
  // Use Armadillo's advanced ctor to reuse memory and cast as an Armadillo object.
  arma::vec out = arma::vec(draws.begin(), n, false, true);
  return out;
}
Output:
> randdraw(3.21,10)
[,1]
[1,] -3.08780627
[2,] -0.93900757
[3,] 0.83071017
[4,] -3.69834335
[5,] 0.62846287
[6,] 0.09669786
[7,] 0.27419092
[8,] 3.58431878
[9,] -3.91253230
[10,] 4.06825360
> set.seed(3)
> rnorm(10, 0, 3.21)
[1] -3.08780627 -0.93900757 0.83071017 -3.69834335 0.62846287 0.09669786 0.27419092 3.58431878 -3.91253230 4.06825360
[EDIT: I understand that it is faster partly because the function is written in C, but I want to know whether it does a brute-force search over all the training instances or something more sophisticated.]
I'm implementing the KNN algorithm in R for study purposes.
I'm also checking the code's correctness by comparing it with the caret implementation.
The problem lies in the execution time of the two versions. My version seems to take a lot of time, whereas the caret implementation is very fast (even with 10-fold cross-validation).
Why? I'm calculating every Euclidean distance between my test instances and the training ones, which means NxM distance calculations (where N is the number of test instances and M the number of training instances):
for (i in 1:nrow(test)) {
  distances <- c()
  classes <- c()
  for (j in 1:nrow(training)) {
    d <- calculateDistance(test[i, ], training[j, ])
    distances <- c(distances, d)
    classes <- c(classes, training[j, ][[15]])
  }
}
Is the caret implementation using some approximate search, or an exact search, for example with a kd-tree? How can I speed up the search? I have 14 features for this problem, but I've read that kd-trees are suggested for problems with 1 to 5 features.
EDIT:
I've found the C function called by R (VR_knn), which is pretty complex for me to understand; maybe someone can help.
Anyway, I've quickly written a brute-force search in C++, which seems to be faster than my previous R version (but not as fast as the caret C version):
#include <Rcpp.h>
#include <cmath>
using namespace Rcpp;

// Euclidean distance, skipping the last column (the class label)
double distance(NumericVector x1, NumericVector x2) {
  int vectorLen = x1.size();
  double sum = 0;
  for (int i = 0; i < vectorLen - 1; i++) {
    sum += std::pow(x1(i) - x2(i), 2);
  }
  return std::sqrt(sum);
}

// [[Rcpp::export]]
void searchCpp(NumericMatrix training, NumericMatrix test) {
  int numRowTr = training.rows();
  int numRowTe = test.rows();
  for (int i = 0; i < numRowTe; i++) {
    NumericVector test_i = test.row(i);
    // one distance per training row, not per test row
    NumericVector distances(numRowTr);
    for (int j = 0; j < numRowTr; j++) {
      NumericVector train_j = training.row(j);
      distances[j] = distance(test_i, train_j);  // assign, don't insert
    }
  }
}
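For completeness, here is a sketch (my own addition, not from the original post) of the step the function above still lacks: picking the k nearest training rows once the distances are filled in. std::partial_sort on an index vector avoids sorting the full distance set:
#include <Rcpp.h>
#include <algorithm>
#include <numeric>
using namespace Rcpp;

// [[Rcpp::export]]
IntegerVector kNearest(NumericVector distances, int k) {
  int n = distances.size();
  std::vector<int> idx(n);
  std::iota(idx.begin(), idx.end(), 0);  // 0, 1, ..., n-1
  // Move the indices of the k smallest distances to the front
  std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                    [&](int a, int b) { return distances[a] < distances[b]; });
  IntegerVector res(idx.begin(), idx.begin() + k);
  return res + 1;  // 1-based indices for use from R
}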
I tried using the eigen solver of the Eigen library in R to improve performance:
#include <RcppEigen.h>
// [[Rcpp::depends(RcppEigen)]]
using Eigen::Map;
using Eigen::MatrixXd;
using Eigen::SelfAdjointEigenSolver;

// [[Rcpp::export]]
MatrixXd Eigen4(const Map<MatrixXd> bM) {
  SelfAdjointEigenSolver<MatrixXd> es(bM);
  return es.eigenvectors();
}
Yet, when comparing on a 2000x2000 matrix:
n <- 5e3
m <- 2e3
b <- crossprod(matrix(rnorm(n*m), n))
print(system.time(test <- Eigen4(b))) # 18 sec
print(system.time(test2 <- eigen(b, symmetric = TRUE))) # 8.5 sec
The microbenchmark result:
Unit: seconds
expr min lq mean median uq max neval
Eigen4(b) 18.614694 18.687407 19.136380 18.952063 19.292021 20.812116 10
eigen(b, symmetric = TRUE) 8.652628 8.663302 8.696543 8.676914 8.718517 8.831664 10
Is R twice as fast as Eigen?
I'm using the latest versions of R and RcppEigen.
Am I doing something wrong?
R's eigen is an interface to Fortran functions from LAPACK. Eigen uses its generic C++ code by default, although it can be configured to use external BLAS/LAPACK backends for certain dense matrix operations, including eigendecomposition. Depending on your architecture and compilers, R's default LAPACK may well be faster. If you configure both R and Eigen to use the same highly optimized platform-specific BLAS/LAPACK (e.g. MKL on Intel) you should get virtually identical (and better) results.
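For illustration, Eigen documents two opt-in macros, EIGEN_USE_BLAS and EIGEN_USE_LAPACKE, that route supported dense operations (including SelfAdjointEigenSolver) to an external backend. A sketch of how that might look with RcppEigen follows; the build setup is an assumption on my part, since you must also link against BLAS/LAPACKE yourself (e.g. via PKG_LIBS), which is platform-specific:
// Define before including Eigen headers so supported kernels are rerouted.
#define EIGEN_USE_BLAS
#define EIGEN_USE_LAPACKE
#include <RcppEigen.h>
// [[Rcpp::depends(RcppEigen)]]
using Eigen::Map;
using Eigen::MatrixXd;
using Eigen::SelfAdjointEigenSolver;

// [[Rcpp::export]]
MatrixXd Eigen4_lapack(const Map<MatrixXd> bM) {
  SelfAdjointEigenSolver<MatrixXd> es(bM);  // now backed by LAPACK's syev family
  return es.eigenvectors();
}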
I have written what I believe to be a semi-quick ols-regression function
ols32 <- function(y, x, Ridge = 1.1) {
  xrd <- crossprod(x)
  xry <- crossprod(x, y)
  diag(xrd) <- Ridge * diag(xrd)
  solve(xrd, xry)
}
Now I want to apply this to the following:
vapply(1:la, function(J)
  ME %*% ols32(nza[, J], cbind(nzaf1[, J], nzaf2[, J], nza[, -J], MOMF))[(la + 2):(2 * la + 1)],
  FUN.VALUE = 0)
where nza, nzaf1, nzaf2 and MOMF are 500x50 matrices, la = 50, and ME is a vector of length 50.
So what I actually do is run a regression but use only the beta coefficients on MOMF, which I multiply by ME.
nza.mat <- matrix(rnorm(500 * 200), ncol = 200)
nza <- nza.mat[, 1:50]
nzaf2 <- nza.mat[, 101:150]
MOMF <- nza.mat[, 151:200]
nzaf1 <- nza.mat[, 51:100]
ME <- nza.mat[500, 151:200]
Is there an immediate way of making things faster, or do I need to use something like RcppEigen?
Thanks, P
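For reference, in case the C++ route is taken: a minimal RcppArmadillo sketch of ols32 (my own translation, not benchmarked) could look like this; arma::solve on the ridge-adjusted cross-product mirrors the R version above.
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]

// [[Rcpp::export]]
arma::vec ols32_arma(const arma::vec& y, const arma::mat& x, double ridge = 1.1) {
  arma::mat xrd = x.t() * x;   // crossprod(x)
  arma::vec xry = x.t() * y;   // crossprod(x, y)
  xrd.diag() *= ridge;         // scale the diagonal, as in the R code
  return arma::solve(xrd, xry);
}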
So I came up with a slightly faster way of solving this by rewriting my OLS function so that it calculates the two crossproducts only once for the whole matrix. The new function looks like this:
ols.quick <- function(y, x, ME, Ridge = 1.1) {
  la <- dim(y)[2]
  XX.cross <- crossprod(x)
  XY.cross <- crossprod(x, y)
  diag(XX.cross) <- Ridge * diag(XX.cross)
  betas <- sapply(1:la, function(J) {
    idx <- c((1:la)[-J], la + J, 2 * la + J, (3 * la + 1):(4 * la))
    solve(XX.cross[idx, idx], XY.cross[idx, J])
  }, simplify = TRUE)
  ME %*% betas[(la + 2):(2 * la + 1), ]
}
where
y=nza (500x50) and x=cbind(nza,nzaf1,nzaf2,MOMF) (500x200)
This solves the problem about 3.5 times faster.
microbenchmark(ols.quick(nza, nza.mat, ME),
               vapply(1:la, function(J)
                 ME %*% ols32(nza[, J], cbind(nzaf1[, J], nzaf2[, J], nza[, -J], MOMF))[(la + 2):(2 * la + 1)],
                 FUN.VALUE = 0))
expr                 min        lq    median        uq       max neval
ols.quick(...)  66.30495  67.71903  68.57001  70.17742  77.80069   100
vapply(...)    251.59395 255.43306 258.35041 262.85742 296.51313   100
I suppose I could gain some speed with parLapply from the parallel package, but I haven't looked into that yet.