I was reading the vignette for the rgen package, which provides headers for sampling from some common distributions. In the first paragraph, it says:
Please note, these samplers, just like the ones in armadillo cannot be used in parallelized code as the underlying generation routines rely upon R calls that are single-threaded.
This was news to me, and I've been using RcppArmadillo for quite some time now. I was wondering if someone could elaborate on this point (or provide references to where I can read about the issue). I'm especially interested in learning what "cannot be used" means here; will results be wrong, or will it just not parallelize?
These functions use R's random number generator, which must not be used in parallelized code, since that leads to undefined behavior. Undefined behavior can lead to virtually anything. From my point of view you are lucky if the program crashes, since this clearly tells you that something is going wrong.
The HPC task view lists some RNGs that are suitable for parallel computation. But you cannot use them easily with the distributions provided by rgen or RcppDist. Instead, one could do the following:
Copy the function for the multivariate normal distribution from rgen and adjust its signature so that it takes a std::function<double()> as the source of N(0, 1) distributed random numbers.
Use a fast RNG instead of R's RNG.
Use the same fast RNG in parallel mode.
In code as a quick hack:
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::depends(dqrng)]]
#include <xoshiro.h>
#include <dqrng_distribution.h>
// [[Rcpp::plugins(openmp)]]
#include <omp.h>
inline arma::mat rmvnorm(unsigned int n, const arma::vec& mu, const arma::mat& S,
                         std::function<double()> rnorm = norm_rand) {
    unsigned int ncols = S.n_cols;
    arma::mat Y(n, ncols);
    Y.imbue(rnorm);
    return arma::repmat(mu, 1, n).t() + Y * arma::chol(S);
}

// [[Rcpp::export]]
arma::mat defaultRNG(unsigned int n, const arma::vec& mu, const arma::mat& S) {
    return rmvnorm(n, mu, S);
}

// [[Rcpp::export]]
arma::mat serial(unsigned int n, const arma::vec& mu, const arma::mat& S) {
    dqrng::normal_distribution dist(0.0, 1.0);
    dqrng::xoshiro256plus rng(42);
    return rmvnorm(n, mu, S, [&](){ return dist(rng); });
}

// [[Rcpp::export]]
std::vector<arma::mat> parallel(unsigned int n, const arma::vec& mu, const arma::mat& S,
                                unsigned int ncores = 1) {
    dqrng::normal_distribution dist(0.0, 1.0);
    dqrng::xoshiro256plus rng(42);
    std::vector<arma::mat> res(ncores);
    #pragma omp parallel num_threads(ncores)
    {
        dqrng::xoshiro256plus lrng(rng);      // make thread-local copy of rng
        lrng.jump(omp_get_thread_num() + 1);  // advance rng by 1 ... ncores jumps
        res[omp_get_thread_num()] = rmvnorm(n, mu, S, [&](){ return dist(lrng); });
    }
    return res;
}
/*** R
set.seed(42)
N <- 1000000
M <- 100
mu <- rnorm(M)
S <- matrix(rnorm(M*M), M, M)
S <- S %*% t(S)
system.time(defaultRNG(N, mu, S))
system.time(serial(N, mu, S))
system.time(parallel(N/2, mu, S, 2))
*/
Result:
> system.time(defaultRNG(N, mu, S))
user system elapsed
6.984 1.380 6.881
> system.time(serial(N, mu, S))
user system elapsed
4.008 1.448 3.971
> system.time(parallel(N/2, mu, S, 2))
user system elapsed
4.824 2.096 3.080
Here the real performance improvement comes from using a faster RNG, which is understandable since the focus here is on generating many random numbers and not so much on matrix operations. If I shift more towards matrix operations by using N <- 100000 and M <- 1000 I get:
> system.time(defaultRNG(N, mu, S))
user system elapsed
16.740 1.768 9.725
> system.time(serial(N, mu, S))
user system elapsed
13.792 1.864 6.792
> system.time(parallel(N/2, mu, S, 2))
user system elapsed
14.112 3.900 5.859
Here we clearly see that in all cases user time is larger than elapsed time. The reason for this is the parallel BLAS implementation I am using (OpenBLAS). So there are quite a few factors to consider before deciding on a method.
I am trying to parallelise the addition of (large) vectors using RcppParallel. That's what I've come up with.
// [[Rcpp::depends(RcppParallel)]]
#include <RcppParallel.h>
#include <Rcpp.h>
#include <assert.h>
using namespace RcppParallel;
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector directVectorAddition(NumericVector first, NumericVector second) {
    assert(first.length() == second.length());
    NumericVector results(first.length());
    results = first + second;
    return results;
}

// [[Rcpp::export]]
NumericVector loopVectorAddition(NumericVector first, NumericVector second) {
    assert(first.length() == second.length());
    NumericVector results(first.length());
    for (unsigned i = 0; i != first.length(); i++)
        results[i] = first[i] + second[i];
    return results;
}

struct VectorAddition : public Worker
{
    const RVector<double> first, second;
    RVector<double> results;

    VectorAddition(const NumericVector one, const NumericVector two, NumericVector three)
        : first(one), second(two), results(three) {}

    void operator()(std::size_t a1, std::size_t a2) {
        std::transform(first.begin() + a1, first.begin() + a2,
                       second.begin() + a1,
                       results.begin() + a1,
                       [](double i, double j) { return i + j; });
    }
};

// [[Rcpp::export]]
NumericVector parallelVectorAddition(NumericVector first, NumericVector second) {
    assert(first.length() == second.length());
    NumericVector results(first.length());
    VectorAddition myVectorAddition(first, second, results);
    parallelFor(0, first.length(), myVectorAddition);
    return results;
}
It seems to work, but doesn't speed up things (at least not on a 4-core machine).
> v1 <- 1:1000000
> v2 <- 1000000:1
> all(directVectorAddition(v1, v2) == loopVectorAddition(v1, v2))
[1] TRUE
> all(directVectorAddition(v1, v2) == parallelVectorAddition(v1, v2))
[1] TRUE
> result <- benchmark(v1 + v2, directVectorAddition(v1, v2), loopVectorAddition(v1, v2), parallelVectorAddition(v1, v2), order="relative")
> result[,1:4]
test replications elapsed relative
1 v1 + v2 100 0.206 1.000
4 parallelVectorAddition(v1, v2) 100 0.993 4.820
2 directVectorAddition(v1, v2) 100 1.015 4.927
3 loopVectorAddition(v1, v2) 100 1.056 5.126
Can this be implemented more efficiently?
Thanks a lot in advance,
mce
Rookie mistake :) You define the arguments as Rcpp::NumericVector, but you create the data via the sequence operator. That operator creates integer values, so you are forcing a copy (with type conversion) onto all your functions!
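As a quick way to see what is going on, here is a minimal sketch (needsCoercion is a hypothetical helper, not part of the question's code) that reports the underlying SEXP type a vector arrives with:

#include <Rcpp.h>

// [[Rcpp::export]]
bool needsCoercion(SEXP x) {
    // TRUE for 1:1000000 (an INTSXP), FALSE for as.double(1:1000000) (a REALSXP);
    // a NumericVector argument must coerce -- i.e. copy -- anything that is not REALSXP.
    return TYPEOF(x) != REALSXP;
}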
Make it
v1 <- as.double(1:1000000)
v2 <- as.double(1000000:1)
instead, and on a machine with lots of cores (at work) I then see
R> result[,1:4]
test replications elapsed relative
4 parallelVectorAddition(v1, v2) 100 0.301 1.000
2 directVectorAddition(v1, v2) 100 0.424 1.409
1 v1 + v2 100 0.436 1.449
3 loopVectorAddition(v1, v2) 100 0.736 2.445
The example is still not that impressive because the relevant operation is "cheap", whereas the parallel approach needs to allocate memory, copy data to the workers, collect the results again, and so on.
But the good news is that you wrote your parallel code correctly. Not a small task.
Working with Rcpp and R I observed the following behaviour, which I do not understand at the moment. Consider the following simple function written in Rcpp
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericMatrix hadamard_product(NumericMatrix & X, NumericMatrix & Y) {
    unsigned int ncol = X.ncol();
    unsigned int nrow = X.nrow();
    int counter = 0;
    for (unsigned int j = 0; j < ncol; j++) {
        for (unsigned int i = 0; i < nrow; i++) {
            X[counter++] *= Y(i, j);
        }
    }
    return X;
}
This simply returns the component-wise product of two matrices. Now I know that the arguments to this function are passed by reference, i.e., calling
M <- matrix(rnorm(4), ncol = 2)
N <- matrix(rnorm(4), ncol = 2)
M_copy <- M
hadamard_product(M, N)
will overwrite the original M. However, it also overwrites M_copy, which I do not understand. I thought that M_copy <- M makes a copy of the object M and saves it somewhere in memory, rather than merely pointing M_copy at M. Copy semantics is also what I see when executing
x <- 1
y <- x
x <- 2
for example. This does not change y but only x.
So why does the behaviour above occur?
No, R does not make a copy immediately, only if it is necessary, i.e., copy-on-modify:
x <- 1
tracemem(x)
#[1] "<0000000009A57D78>"
y <- x
tracemem(x)
#[1] "<0000000009A57D78>"
x <- 2
tracemem(x)
#[1] "<00000000099E9900>"
Since you modify M by reference outside R, R can't know that a copy is necessary. If you want to ensure a copy is made, you can use data.table::copy. Or avoid the side effect in your C++ code, e.g., make a deep copy there (by using clone).
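For the latter option, a minimal sketch (hadamard_product_copy is a name made up for illustration) of the clone() approach could look like this:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix hadamard_product_copy(NumericMatrix X, NumericMatrix Y) {
    NumericMatrix Z = clone(X);   // deep copy, so M (and M_copy) in R stay untouched
    for (int j = 0; j < Z.ncol(); j++)
        for (int i = 0; i < Z.nrow(); i++)
            Z(i, j) *= Y(i, j);
    return Z;
}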
I have a document-term matrix of ~1600 documents x ~120 words. I would like to compute the cosine similarity between all these vectors, but we are talking about ~1,300,000 comparisons [n * (n - 1) / 2].
I used parallel::mclapply with 8 cores, but it still takes forever.
Which other solution do you suggest?
Thanks
Here's my take on it.
If I define cosine similarity as
coss <- function(x) {crossprod(x)/(sqrt(tcrossprod(colSums(x^2))))}
(I think that is about as fast as I can make it with base R functions, and the often-overlooked crossprod is a little gem.) If I compare it with an Rcpp function using RcppArmadillo (slightly updated as suggested by @f-privé):
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix cosine_similarity(NumericMatrix x) {
    arma::mat X(x.begin(), x.nrow(), x.ncol(), false);

    // Compute the crossprod
    arma::mat res = X.t() * X;
    int n = x.ncol();
    arma::vec diag(n);
    int i, j;
    for (i = 0; i < n; i++) {
        diag(i) = sqrt(res(i, i));
    }
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            res(i, j) /= diag(i) * diag(j);

    return wrap(res);
}
(this might possibly be optimised with some of the specialized functions in the armadillo library - just wanted to get some timing measurements).
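For instance, one such variant (just a sketch, not timed here; cosine_similarity_norm is a made-up name) uses Armadillo's normalise() to scale each column to unit norm, after which the crossproduct is directly the cosine similarity matrix:

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

// [[Rcpp::export]]
arma::mat cosine_similarity_norm(const arma::mat& X) {
    arma::mat Xn = arma::normalise(X, 2, 0);  // p = 2 norm, dim = 0: normalise columns
    return Xn.t() * Xn;                       // crossprod of unit-norm columns
}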
Comparing those yields
> XX <- matrix(rnorm(120*1600), ncol=1600)
> microbenchmark::microbenchmark(cosine_similarity(XX), coss(XX), coss2(XX), times=50)
> microbenchmark::microbenchmark(coss(x), coss2(x), cosine_similarity(x), cosine_similarity2(x), coss3(x), times=50)
Unit: milliseconds
                  expr      min       lq     mean   median       uq      max neval cld
               coss(x) 173.0975 183.0606 192.8333 187.6082 193.2885 331.9206    50   a
              coss2(x) 162.4193 171.3178 183.7533 178.8296 184.9762 319.7934    50   b
 cosine_similarity2(x) 169.6075 175.5601 191.4402 181.3405 186.4769 319.8792    50   a
which is really not that bad. The gain from computing the cosine similarity in C++ is very small (with @f-privé's solution being the fastest), so I'm guessing your timing issues are due to how you convert the text from words to numbers and not to the cosine-similarity calculation itself. Without knowing more about your specific code it is hard for us to help you.
I very much agree with @ekstroem on the use of crossprod, but I think there are unnecessary computations in his implementation. By the way, I think coss is giving a wrong result.
To compare his answer with mine, you can use this cpp file:
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericMatrix cosine_similarity(NumericMatrix x) {
    arma::mat X(x.begin(), x.nrow(), x.ncol(), false);
    arma::mat rowSums = sum(X % X, 0);
    arma::mat res;
    res = X.t() * X / sqrt(rowSums.t() * rowSums);
    return wrap(res);
}

// [[Rcpp::export]]
NumericMatrix& toCosine(NumericMatrix& mat,
                        const NumericVector& diag) {
    int n = mat.nrow();
    int i, j;
    for (j = 0; j < n; j++)
        for (i = 0; i < n; i++)
            mat(i, j) /= diag(i) * diag(j);
    return mat;
}
/*** R
coss <- function(x) {
  crossprod(x) / (sqrt(crossprod(x^2)))
}

coss2 <- function(x) {
  cross <- crossprod(x)
  toCosine(cross, sqrt(diag(cross)))
}

XX <- matrix(rnorm(120 * 1600), ncol = 1600)
microbenchmark::microbenchmark(
  cosine_similarity(XX),
  coss(XX),
  coss2(XX),
  times = 20
)
*/
Unit: milliseconds
                  expr      min       lq     mean   median       uq      max neval
 cosine_similarity(XX) 172.1943 176.4804 181.6294 181.6345 185.7542 199.0042    20
              coss(XX) 262.6167 270.9357 278.8999 274.4312 276.1176 337.0531    20
             coss2(XX) 134.6742 137.6013 147.3153 140.4783 146.5806 204.2115    20
So I will definitely go for computing the crossprod in base R and then do the scaling in Rcpp.
PS: If you have a very sparse matrix, you could use package Matrix to convert your matrix to a sparse matrix. This matrix class also has a crossprod method, so you could use coss2 as well.
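If one wanted to stay entirely on the C++ side for the sparse case instead, a rough sketch (an alternative route to the Matrix-package approach above; it assumes the input arrives as a Matrix::dgCMatrix, which RcppArmadillo maps to arma::sp_mat) could be:

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

// [[Rcpp::export]]
arma::mat cosine_similarity_sparse(const arma::sp_mat& X) {
    arma::sp_mat sp_cross = X.t() * X;   // sparse crossprod
    arma::mat cross(sp_cross);           // densify; the similarity matrix is usually dense anyway
    arma::vec d = cross.diag();
    d = arma::sqrt(d);
    cross.each_col() /= d;               // divide row i by d(i)
    cross.each_row() /= d.t();           // divide column j by d(j)
    return cross;
}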
The coop package's coop::cosine function is probably the best way to do this now. It is implemented in Rcpp, but also has a different approach than lsa::cosine, and also has lower memory overhead. Its use is exactly the same as lsa::cosine, just switch out the package names.
For further speedups, you may want to change your BLAS library. The coop manual has a few basic details and suggestions.
I'm computing polynomial modular inverses in a ring built over the n-th cyclotomic polynomial, with n a power of 2. These inverses are computed for polynomials whose coefficient of x^0 equals (x+1) and whose other coefficients are sampled from some random distribution but always equal 0 or x, for some integer x.
I'm able to use NTL's InvMod to compute the inverse of several polynomials; however, for big instances of the kind described it takes just forever to return. I compiled NTL 9.10.0 with GMP 6.1.1. Is there any optimization I can use in NTL to perform these operations faster?
This is a minimal example:
#include <NTL/ZZ_p.h>
#include <NTL/ZZ_pEX.h>
#include <ctime>
NTL_CLIENT
int main(){
    int degree = 4095;
    int nphi = 8192;
    int nq = 127;

    // Sets q as the mersenne prime 2^nq - 1
    ZZ q = NTL::power2_ZZ(nq) - 1;
    ZZ_p::init(q);

    // Sets phi as the n-th cyclotomic polynomial
    ZZ_pX ring;
    NTL::SetCoeff(ring, 0, 1);
    NTL::SetCoeff(ring, nphi/2, 1);
    ZZ_pE::init(ring);

    ZZ_pEX ntl_phi;
    NTL::SetCoeff(ntl_phi, 0, conv<ZZ_p>(1));
    NTL::SetCoeff(ntl_phi, nphi/2, conv<ZZ_p>(1));

    // Initializes f
    std::srand(std::time(0)); // use current time as seed for random generator
    ZZ_pEX f;
    NTL::SetCoeff(f, 0, conv<ZZ_p>(1025));
    for (int i = 1; i <= degree; i++) {
        int random_variable = std::rand();
        NTL::SetCoeff(f, i, conv<ZZ_p>(1024 * (random_variable % 2 == 1)));
    }

    // Computes the inverse of f
    ZZ_pEX fInv = NTL::InvMod(f, ntl_phi);

    return 0;
}
I am looking through the GSL functions for a way to calculate Z*Z^T, where Z is an n*1 column vector, but I could not find a fitting function. Any help is much appreciated.
GSL supports BLAS (Basic Linear Algebra Subprograms); see http://www.gnu.org/software/gsl/manual/html_node/GSL-BLAS-Interface.html.
The functions are classified by the complexity of the operation:
level 1: vector-vector operations
level 2: matrix-vector operations
level 3: matrix-matrix operations
Most functions come in different versions for float, double and complex numbers. Your operation is basically an outer product of the vector Z with itself.
You can initialize the vector as a column vector (here double precision numbers):
gsl_matrix * Z = gsl_matrix_calloc (n,1);
and then use the BLAS function gsl_blas_dgemm to compute
Z * Z^T. The first arguments of this function determine whether or not the input matrices should be transposed before the matrix multiplication:
gsl_blas_dgemm (CblasNoTrans, CblasTrans, 1.0, Z, Z, 0.0, C);
Here's a working test program (you may need to link it against gsl and blas):
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_blas.h>
int main(int argc, char ** argv)
{
    size_t n = 4;
    gsl_matrix * Z = gsl_matrix_calloc(n, 1);
    gsl_matrix * C = gsl_matrix_calloc(n, n);

    gsl_matrix_set(Z, 0, 0, 1);
    gsl_matrix_set(Z, 1, 0, 2);
    gsl_matrix_set(Z, 2, 0, 0);
    gsl_matrix_set(Z, 3, 0, 1);

    gsl_blas_dgemm(CblasNoTrans, CblasTrans, 1.0, Z, Z, 0.0, C);

    int i, j;
    for (i = 0; i < n; i++)
    {
        for (j = 0; j < n; j++)
        {
            printf("%g\t", gsl_matrix_get(C, i, j));
        }
        printf("\n");
    }

    gsl_matrix_free(Z);
    gsl_matrix_free(C);
    return 0;
}
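For the example vector Z = (1, 2, 0, 1)^T used above, the outer product Z*Z^T worked out by hand is the 4x4 matrix the program should print:
1   2   0   1
2   4   0   2
0   0   0   0
1   2   0   1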