Fast sampling from Truncated Normal Distribution using Rcpp and openMP

UPDATE:
I tried to implement Dirk's suggestions. Comments?
I am busy right now at JSM, but I'd like to get some feedback before knitting an Rmd for the gallery.
I switched back from Armadillo to normal Rcpp, as it didn't add any value.
Scalar versions with R:: are quite nice.
I should maybe put in a parameter n for the number of draws if mean/sd are entered as scalar, not as vectors of the desired output length.
There are lots of MCMC applications that require drawing samples from truncated Normal distributions. I built on an existing implementation of the TN and added parallel computation to it.
Issues:
Does anyone see further potential speed improvements? In the last case from the benchmark, rtruncnorm is sometimes faster. The Rcpp implementation is always faster than existing packages, but can it be improved even further?
I ran it within a complex model I can't share, and my R session crashed. However, I cannot systematically reproduce it, so it could have been another part of the code. If someone is working with the TN, please test it and let me know. Update: I haven't had issues with the updated code, but let me know.
How I put things together:
To my knowledge, the fastest implementation is not on CRAN, but the source code can be downloaded from OSU stat. Competing implementations in msm and truncnorm were slower in my benchmarks. The trick is to efficiently adjust proposal distributions, where the Exponential works nicely for the tails of the truncated Normal.
So I took Chris' code, "Rcpp'ed" it and added some openMP spice to it. The dynamic schedule is optimal here, as sampling can take more or less time depending on the boundaries.
One thing I found nasty: lots of the statistical distributions are based on the NumericVector type, when I wanted to work with doubles. I just coded my way around that.
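To make the Exponential-proposal idea concrete, here is a minimal sketch in plain Python (not the Rcpp code itself). It assumes a lower bound a > 0 and mirrors the logic of exp_rs: propose z from an Exponential with rate a, keep it only if it falls in (0, b - a), and accept with probability exp(-z^2/2). The name exp_rejection_sample and the seed are illustrative choices.

```python
import math
import random

def exp_rejection_sample(a, b, rng):
    """Draw one N(0,1) variate restricted to (a, b), assuming a > 0."""
    while True:
        # Proposal: Exponential with rate a, kept only if it lands in (0, b - a)
        z = rng.expovariate(a)
        if z > b - a:
            continue
        # Accept with probability exp(-z^2 / 2)
        u = 1.0 - rng.random()  # lies in (0, 1], so log(u) is always defined
        if math.log(u) <= -0.5 * z * z:
            return a + z

rng = random.Random(42)
draws = [exp_rejection_sample(2.0, 100.0, rng) for _ in range(10000)]
print(min(draws), sum(draws) / len(draws))
```

The sample mean should land near 2.37, the mean of a standard Normal truncated to values above 2.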
Here's the Rcpp code:
#include <Rcpp.h>
#include <omp.h>
// norm_rs(a, b)
// generates a sample from a N(0,1) RV restricted to be in the interval
// (a,b) via rejection sampling.
// ======================================================================
// [[Rcpp::export]]
double norm_rs(double a, double b)
{
double x;
x = Rf_rnorm(0.0, 1.0);
while( (x < a) || (x > b) ) x = norm_rand();
return x;
}
// half_norm_rs(a, b)
// generates a sample from a N(0,1) RV restricted to the interval
// (a,b) (with a > 0) using half normal rejection sampling.
// ======================================================================
// [[Rcpp::export]]
double half_norm_rs(double a, double b)
{
double x;
x = fabs(norm_rand());
while( (x<a) || (x>b) ) x = fabs(norm_rand());
return x;
}
// unif_rs(a, b)
// generates a sample from a N(0,1) RV restricted to the interval
// (a,b) using uniform rejection sampling.
// ======================================================================
// [[Rcpp::export]]
double unif_rs(double a, double b)
{
double xstar, logphixstar, x, logu;
// Find the argmax (b is always >= 0)
// This works because we want to sample from N(0,1)
if(a <= 0.0) xstar = 0.0;
else xstar = a;
logphixstar = R::dnorm(xstar, 0.0, 1.0, 1.0);
x = R::runif(a, b);
logu = log(R::runif(0.0, 1.0));
while( logu > (R::dnorm(x, 0.0, 1.0,1.0) - logphixstar))
{
x = R::runif(a, b);
logu = log(R::runif(0.0, 1.0));
}
return x;
}
// exp_rs(a, b)
// generates a sample from a N(0,1) RV restricted to the interval
// (a,b) using exponential rejection sampling.
// ======================================================================
// [[Rcpp::export]]
double exp_rs(double a, double b)
{
double z, u, rate;
// Rprintf("in exp_rs");
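// R::rexp() takes the scale (mean) as its argument, so a scale of 1/a
// gives the Exponential proposal with rate a
rate = 1/a;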
// Generate a proposal on (0, b-a)
z = R::rexp(rate);
while(z > (b-a)) z = R::rexp(rate);
u = R::runif(0.0, 1.0);
while( log(u) > (-0.5*z*z))
{
z = R::rexp(rate);
while(z > (b-a)) z = R::rexp(rate);
u = R::runif(0.0,1.0);
}
return(z+a);
}
// rnorm_trunc( mu, sigma, lower, upper)
//
// generates one random normal RV with mean 'mu' and standard
// deviation 'sigma', truncated to the interval (lower,upper), where
// lower can be -Inf and upper can be Inf.
//======================================================================
// [[Rcpp::export]]
double rnorm_trunc (double mu, double sigma, double lower, double upper)
{
int change;
double a, b;
double logt1 = log(0.150), logt2 = log(2.18), t3 = 0.725;
double z, tmp, lograt;
change = 0;
a = (lower - mu)/sigma;
b = (upper - mu)/sigma;
// First scenario
if( (a == R_NegInf) || (b == R_PosInf))
{
if(a == R_NegInf)
{
change = 1;
a = -b;
b = R_PosInf;
}
// The two possibilities for this scenario
if(a <= 0.45) z = norm_rs(a, b);
else z = exp_rs(a, b);
if(change) z = -z;
}
// Second scenario
else if((a * b) <= 0.0)
{
// The two possibilities for this scenario
if((R::dnorm(a, 0.0, 1.0,1.0) <= logt1) || (R::dnorm(b, 0.0, 1.0, 1.0) <= logt1))
{
z = norm_rs(a, b);
}
else z = unif_rs(a,b);
}
// Third scenario
else
{
if(b < 0)
{
tmp = b; b = -a; a = -tmp; change = 1;
}
lograt = R::dnorm(a, 0.0, 1.0, 1.0) - R::dnorm(b, 0.0, 1.0, 1.0);
if(lograt <= logt2) z = unif_rs(a,b);
else if((lograt > logt1) && (a < t3)) z = half_norm_rs(a,b);
else z = exp_rs(a,b);
if(change) z = -z;
}
double output;
output = sigma*z + mu;
return (output);
}
// rtnm( mus, sigmas, lower, upper, cores)
//
// generates a vector of random normal RVs with means 'mus' and standard
// deviations 'sigmas', truncated to the intervals (lower,upper), where
// lower can be -Inf and upper can be Inf.
// mus, sigmas, lower, upper are vectors, and vectorized calls of this function
// speed up computation
// cores is an integer, representing the number of cores to be used in parallel
//======================================================================
// [[Rcpp::export]]
Rcpp::NumericVector rtnm(Rcpp::NumericVector mus, Rcpp::NumericVector sigmas, Rcpp::NumericVector lower, Rcpp::NumericVector upper, int cores){
omp_set_num_threads(cores);
int nobs = mus.size();
Rcpp::NumericVector out(nobs);
double logt1 = log(0.150), logt2 = log(2.18), t3 = 0.725;
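// Caution: R's RNG functions are not thread-safe, so calling them from several threads at once is risky
#pragma omp parallel for schedule(dynamic)
for(int i=0;i<nobs;i++) {
// declared inside the loop so that every thread works on its own copies
double z, tmp, lograt;
double a = (lower(i) - mus(i))/sigmas(i);
double b = (upper(i) - mus(i))/sigmas(i);
int change = 0;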
// First scenario
if( (a == R_NegInf) || (b == R_PosInf))
{
if(a == R_NegInf)
{
change = 1;
a = -b;
b = R_PosInf;
}
// The two possibilities for this scenario
if(a <= 0.45) z = norm_rs(a, b);
else z = exp_rs(a, b);
if(change) z = -z;
}
// Second scenario
else if((a * b) <= 0.0)
{
// The two possibilities for this scenario
if((R::dnorm(a, 0.0, 1.0,1.0) <= logt1) || (R::dnorm(b, 0.0, 1.0, 1.0) <= logt1))
{
z = norm_rs(a, b);
}
else z = unif_rs(a,b);
}
// Third scenario
else
{
if(b < 0)
{
tmp = b; b = -a; a = -tmp; change = 1;
}
lograt = R::dnorm(a, 0.0, 1.0, 1.0) - R::dnorm(b, 0.0, 1.0, 1.0);
if(lograt <= logt2) z = unif_rs(a,b);
else if((lograt > logt1) && (a < t3)) z = half_norm_rs(a,b);
else z = exp_rs(a,b);
if(change) z = -z;
}
out(i)=sigmas(i)*z + mus(i);
}
return(out);
}
And here is the benchmark:
libs=c("truncnorm","msm","inline","Rcpp","RcppArmadillo","rbenchmark")
if( sum(!(libs %in% .packages(all.available = TRUE)))>0){ install.packages(libs[!(libs %in% .packages(all.available = TRUE))])}
for(i in 1:length(libs)) {library(libs[i],character.only = TRUE,quietly=TRUE)}
#needed for openMP parallel
Sys.setenv("PKG_CXXFLAGS"="-fopenmp")
Sys.setenv("PKG_LIBS"="-fopenmp")
#no of cores for openMP version
cores = 4
#source code from same dir
Rcpp::sourceCpp('truncnorm.cpp')
#sample size
nn=1000000
bb= 100
aa=-100
benchmark( rtnm(rep(0,nn),rep(1,nn),rep(aa,nn),rep(100,nn),cores), rtnm(rep(0,nn),rep(1,nn),rep(aa,nn),rep(100,nn),1),rtnorm(nn,rep(0,nn),rep(1,nn),rep(aa,nn),rep(100,nn)),rtruncnorm(nn, a=aa, b=100, mean = 0, sd = 1) , order="relative", replications=3 )[,1:4]
aa=0
benchmark( rtnm(rep(0,nn),rep(1,nn),rep(aa,nn),rep(100,nn),cores), rtnm(rep(0,nn),rep(1,nn),rep(aa,nn),rep(100,nn),1),rtnorm(nn,rep(0,nn),rep(1,nn),rep(aa,nn),rep(100,nn)),rtruncnorm(nn, a=aa, b=100, mean = 0, sd = 1) , order="relative", replications=3 )[,1:4]
aa=2
benchmark( rtnm(rep(0,nn),rep(1,nn),rep(aa,nn),rep(100,nn),cores), rtnm(rep(0,nn),rep(1,nn),rep(aa,nn),rep(100,nn),1),rtnorm(nn,rep(0,nn),rep(1,nn),rep(aa,nn),rep(100,nn)),rtruncnorm(nn, a=aa, b=100, mean = 0, sd = 1) , order="relative", replications=3 )[,1:4]
aa=50
benchmark( rtnm(rep(0,nn),rep(1,nn),rep(aa,nn),rep(100,nn),cores), rtnm(rep(0,nn),rep(1,nn),rep(aa,nn),rep(100,nn),1),rtnorm(nn,rep(0,nn),rep(1,nn),rep(aa,nn),rep(100,nn)),rtruncnorm(nn, a=aa, b=100, mean = 0, sd = 1) , order="relative", replications=3 )[,1:4]
Several benchmark runs are necessary as the speed depends on the upper/lower boundaries. For different cases, different parts of the algorithm kick in.

Really quick comments:
if you include RcppArmadillo.h you do not need to include Rcpp.h -- in fact, you should not and we even test that
rep(oneDraw, n) makes n calls. I would write a function to be called once that returns you n draws -- it will be faster as you save yourself n-1 function call overheads
Your comment that "lots of the statistical distributions are based on the NumericVector type, when I wanted to work with doubles" may reveal some misunderstanding: NumericVector is our convenient proxy class for internal R types: no copies. You are free to use std::vector<double> or whichever form you prefer.
I know little about truncated normals so I cannot comment on the specifics of your algorithms.
Once you have it worked out consider a post for the Rcpp Gallery.

Related

Multiple multivariate normal density values in R and Rcpp

I have a question concerning a fast implementation. Imagine that you have a matrix Ys in which each row refers to a vector of observed values stemming from a multivariate normal distribution, e.g.,
Ys = matrix(c(1.0,1.0,1.0,0.0,0.5,0.6,0.1,0.1,0.3), nrow = 3, ncol = 3)
Furthermore, there is a matrix Sigs in which each row refers to the diagonal elements of the variance covariance matrix for each of the outcome vectors in Ys, e.g.,
Sigs = matrix(c(1.0,0.5,0.1,0.2,0.3,0.4,0.3,0.7,0.8), nrow = 3, ncol = 3)
What I want to do is to compute the density value of each row in Ys given the diagonal elements in the respective row in Sigs.
One could use a for-loop in R, e.g.
colSigs = ncol(Sigs)
res = rep(0,3)
means = rep(0,colSigs)
for (i in 1:nrow(Ys) ) {
sigma = diag(Sigs[i,],colSigs)
res[i] = mvtnorm::dmvnorm(Ys[i,],means,sigma)
}
however, in my case Ys and Sigs contain about 100,000 rows. So I wrote an Rcpp-function that is considerably faster. Nevertheless, I was wondering whether there is a fancy trick (a more efficient way) so that I do not have to do looping? Any ideas are welcome.
----
EDIT: I was asked to add the Rcpp functions. Here you go:
This function computes the quadratic form appearing in the multivariate normal density:
double dmvnorm_distance( arma::rowvec y, arma::mat Sigma )
{
int n = Sigma.n_rows;
double res=0;
double fac=1;
for (int ii=0; ii<n; ii++){
for (int jj=ii; jj<n; jj++){
if (ii==jj){ fac = 1; } else { fac = 2;}
res += fac *y(0,ii) * Sigma(ii,jj) * y(0,jj);
}
}
return res;
}
This function computes the log-density value:
double dmvnorm_rcpp( arma::rowvec y, arma::mat Sigma )
{
int p = Sigma.n_rows;
// inverse Sigma
arma::mat Sigma1 = arma::inv(Sigma);
// determinant Sigma
double det_Sigma = arma::det(Sigma);
// distance
double dist = dmvnorm_distance( y, Sigma1);
double pi1 = 3.14159265358979;
double l1 = - p * std::log(2*pi1) - dist - std::log( det_Sigma );
double ll = 0.5 * l1;
return ll;
}
and this function contains the for-loop and is called from R:
Rcpp::NumericVector mvnorm_loop( arma::mat Ys, arma::mat SIGs )
{
int n = Ys.n_rows;
Rcpp::NumericVector out(n);
for (int ii=0; ii<n; ii++){
// get yi and diagonal entries
arma::rowvec yi = Ys.row(ii);
arma::rowvec si = SIGs.row(ii);;
// make Sigma
arma::mat Sigma = arma::diagmat(si);
// compute likelihood value
out[ii] = dmvnorm_rcpp( yi, Sigma );
}
return out;
}
So basically the question is whether there is an alternative way to implement the insertion in Rcpp to make the whole thing even faster.
----
Best,
Stefan
PS: I also used apply in R and it is slower than the Rcpp loop-function.
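Because every covariance matrix here is diagonal, the inverse and determinant reduce to element-wise operations: the log-density of a row is -0.5 * (p*log(2*pi) + sum(log(sigma)) + sum(y^2/sigma)). Here is a minimal sketch of that simplification in plain Python, assuming zero means as in the R loop. The function name diag_mvn_logpdf is made up for illustration, and the rows below are the question's Ys and Sigs written out row by row.

```python
import math

def diag_mvn_logpdf(y, variances):
    """Log-density of N(0, diag(variances)) at y, with no matrix inverse or determinant."""
    p = len(y)
    log_det = sum(math.log(v) for v in variances)
    quad = sum(yi * yi / v for yi, v in zip(y, variances))
    return -0.5 * (p * math.log(2.0 * math.pi) + log_det + quad)

# Rows of Ys and Sigs from the question (R fills matrices column by column)
ys = [[1.0, 0.0, 0.1], [1.0, 0.5, 0.1], [1.0, 0.6, 0.3]]
sigs = [[1.0, 0.2, 0.3], [0.5, 0.3, 0.7], [0.1, 0.4, 0.8]]
log_dens = [diag_mvn_logpdf(y, s) for y, s in zip(ys, sigs)]
print(log_dens)
```

Taking exp() of each entry gives the density itself. The same formula works on whole matrices at once (row sums of element-wise expressions), which is what removes the explicit loop.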

Rcpp implementation of mvtnorm::pmvnorm slower than original R function

I am trying to get a Rcpp version of pmvnorm to work at least as fast as mvtnorm::pmvnorm in R.
I have found https://github.com/zhanxw/libMvtnorm and created a Rcpp skeleton package with the relevant source files. I have added the following functions which make use of Armadillo (since I'm using it across other code I've been writing).
//[[Rcpp::export]]
arma::vec triangl(const arma::mat& X){
arma::mat LL = arma::trimatl(X, -1); // omit the main diagonal
return LL.elem(arma::find(LL != 0));
}
//[[Rcpp::export]]
double pmvnorm_cpp(arma::vec& bound, arma::vec& lowtrivec){
double error;
int n = bound.n_elem;
double* boundptr = bound.memptr();
double* lowtrivecptr = lowtrivec.memptr();
double result = pmvnorm_P(n, boundptr, lowtrivecptr, &error);
return result;
}
From R after building the package, this is a reproducible example:
set.seed(1)
covar <- rWishart(1, 10, diag(5))[,,1]
sds <- diag(covar) ^-.5
corrmat <- diag(sds) %*% covar %*% diag(sds)
triang <- triangl(corrmat)
bounds <- c(0.5, 0.9, 1, 4, -1)
rbenchmark::benchmark(pmvnorm_cpp(bounds, triang),
mvtnorm::pmvnorm(upper=bounds, corr = corrmat),
replications=1000)
This shows that pmvnorm_cpp is much slower than mvtnorm::pmvnorm, and the result is different:
> pmvnorm_cpp(bounds, triang)
[1] 0.04300643
> mvtnorm::pmvnorm(upper=bounds, corr = corrmat)
[1] 0.04895361
which puzzles me because I thought the base Fortran code was the same. Is there something in my code that makes everything go slow? Or should I try to port the mvtnorm::pmvnorm code directly? I have literally no experience with Fortran.
Suggestions appreciated, excuse my incompetence otherwise.
EDIT: to make a quick comparison with an alternative, this:
//[[Rcpp::export]]
NumericVector pmvnorm_cpp(NumericVector bound, NumericMatrix cormat){
Environment stats("package:mvtnorm");
Function f = stats["pmvnorm"];
NumericVector lower(bound.length(), R_NegInf);
NumericVector mean(bound.length());
NumericVector res = f(lower, bound, mean, cormat);
return res;
}
has essentially the same performance as an R call (the following on a 40-dimensional mvnormal):
> rbenchmark::benchmark(pmvnorm_cpp(bounds, corrmat),
+ mvtnorm::pmvnorm(upper=bounds, corr = corrmat),
+ replications=100)
test replications elapsed relative user.self sys.self
2 mvtnorm::pmvnorm(upper = bounds, corr = corrmat) 100 16.86 1.032 16.60 0.00
1 pmvnorm_cpp(bounds, corrmat) 100 16.34 1.000 16.26 0.01
So it seems to me there must be something going on in the previous code, either with how I'm handling things with Armadillo or with how the other things are connected. I would assume that there should be a performance gain compared to this last implementation.
Instead of trying to use an additional library for this, I would try to use the C API exported by mvtnorm, c.f. https://github.com/cran/mvtnorm/blob/master/inst/NEWS#L44-L48. While doing so, I found three reasons why the results differ. One of them is also responsible for the performance difference:
mvtnorm uses R's RNG, while this has been removed from the library you are using, c.f. https://github.com/zhanxw/libMvtnorm/blob/master/libMvtnorm/randomF77.c.
Your triangl function is incorrect. It returns the lower triangular matrix in column-major order. However, the underlying fortran code expects it in row-major order, c.f. https://github.com/cran/mvtnorm/blob/master/src/mvt.f#L36-L39 and https://github.com/zhanxw/libMvtnorm/blob/master/libMvtnorm/mvtnorm.cpp#L60
libMvtnorm uses 1e-6 instead of 1e-3 as relative precision, c.f. https://github.com/zhanxw/libMvtnorm/blob/master/libMvtnorm/mvtnorm.cpp#L65. This is also responsible for the performance difference.
We can test this using the following code:
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
// [[Rcpp::depends(mvtnorm)]]
#include <mvtnormAPI.h>
//[[Rcpp::export]]
arma::vec triangl(const arma::mat& X){
int n = X.n_cols;
arma::vec res(n * (n-1) / 2);
for (int i = 0; i < n; ++i) {
for (int j = 0; j < i; ++j) {
res(j + i * (i-1) / 2) = X(i, j);
}
}
return res;
}
// [[Rcpp::export]]
double pmvnorm_cpp(arma::vec& bound,
arma::vec& lowertrivec,
double abseps = 1e-3){
int n = bound.n_elem;
int nu = 0;
int maxpts = 25000; // default in mvtnorm: 25000
double releps = 0; // default in mvtnorm: 0
int rnd = 1; // Get/PutRNGstate
double* bound_ = bound.memptr();
double* correlationMatrix = lowertrivec.memptr();
double* lower = new double[n];
int* infin = new int[n];
double* delta = new double[n];
for (int i = 0; i < n; ++i) {
infin[i] = 0; // (-inf, bound]
lower[i] = 0.0;
delta[i] = 0.0;
}
// return values
double error;
double value;
int inform;
mvtnorm_C_mvtdst(&n, &nu, lower, bound_,
infin, correlationMatrix, delta,
&maxpts, &abseps, &releps,
&error, &value, &inform, &rnd);
delete[] (lower);
delete[] (infin);
delete[] (delta);
return value;
}
/*** R
set.seed(1)
covar <- rWishart(1, 10, diag(5))[,,1]
sds <- diag(covar) ^-.5
corrmat <- diag(sds) %*% covar %*% diag(sds)
triang <- triangl(corrmat)
bounds <- c(0.5, 0.9, 1, 4, -1)
set.seed(1)
system.time(cat(mvtnorm::pmvnorm(upper=bounds, corr = corrmat), "\n"))
set.seed(1)
system.time(cat(pmvnorm_cpp(bounds, triang, 1e-6), "\n"))
set.seed(1)
system.time(cat(pmvnorm_cpp(bounds, triang, 0.001), "\n"))
*/
Results:
> system.time(cat(mvtnorm::pmvnorm(upper=bounds, corr = corrmat), "\n"))
0.04896221
user system elapsed
0.000 0.003 0.003
> system.time(cat(pmvnorm_cpp(bounds, triang, 1e-6), "\n"))
0.04895756
user system elapsed
0.035 0.000 0.035
> system.time(cat(pmvnorm_cpp(bounds, triang, 0.001), "\n"))
0.04896221
user system elapsed
0.004 0.000 0.004
With the same RNG (and RNG state), the correct lower triangular correlation matrix and the same relative precision, results are identical and performance is comparable. With higher precision, performance suffers.
All this is for a stand-alone file using Rcpp::sourceCpp. In order to use this in a package, you need to add LinkingTo: mvtnorm to your DESCRIPTION file.
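The ordering issue with the lower triangle is easy to see on a small labelled matrix. The following Python sketch is an illustration only (the function names are made up, not part of mvtnorm or libMvtnorm). It packs the strict lower triangle both ways and shows that row-major packing puts entry (i, j) at position j + i*(i-1)/2, the index used in the corrected triangl.

```python
def pack_lower_row_major(mat):
    """Strict lower triangle, row by row: (1,0), (2,0), (2,1), (3,0), ..."""
    n = len(mat)
    return [mat[i][j] for i in range(n) for j in range(i)]

def pack_lower_col_major(mat):
    """Strict lower triangle, column by column: (1,0), (2,0), (3,0), (2,1), ..."""
    n = len(mat)
    return [mat[i][j] for j in range(n) for i in range(j + 1, n)]

# Entry (i, j) is labelled 10*i + j so the ordering is easy to read off
m = [[10 * i + j for j in range(4)] for i in range(4)]
row_major = pack_lower_row_major(m)
col_major = pack_lower_col_major(m)
print(row_major)  # [10, 20, 21, 30, 31, 32]
print(col_major)  # [10, 20, 30, 21, 31, 32]
```

The two orderings agree for matrices up to 3 x 3 and diverge from 4 x 4 onwards, so the mistake stays hidden in very small tests.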

Euclidean distance matrix performance between two shapes

The problem I am having is that I have to calculate a Euclidean distance matrix between shapes that can range from 20,000 up to 60,000 points, which produces 10-20 GB of data. I have to run each of these calculations thousands of times, so 20 GB x 7,000 (each calculation is a different point cloud). The shapes can be either 2D or 3D.
EDITED (Updated questions)
Is there a more efficient way to calculate the forward and backward distances without using two separate nested loops?
I know I could save the data matrix and calculate the minimum
distances in each direction, but then there is a huge memory issue
with large point clouds.
Is there a way to speed up this calculation and/or clean up the code to trim off time?
The irony is that I only need the matrix to calculate a very simple metric, but it requires the entire matrix to find that metric (Average Hausdorff distance).
Data example where each column represents a dimension of the shape and each row is a point in the shape:
first_configuration <- matrix(1:6,2,3)
second_configuration <- matrix(6:11,2,3)
colnames(first_configuration) <- c("x","y","z")
colnames(second_configuration) <- c("x","y","z")
This code calculates the Euclidean distances between coordinates:
m <- nrow(first_configuration)
n <- nrow(second_configuration)
D <- sqrt(pmax(matrix(rep(apply(first_configuration * first_configuration, 1, sum), n), m, n, byrow = F) + matrix(rep(apply(second_configuration * second_configuration, 1, sum), m), m, n, byrow = T) - 2 * first_configuration %*% t(second_configuration), 0))
D
Output:
[,1] [,2]
[1,] 8.660254 10.392305
[2,] 6.928203 8.660254
EDIT: included hausdorff average code
d1 <- mean(apply(D, 1, min))
d2 <- mean(apply(D, 2, min))
average_hausdorff <- mean(c(d1, d2))
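The metric only needs the row minima and column minima of D, so it can be accumulated one point at a time without ever storing the matrix. Here is a minimal sketch of that idea in plain Python, using the small example configurations from above written out row by row. The function names are illustrative, and this says nothing about speed, only about memory.

```python
import math

def mean_min_distance(a, b):
    """Mean over the points of a of the distance to the nearest point of b."""
    total = 0.0
    for p in a:
        best_sq = math.inf
        for q in b:
            d_sq = sum((pi - qi) ** 2 for pi, qi in zip(p, q))
            if d_sq < best_sq:
                best_sq = d_sq
        total += math.sqrt(best_sq)  # one square root per point, not per pair
    return total / len(a)

def average_hausdorff(a, b):
    """Average of the forward and backward mean minimum distances."""
    return 0.5 * (mean_min_distance(a, b) + mean_min_distance(b, a))

first_configuration = [[1, 3, 5], [2, 4, 6]]
second_configuration = [[6, 8, 10], [7, 9, 11]]
print(average_hausdorff(first_configuration, second_configuration))
```

Comparing squared distances and taking the square root only of the minimum saves one sqrt call per pair of points.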
EDIT (Rcpp solution):
Here is my attempt to implement it in Rcpp so the matrix is never saved to memory. Working now but very slow.
sourceCpp(code=
#include <Rcpp.h>
#include <limits>
using namespace Rcpp;
// [[Rcpp::export]]
double edist_rcpp(NumericVector x, NumericVector y){
double d = sqrt( sum( pow(x - y, 2) ) );
return d;
}
// [[Rcpp::export]]
double avg_hausdorff_rcpp(NumericMatrix x, NumericMatrix y){
int nrowx = x.nrow();
int nrowy = y.nrow();
double new_low_x = std::numeric_limits<int>::max();
double new_low_y = std::numeric_limits<int>::max();
double mean_forward = 0;
double mean_backward = 0;
double mean_hd;
double td;
//forward
for(int i = 0; i < nrowx; i++) {
for(int j = 0; j < nrowy; j++) {
NumericVector v1 = x.row(i);
NumericVector v2 = y.row(j);
td = edist_rcpp(v1, v2);
if(td < new_low_x) {
new_low_x = td;
}
}
mean_forward = mean_forward + new_low_x;
new_low_x = std::numeric_limits<int>::max();
}
//backward
for(int i = 0; i < nrowy; i++) {
for(int j = 0; j < nrowx; j++) {
NumericVector v1 = y.row(i);
NumericVector v2 = x.row(j);
td = edist_rcpp(v1, v2);
if(td < new_low_y) {
new_low_y = td;
}
}
mean_backward = mean_backward + new_low_y;
new_low_y = std::numeric_limits<int>::max();
}
//hausdorff mean
mean_hd = (mean_forward / nrowx + mean_backward / nrowy) / 2;
return mean_hd;
}
)
EDIT (RcppParallel solution):
Definitely faster than the serial Rcpp solution and most certainly the R solution. If anyone has tips on how to improve my RcppParallel code to trim off some extra time it would be much appreciated!
sourceCpp(code=
#include <Rcpp.h>
#include <RcppParallel.h>
#include <limits>
// [[Rcpp::depends(RcppParallel)]]
struct minimum_euclidean_distances : public RcppParallel::Worker {
//Input
const RcppParallel::RMatrix<double> a;
const RcppParallel::RMatrix<double> b;
//Output
RcppParallel::RVector<double> medm;
minimum_euclidean_distances(const Rcpp::NumericMatrix a, const Rcpp::NumericMatrix b, Rcpp::NumericVector medm) : a(a), b(b), medm(medm) {}
void operator() (std::size_t begin, std::size_t end) {
for(std::size_t i = begin; i < end; i++) {
double new_low = std::numeric_limits<double>::max();
for(std::size_t j = 0; j < b.nrow(); j++) {
double dsum = 0;
for(std::size_t z = 0; z < b.ncol(); z++) {
dsum = dsum + pow(a(i,z) - b(j,z), 2);
}
dsum = pow(dsum, 0.5);
if(dsum < new_low) {
new_low = dsum;
}
}
medm[i] = new_low;
}
}
};
// [[Rcpp::export]]
double mean_directional_hausdorff_rcpp(Rcpp::NumericMatrix a, Rcpp::NumericMatrix b){
Rcpp::NumericVector medm(a.nrow());
minimum_euclidean_distances minimum_euclidean_distances(a, b, medm);
RcppParallel::parallelFor(0, a.nrow(), minimum_euclidean_distances);
double results = Rcpp::sum(medm);
results = results / a.nrow();
return results;
}
// [[Rcpp::export]]
double max_directional_hausdorff_rcpp(Rcpp::NumericMatrix a, Rcpp::NumericMatrix b){
Rcpp::NumericVector medm(a.nrow());
minimum_euclidean_distances minimum_euclidean_distances(a, b, medm);
RcppParallel::parallelFor(0, a.nrow(), minimum_euclidean_distances);
double results = Rcpp::max(medm);
return results;
}
)
Benchmarks using large point clouds of sizes 37,775 and 36,659:
//Rcpp serial solution
system.time(avg_hausdorff_rcpp(ll,rr))
user system elapsed
409.143 0.000 409.105
//RcppParallel solution
system.time(mean(c(mean_directional_hausdorff_rcpp(ll,rr), mean_directional_hausdorff_rcpp(rr,ll))))
user system elapsed
260.712 0.000 33.265
I tried using JuliaCall to calculate the average Hausdorff distance.
JuliaCall embeds Julia in R.
I only tried a serial solution in JuliaCall. It seems to be faster than the RcppParallel and the Rcpp serial solutions in the question, but I don't have the benchmark data. Since support for parallel computation is built into Julia, a parallel version in Julia should be possible to write without much difficulty. I will update my answer after finding that out.
Below is the julia file I wrote:
# Calculate the min distance from the k-th point in as to the points in bs
function min_dist_from(k, as, bs)
n = size(bs, 1)
p = size(bs, 2)
dist = Inf
for i in 1:n
r = 0.0
for j in 1:p
r += (as[k, j] - bs[i, j]) ^ 2
## if r is already greater than the current minimum,
## then there is no need to continue doing the calculation
if r > dist
break
end
end
if r < dist
dist = r
end
end
sqrt(dist)
end
function avg_min_dist_from(as, bs)
distsum = 0.0
n1 = size(as, 1)
for k in 1:n1
distsum += min_dist_from(k, as, bs)
end
distsum / n1
end
function hausdorff_avg_dist(as, bs)
(avg_min_dist_from(as, bs) + avg_min_dist_from(bs, as)) / 2
end
And this is the R code to use the julia function:
first_configuration <- matrix(1:6,2,3)
second_configuration <- matrix(6:11,2,3)
colnames(first_configuration) <- c("x","y","z")
colnames(second_configuration) <- c("x","y","z")
m <- nrow(first_configuration)
n <- nrow(second_configuration)
D <- sqrt(matrix(rep(apply(first_configuration * first_configuration, 1, sum), n), m, n, byrow = F) + matrix(rep(apply(second_configuration * second_configuration, 1, sum), m), m, n, byrow = T) - 2 * first_configuration %*% t(second_configuration))
D
d1 <- mean(apply(D, 1, min))
d2 <- mean(apply(D, 2, min))
average_hausdorff <- mean(c(d1, d2))
library(JuliaCall)
## the first time of julia_setup could be quite time consuming
julia_setup()
## source the julia file which has our hausdorff_avg_dist function
julia_source("hausdorff.jl")
## check if the julia function is correct with the example
average_hausdorff_julia <- julia_call("hausdorff_avg_dist",
first_configuration,
second_configuration)
## generate some large random point clouds
n1 <- 37775
n2 <- 36659
as <- matrix(rnorm(n1 * 3), n1, 3)
bs <- matrix(rnorm(n2 * 3), n2, 3)
system.time(julia_call("hausdorff_avg_dist", as, bs))
The time on my laptop was less than 20 seconds. Note that this is the performance of the serial version in JuliaCall! I used the same data to test the Rcpp serial solution in the question, which took more than 10 minutes to run. I don't have RcppParallel on my laptop now, so I can't try that. And as I said, Julia has built-in support for parallel computation.

Parallel computation of a quadratic term in Rcpp

Let Y and K be an n-dimensional (column) vector and n by n matrix, respectively. Think of Y and K as a sample vector and its covariance matrix.
Corresponding to each entry of Y (say Yi) there is a row vector (of size 2) Si encoding the location of the sample in a two dimensional space. Construct the n by 2 matrix S by concatenating all the Si vectors. The ij-th entry of K is of the form
Kij= f( |si-sj|, b )
in which |.| denotes the usual Euclidean norm, f is the covariance function and b represents the covariance parameters. For instance, for the powered exponential covariance we have f(x) = exp( -(|x|/r)^q ) and b = (r,q).
The goal is to compute the following quantity in Rcpp, in parallel. (Y^T stands for Y transpose and ||K||^2 denotes the sum of the squared entries of K.)
Y^T K Y / ||K||^2
Here is the piece of code I've written to do the job. While running, RStudio runs out of memory after a few seconds and the following message is displayed: "R encountered a fatal error. The session was terminated". I've very recently started using OpenMP in Rcpp and I have no idea why this happens! Can anybody tell me what I have done wrong here?
#include <Rcpp.h>
#include<math.h>
#include<omp.h>
// [[Rcpp::plugins(openmp)]]
using namespace Rcpp;
// [[Rcpp::export]]
double InnerProd(NumericVector x, NumericVector y) {
int n = x.size();
double total = 0;
for(int i = 0; i < n; ++i) {
total += x[i]*y[i];
}
return total;
}
// [[Rcpp::export]]
double CorFunc(double r, double range_param, double beta) {
double q,x;
x = r/range_param;
q = exp( -pow(x,beta) );
return(q);
}
// [[Rcpp::export]]
double VarianceComp( double range, NumericVector Y, NumericMatrix s, double
beta, int t ){
int n,i,j;
double Numer = 0, Denom = 0, dist, CorVal, ObjVal;
NumericVector DistVec;
n = Y.size();
omp_set_num_threads(t);
# pragma omp parallel for private(DistVec,CorVal,dist,j) \
reduction(+:Numer,Denom)
for( i = 0; i < n; ++i) {
for( j = 0; j < n; ++j){
DistVec = ( s(i,_)-s(j,_) );
dist = sqrt( InnerProd(DistVec,DistVec) );
CorVal = CorFunc(dist,range,beta);
Numer += Y[i]*Y[j]*CorVal/n;
Denom += pow( CorVal, 2 )/n;
}
}
ObjVal = Numer/Denom;
return( ObjVal );
}
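For reference, the target quantity can be written with scalar arithmetic only, so no temporary vector is needed inside the double loop. The following serial Python sketch is meant only to pin down the arithmetic. It assumes the powered exponential covariance f(x) = exp(-(x/r)^q) from the description, and the name variance_ratio is made up. It is not a parallel implementation and does not by itself explain the crash.

```python
import math

def variance_ratio(y, s, r, q):
    """Y^T K Y divided by the sum of squared entries of K, with K never stored."""
    n = len(y)
    numer = 0.0
    denom = 0.0
    for i in range(n):
        for j in range(n):
            # Euclidean distance between the two-dimensional locations i and j
            dist = math.hypot(s[i][0] - s[j][0], s[i][1] - s[j][1])
            k_ij = math.exp(-((dist / r) ** q))
            numer += y[i] * y[j] * k_ij
            denom += k_ij * k_ij
    return numer / denom

y = [1.0, 2.0]
s = [[0.0, 0.0], [1.0, 0.0]]
print(variance_ratio(y, s, r=1.0, q=1.0))
```

The division by n in the original code cancels between numerator and denominator, so it is left out here.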

Optimization of Fibonacci sequence generating algorithm

As we all know, the simplest algorithm to generate the Fibonacci sequence is as follows:
if(n<=0) return 0;
else if(n==1) return 1;
else return f(n-1) + f(n-2);
But this algorithm has some repetitive calculation. For example, if you calculate f(5), it will calculate f(4) and f(3). When you calculate f(4), it will again calculate both f(3) and f(2). Could someone give me a more time-efficient recursive algorithm?
I have read about some methods for calculating Fibonacci numbers with efficient time complexity; the following are some of them -
Method 1 - Dynamic Programming
The substructure here is commonly known, so I'll jump straight to the solution -
static int fib(int n)
{
int f[] = new int[n+2]; // 1 extra to handle case, n = 0
int i;
f[0] = 0;
f[1] = 1;
for (i = 2; i <= n; i++)
{
f[i] = f[i-1] + f[i-2];
}
return f[n];
}
A space-optimized version of the above can be done as follows -
static int fib(int n)
{
int a = 0, b = 1, c;
if (n == 0)
return a;
for (int i = 2; i <= n; i++)
{
c = a + b;
a = b;
b = c;
}
return b;
}
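Since the question asks specifically for a recursive algorithm, the same dynamic-programming idea can also be written top-down: keep the naive recursion but cache every value the first time it is computed. This is a minimal Python sketch using the standard-library functools.lru_cache, offered as one possible illustration rather than as part of the methods listed here.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    """Recursive Fibonacci where each value is computed once and then cached."""
    if n <= 0:
        return 0
    if n == 1:
        return 1
    return fib(n - 1) + fib(n - 2)

print([fib(i) for i in range(10)])  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```

With the cache, each f(k) is evaluated once, so the running time drops from exponential to linear in n. The recursion depth still grows with n, which limits how large n can be in practice.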
Method 2- ( Using power of the matrix {{1,1},{1,0}} )
This is an O(n) method which relies on the fact that if we multiply the matrix M = {{1,1},{1,0}} by itself n times (in other words, calculate power(M, n)), then we get the (n+1)th Fibonacci number as the element at row and column (0, 0) of the resulting matrix.
The matrix representation gives the following closed expression for the Fibonacci numbers:
{{1,1},{1,0}}^n = {{F(n+1), F(n)}, {F(n), F(n-1)}}
static int fib(int n)
{
int F[][] = new int[][]{{1,1},{1,0}};
if (n == 0)
return 0;
power(F, n-1);
return F[0][0];
}
/*multiplies 2 matrices F and M of size 2*2, and
puts the multiplication result back to F[][] */
static void multiply(int F[][], int M[][])
{
int x = F[0][0]*M[0][0] + F[0][1]*M[1][0];
int y = F[0][0]*M[0][1] + F[0][1]*M[1][1];
int z = F[1][0]*M[0][0] + F[1][1]*M[1][0];
int w = F[1][0]*M[0][1] + F[1][1]*M[1][1];
F[0][0] = x;
F[0][1] = y;
F[1][0] = z;
F[1][1] = w;
}
/*function that calculates F[][] raise to the power n and puts the
result in F[][]*/
static void power(int F[][], int n)
{
int i;
int M[][] = new int[][]{{1,1},{1,0}};
// multiply F by M a total of n - 1 times
for (i = 2; i <= n; i++)
multiply(F, M);
}
This can be optimized to work in O(log n) time complexity. We can do recursive multiplication to get power(M, n) in the previous method.
static int fib(int n)
{
int F[][] = new int[][]{{1,1},{1,0}};
if (n == 0)
return 0;
power(F, n-1);
return F[0][0];
}
static void multiply(int F[][], int M[][])
{
int x = F[0][0]*M[0][0] + F[0][1]*M[1][0];
int y = F[0][0]*M[0][1] + F[0][1]*M[1][1];
int z = F[1][0]*M[0][0] + F[1][1]*M[1][0];
int w = F[1][0]*M[0][1] + F[1][1]*M[1][1];
F[0][0] = x;
F[0][1] = y;
F[1][0] = z;
F[1][1] = w;
}
static void power(int F[][], int n)
{
if( n == 0 || n == 1)
return;
int M[][] = new int[][]{{1,1},{1,0}};
power(F, n/2);
multiply(F, F);
if (n%2 != 0)
multiply(F, M);
}
Method 3 (O(log n) Time)
Below is one more interesting recurrence formula that can be used to find nth Fibonacci Number in O(log n) time.
If n is even then k = n/2:
F(n) = [2*F(k-1) + F(k)]*F(k)
If n is odd then k = (n + 1)/2
F(n) = F(k)*F(k) + F(k-1)*F(k-1)
How does this formula work?
The formula can be derived from the above matrix equation.
{{1,1},{1,0}}^n = {{F(n+1), F(n)}, {F(n), F(n-1)}}
Taking determinant on both sides, we get
(-1)^n = F(n+1)*F(n-1) - F(n)^2
Moreover, since A^n A^m = A^(n+m) for any square matrix A, the following identities can be derived (they are obtained from two different coefficients of the matrix product)
F(m)*F(n) + F(m-1)*F(n-1) = F(m+n-1)
By putting n = n+1,
F(m)*F(n+1) + F(m-1)*F(n) = F(m+n)
Putting m = n
F(2n-1) = F(n)^2 + F(n-1)^2
F(2n) = (F(n-1) + F(n+1))*F(n) = (2*F(n-1) + F(n))*F(n) (Source: Wiki)
To get the formula to be proved, we simply need to do the following
If n is even, we can put k = n/2
If n is odd, we can put k = (n+1)/2
// Memo table; the size 1000 is an assumed upper bound on n.
// Note that int overflows beyond fib(46).
static int[] f = new int[1000];
public static int fib(int n)
{
    if (n == 0)
        return 0;
    if (n == 1 || n == 2)
        return (f[n] = 1);
    // If fib(n) is already computed
    if (f[n] != 0)
        return f[n];
    int k = (n & 1) == 1 ? (n + 1) / 2 : n / 2;
    // Apply the formula above; (n & 1) is 1 if n is odd, else 0.
    f[n] = (n & 1) == 1 ? (fib(k) * fib(k) + fib(k - 1) * fib(k - 1))
                        : (2 * fib(k - 1) + fib(k)) * fib(k);
    return f[n];
}
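As a cross-check of the recurrence above, here is a minimal Python sketch of the same even/odd formula. The name fib_doubling and the use of functools.lru_cache for memoization are choices made for this illustration, not part of the original answer. Python integers are arbitrary precision, so there is no overflow.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib_doubling(n):
    """Return F(n) using the even/odd recurrence from Method 3."""
    if n == 0:
        return 0
    if n <= 2:
        return 1
    if n % 2 == 1:
        # n odd: k = (n + 1) / 2 and F(n) = F(k)^2 + F(k-1)^2
        k = (n + 1) // 2
        return fib_doubling(k) ** 2 + fib_doubling(k - 1) ** 2
    # n even: k = n / 2 and F(n) = (2*F(k-1) + F(k)) * F(k)
    k = n // 2
    return (2 * fib_doubling(k - 1) + fib_doubling(k)) * fib_doubling(k)
```

Each call roughly halves n, so only O(log n) distinct arguments are ever evaluated once results are cached.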
Method 4 - Using a formula
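In this method, we directly implement the closed-form formula for the nth term of the Fibonacci series. Time O(1), space O(1), assuming Math.pow runs in constant time. Because the computation uses doubles and returns an int, the result is only reliable for small n.
F(n) = round({[(√5 + 1)/2] ^ n} / √5)
static int fib(int n) {
    double phi = (1 + Math.sqrt(5)) / 2;
    return (int) Math.round(Math.pow(phi, n)
                            / Math.sqrt(5));
}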
Reference: http://www.maths.surrey.ac.uk/hosted-sites/R.Knott/Fibonacci/fibFormula.html
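The closed form is evaluated in floating point, so it can only be exact for small n. A rough Python sketch that looks for the first n where the rounded formula disagrees with an exact integer loop; the helper names fib_binet, fib_exact and first_mismatch are made up for this illustration:

```python
import math

def fib_binet(n):
    """Closed-form F(n) evaluated with 64-bit floats."""
    phi = (1 + math.sqrt(5)) / 2
    return round(phi ** n / math.sqrt(5))

def fib_exact(n):
    """Exact F(n) using Python's arbitrary-precision integers."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def first_mismatch(limit=200):
    """Return the smallest n < limit where the float formula is wrong, or None."""
    for n in range(limit):
        if fib_binet(n) != fib_exact(n):
            return n
    return None
```

With IEEE 754 doubles a mismatch has to appear by n = 79 at the latest, because F(79) is odd and larger than 2^53, so no double can represent it exactly.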
Look here for an implementation in Erlang which uses the matrix formula. It shows nice linear resulting behavior because in the O(M(n) log n) part M(n) is exponential for big numbers. It calculates fib of one million in 2 s, where the result has 208988 digits. The trick is that you can compute exponentiation in O(log n) multiplications using a (tail) recursive formula (tail-recursive means O(1) space when a proper compiler is used or when it is rewritten as a loop):
% compute X^N
power(X, N) when is_integer(N), N >= 0 ->
    power(N, X, 1).
power(0, _, Acc) ->
    Acc;
power(N, X, Acc) ->
    if N rem 2 =:= 1 ->
            power(N - 1, X, Acc * X);
        true ->
            power(N div 2, X * X, Acc)
    end.
where you substitute matrices for X and Acc. X is initialized with the Fibonacci matrix [[1, 1], [1, 0]] and Acc with the identity matrix I = [[1, 0], [0, 1]].
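The same accumulator scheme can be sketched in Python with 2x2 matrices stored as nested tuples. This is only an illustration under the assumption that X starts as the Fibonacci matrix [[1, 1], [1, 0]] and Acc as the identity; the names mat_mul, mat_power and fib are not from the linked Erlang code.

```python
def mat_mul(a, b):
    """Multiply two 2x2 matrices given as nested tuples."""
    return (
        (a[0][0] * b[0][0] + a[0][1] * b[1][0], a[0][0] * b[0][1] + a[0][1] * b[1][1]),
        (a[1][0] * b[0][0] + a[1][1] * b[1][0], a[1][0] * b[0][1] + a[1][1] * b[1][1]),
    )

def mat_power(x, n):
    """Compute x^n with O(log n) multiplications, written as a loop."""
    acc = ((1, 0), (0, 1))  # identity matrix
    while n > 0:
        if n % 2 == 1:
            # odd exponent: peel off one factor into the accumulator
            acc = mat_mul(acc, x)
            n -= 1
        else:
            # even exponent: square the base and halve the exponent
            x = mat_mul(x, x)
            n //= 2
    return acc

def fib(n):
    """F(n) is the off-diagonal entry of [[1, 1], [1, 0]]^n."""
    return mat_power(((1, 1), (1, 0)), n)[0][1]
```

The loop body mirrors the two branches of the Erlang power/3 clause, so the space used by the recursion is replaced by two local variables.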
One simple way is to calculate it iteratively instead of recursively. This will calculate F(n) in linear time.
def fib(n):
    a,b = 0,1
    for i in range(n):
        a,b = a+b,a
    return a
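Hint: One way to achieve faster results is by using Binet's formula:
F(n) = (φ^n - ψ^n)/√5, where φ = (1 + √5)/2 and ψ = (1 - √5)/2
Here is a way of doing it in Python. The constants are computed at a precision that grows with n, because truncated values such as 1.6180339 give wrong results even for small n:
from decimal import Decimal, getcontext
def fib(n):
    getcontext().prec = n + 10  # generous: F(n) has only about 0.21*n digits
    sqrt5 = Decimal(5).sqrt()
    phi, psi = (1 + sqrt5) / 2, (1 - sqrt5) / 2
    return int(((phi**n - psi**n) / sqrt5).to_integral_value())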
You can save your results and reuse them:
public static long[] fibs;
public long fib(int n) {
    fibs = new long[n];
    return internalFib(n);
}
public long internalFib(int n) {
    if (n<=2) return 1;
    // fibs[i] caches fib(i); 0 means "not computed yet"
    fibs[n-1] = fibs[n-1]==0 ? internalFib(n-1) : fibs[n-1];
    fibs[n-2] = fibs[n-2]==0 ? internalFib(n-2) : fibs[n-2];
    return fibs[n-1]+fibs[n-2];
}
F(n) = (φ^n)/√5, rounded to the nearest integer, where φ is the golden ratio.
φ^n can be calculated in O(lg(n)) time, hence F(n) can be calculated in O(lg(n)) time.
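Floating-point powers of φ lose accuracy for large n, so here is a hedged sketch that computes φ^n exactly instead. It swaps in a different representation from the one the answer implies: a number a + b·φ is stored as the integer pair (a, b), and φ^2 = φ + 1 is used to multiply pairs. With that, φ^n = F(n-1) + F(n)·φ, so F(n) is the second component. The names mul and fib_phi are invented for this example.

```python
def mul(p, q):
    """Multiply (a + b*phi) by (c + d*phi), using phi^2 = phi + 1."""
    a, b = p
    c, d = q
    return (a * c + b * d, a * d + b * c + b * d)

def fib_phi(n):
    """Return F(n) as the phi-coefficient of phi^n, via repeated squaring."""
    result = (1, 0)  # the number 1
    base = (0, 1)    # the number phi
    while n > 0:
        if n & 1:
            result = mul(result, base)
        base = mul(base, base)
        n >>= 1
    return result[1]
```

This uses O(lg(n)) pair multiplications and stays exact because everything is integer arithmetic.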
// D Programming Language
import std.stdio;
void vFibonacci ( const ulong X, const ulong Y, const int Limit ) {
    // Stop after 10 numbers. The XOR test ( Limit ^ 0xA ) is equivalent to
    // ( Limit != 10 ) but harder to read, so the plain comparison is used here.
    if ( Limit != 10 ) {
        write ( Y, " " ) ;
        vFibonacci ( Y, Y + X, Limit + 1 ) ;
    }
}
// Call As
// By Default the Limit is 10 Numbers
vFibonacci ( 0, 1, 0 ) ;
EDIT: I actually think Hynek Vychodil's answer is superior to mine, but I'm leaving this here just in case someone is looking for an alternate method.
I think the other methods are all valid, but not optimal. Using Binet's formula should give you the right answer in principle, but rounding to the closest integer will cause problems for large values of n. The other solutions unnecessarily recalculate the values up to n every time you call the function, so the function is not optimized for repeated calls.
In my opinion the best thing to do is to define a global array and then add new values to it only if needed. In Python:
import numpy
fibo = numpy.array([1, 1])
last_index = fibo.size
def fib(n):
    global fibo, last_index
    if n > 0:
        if n > last_index:
            # extend the cached array up to index n
            for i in range(last_index + 1, n + 1):
                fibo = numpy.concatenate((fibo, numpy.array([fibo[i-2] + fibo[i-3]])))
            last_index = fibo.size
        return fibo[n-1]
    else:
        print("fib called for index less than 1")
        quit()
Naturally, numpy's fixed-width integers overflow for large n (with 64-bit integers, F(93) no longer fits), so beyond that point you need arbitrary precision integers, which is easy to do in Python because the built-in int type has no size limit.
This runs in O(n) time and prints the first n Fibonacci numbers:
def fibo(n):
    a, b = 0, 1
    for i in range(n):
        if i == 0:
            print(i)
        elif i == 1:
            print(i)
        else:
            temp = a
            a = b
            b += temp
            print(b)
n = int(input())
fibo(n)
