Memory-efficient method to calculate distance matrix [duplicate]

I have an object of class big.matrix in R with dimension 778844 x 2. The values are all integers (kilometres). My objective is to calculate the Euclidean distance matrix from the big.matrix and obtain the result as an object of class big.matrix. I would like to know whether there is an optimal way of doing that.
The reason for my choice of the class big.matrix is memory limitation. I could convert my big.matrix to an ordinary matrix and compute the Euclidean distance matrix with dist(). However, dist() would return an object far too large to fit in memory: with n = 778,844 points, even the lower triangle it returns holds n(n-1)/2, roughly 3 x 10^11 distances, which is on the order of 2.4 TB stored as doubles.
Edit
The following answer was given by John W. Emerson, author and maintainer of the bigmemory package:
You could use big algebra I expect, but this would also be a very nice use case for Rcpp via sourceCpp(), and very short and easy. But in short, we don't even attempt to provide high-level features (other than the basics which we implemented as proof-of-concept). No single algorithm could cover all use cases once you start talking out-of-memory big.

Here is a way using RcppArmadillo. Much of this is very similar to the RcppGallery example. This will return a big.matrix with the associated pairwise (by row) Euclidean distances. I like to wrap my big.matrix functions in a wrapper function for cleaner syntax (i.e. to avoid the @address and other initializations).
Note: since we are using bigmemory (and are therefore concerned with RAM usage), this example returns the (N-1) x (N-1) matrix containing only the lower-triangular elements. You could modify this, but it is what I threw together.
euc_dist.cpp
// To enable the functionality provided by Armadillo's various macros,
// simply include them before you include the RcppArmadillo headers.
#define ARMA_NO_DEBUG
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo, BH, bigmemory)]]
using namespace Rcpp;
using namespace arma;
// The following header file provides the definitions for the BigMatrix
// object
#include <bigmemory/BigMatrix.h>
// C++11 plugin
// [[Rcpp::plugins(cpp11)]]
template <typename T>
void BigArmaEuclidean(const Mat<T>& inBigMat, Mat<T> outBigMat) {
    int W = inBigMat.n_rows;
    for (int i = 0; i < W - 1; i++) {
        for (int j = i + 1; j < W; j++) {
            outBigMat(j - 1, i) = sqrt(sum(pow((inBigMat.row(i) - inBigMat.row(j)), 2)));
        }
    }
}
// [[Rcpp::export]]
void BigArmaEuc(SEXP pInBigMat, SEXP pOutBigMat) {
    // First we tell Rcpp that the object we've been given is an external
    // pointer.
    XPtr<BigMatrix> xpMat(pInBigMat);
    XPtr<BigMatrix> xpOutMat(pOutBigMat);

    int type = xpMat->matrix_type();
    switch(type) {
    case 1:
        BigArmaEuclidean(
            arma::Mat<char>((char *)xpMat->matrix(), xpMat->nrow(), xpMat->ncol(), false),
            arma::Mat<char>((char *)xpOutMat->matrix(), xpOutMat->nrow(), xpOutMat->ncol(), false)
        );
        return;
    case 2:
        BigArmaEuclidean(
            arma::Mat<short>((short *)xpMat->matrix(), xpMat->nrow(), xpMat->ncol(), false),
            arma::Mat<short>((short *)xpOutMat->matrix(), xpOutMat->nrow(), xpOutMat->ncol(), false)
        );
        return;
    case 4:
        BigArmaEuclidean(
            arma::Mat<int>((int *)xpMat->matrix(), xpMat->nrow(), xpMat->ncol(), false),
            arma::Mat<int>((int *)xpOutMat->matrix(), xpOutMat->nrow(), xpOutMat->ncol(), false)
        );
        return;
    case 8:
        BigArmaEuclidean(
            arma::Mat<double>((double *)xpMat->matrix(), xpMat->nrow(), xpMat->ncol(), false),
            arma::Mat<double>((double *)xpOutMat->matrix(), xpOutMat->nrow(), xpOutMat->ncol(), false)
        );
        return;
    default:
        // We should never get here, but it resolves compiler warnings.
        throw Rcpp::exception("Undefined type for provided big.matrix");
    }
}
My little wrapper
bigMatrixEuc <- function(bigMat){
    zeros <- big.matrix(nrow = nrow(bigMat) - 1,
                        ncol = nrow(bigMat) - 1,
                        init = 0,
                        type = typeof(bigMat))
    BigArmaEuc(bigMat@address, zeros@address)
    return(zeros)
}
The test
library(Rcpp)
sourceCpp("euc_dist.cpp")
library(bigmemory)
set.seed(123)
mat <- matrix(rnorm(16), 4)
bm <- as.big.matrix(mat)
# Call the new Euclidean function
bm_out <- bigMatrixEuc(bm)[]
# pull out the matrix elements for our purposes
distMat <- as.matrix(dist(mat))
distMat[upper.tri(distMat, diag=TRUE)] <- 0
distMat <- distMat[2:4, 1:3]
# check if identical
all.equal(bm_out, distMat, check.attributes = FALSE)
[1] TRUE
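For the questioner's 778,844 x 2 input, even this lower-triangular result is far too large to hold in RAM, so in practice the output big.matrix would have to be file-backed. A rough sketch (the backing/descriptor file names are made up, and this is untested at that scale):
library(bigmemory)
# back the result with a file instead of RAM; BigArmaEuc() only ever sees the
# pointer behind @address, so a file-backed big.matrix works the same way
zeros <- filebacked.big.matrix(nrow = nrow(bm) - 1,
                               ncol = nrow(bm) - 1,
                               init = 0,
                               type = typeof(bm),  # the switch above assumes input and output share a type
                               backingfile = "dist.bin",
                               descriptorfile = "dist.desc")
BigArmaEuc(bm@address, zeros@address)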

Related

Setting colnames in R's cpp11

Given that cpp11 does not provide any "sugar", we need to use attributes.
I am trying to set colnames in a C++ function, as in the following MWE:
#include "cpp11.hpp"
using namespace cpp11;
// THIS WORKS
[[cpp11::register]]
doubles cpp_names(writable::doubles X) {
X.attr("names") = {"a","b"};
return X;
}
// THIS WON'T
[[cpp11::register]]
doubles_matrix<> cpp_colnames(writable::doubles_matrix<> X) {
X.attr("dimnames") = list(NULL, {"A","B"});
return X;
}
How can I pass a list with two vectors, one NULL and the other c("A", "B"), so that it is correctly converted to a SEXP?
In Rcpp, one would do colnames(X) = {"a","b"}.
My approach may well be wrong.
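One possible workaround (a hedged sketch, not verified against every cpp11 release; cpp_colnames2 is a made-up name): build the dimnames list element by element, so the NULL and the character vector are each converted to SEXP explicitly before the attribute is assigned.
#include "cpp11.hpp"

using namespace cpp11;

// [[cpp11::register]]
doubles_matrix<> cpp_colnames2(writable::doubles_matrix<> X) {
  writable::list dimnames(2);                   // list(NULL, c("A", "B")) built piecewise
  dimnames[0] = R_NilValue;                     // no row names
  dimnames[1] = writable::strings({"A", "B"});  // column names
  X.attr("dimnames") = dimnames;
  return X;
}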

optimParallel can not find Rcpp function [duplicate]

This question already has answers here: "Using Rcpp within parallel code via snow to make a cluster" (see that entry below).

How is noNA used in Rcpp?

In his "Advanced R" book, Hadley Wickham says "noNA(x) asserts that the vector x does not contain any missing values." However I still don't know how to use it. I can't do
if (noNA(x))
do this
so how am I supposed to use it?
http://adv-r.had.co.nz/Rcpp.html#rcpp-sugar
Many of the Rcpp sugar expressions are implemented through template classes which have specializations for cases when the input object is known to be free of missing values, thereby allowing the underlying algorithm to avoid having to perform the extra work of dealing with NA values (e.g. calls to is_na). This is only possible because the VectorBase class has a boolean parameter indicating whether the underlying object can (can, not that it necessarily does) have NA values, or not.
noNA returns (when called on a VectorBase object) an instance of the Nona template class. Note that Nona itself derives from
Rcpp::VectorBase<RTYPE, false, Nona<RTYPE,NA,VECTOR>>
// ^^^^^
meaning that the returned object gets encoded with information that essentially says "you can assume that this data is free of NA values".
As an example, Rcpp::sum is implemented via the Sum class in the Rcpp::sugar namespace. In the default case, we see that there is extra work to manage the possibility of missing values:
STORAGE get() const {
    STORAGE result = 0 ;
    R_xlen_t n = object.size() ;
    STORAGE current ;
    for( R_xlen_t i=0; i<n; i++){
        current = object[i] ;
        if( Rcpp::traits::is_na<RTYPE>(current) )   // here
            return Rcpp::traits::get_na<RTYPE>() ;  // here
        result += current ;
    }
    return result ;
}
On the other hand, there is also a specialization for cases when the input does not have missing values, in which the algorithm does less work:
STORAGE get() const {
    STORAGE result = 0 ;
    R_xlen_t n = object.size() ;
    for( R_xlen_t i=0; i<n; i++){
        result += object[i] ;
    }
    return result ;
}
To answer your question of "how do I apply this in practice?", here is an example:
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
int Sum(IntegerVector x) {
    return sum(x);
}

// [[Rcpp::export]]
int SumNoNA(IntegerVector x) {
    return sum(noNA(x));
}
Benchmarking these two functions,
set.seed(123)
x <- as.integer(rpois(1e6, 25))
all.equal(Sum(x), SumNoNA(x))
# [1] TRUE
microbenchmark::microbenchmark(
    Sum(x),
    SumNoNA(x),
    times = 500L
)
# Unit: microseconds
# expr min lq mean median uq max neval
# Sum(x) 577.386 664.620 701.2422 677.1640 731.7090 1214.447 500
# SumNoNA(x) 454.990 517.709 556.5783 535.1935 582.7065 1138.426 500
the noNA version is indeed faster.
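One caveat worth spelling out (my own illustration, not part of the original answer): noNA() is a promise, not a check. If the promise is broken, the specialized code simply sums the NA's internal representation along with the data, so the result is silently wrong rather than NA.
y <- c(x, NA_integer_)
Sum(y)      # NA, because the default path detects the missing value
SumNoNA(y)  # some meaningless number: the NA is treated as ordinary data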

Rcpp: extract subset of matrix using indexmatrix

I have a question about subsetting from a matrix to a vector. The user can explicitly supply the indexmatrix (a matrix of the same size as M, with 0 where an entry is not wanted and 1 where it should be extracted). If the indexmatrix is provided, we simply subset with it; if it is not provided (indexmatrix = NULL), we build it from type1 (which is TRUE or FALSE). Only two kinds of indexmatrix are possible.
I used the subsetting technique provided in
Subset of a Rcpp Matrix that matches a logical statement
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;

// [[Rcpp::export]]
arma::colvec extractElementsRcpp(arma::mat M,
                                 Rcpp::Nullable<Rcpp::NumericMatrix> indexmatrix = R_NilValue,
                                 bool type1 = false) {
    unsigned int D = M.n_rows;   // dimension of the data
    arma::mat indmatrix(D, D);   // initialize indexmatrix

    if (indexmatrix.isNotNull()) {
        // copy indexmatrix to numericmatrix
        Rcpp::NumericMatrix indexmatrixt(indexmatrix);
        // make indexmatrix into arma matrix indmatrix
        indmatrix = Rcpp::as<arma::mat>(indexmatrixt);
    } //else {
    //     // get indexmatrix
    //     Rcpp::NumericMatrix indexmatrixt = getindexmatrix(D, type1)["indexmatrix"];
    //     // make indexmatrix into arma matrix
    //     indmatrix = Rcpp::as<arma::mat>(indexmatrixt);
    // }

    arma::colvec unM = M.elem(find(indmatrix == 1));   // extract wanted elements
    return(unM);
}
It works, great! However, the speed is not what I was hoping for. Whenever the indexmatrix is provided, the C++ code is slower than the plain R code, while I was aiming for a nice improvement in speed. I have the feeling I'm copying the matrices around too much, but I am new to C++ and have not found a way to avoid it yet.
The speed comparison is as follows:
test replications elapsed relative
2 extractElementsR(M, indexmatrix = ind) 100 0.084 1.00
1 extractElementsRcpp(M, indexmatrix = ind) 100 0.142 1.69
EDIT: The R function is defined as
extractElementsR <- function (M, indexmatrix, type1 = FALSE) {
    D <- nrow(M)
    # # get indexmatrix, if necessary
    # if(is.null(indexmatrix)) indexmatrix <- getindexmatrix(D, type1 = type1)$indexmatrix
    # extract wanted elements
    return (M[which(indexmatrix > 0)])
}
One could for example take the matrices
M <- matrix(rnorm(1000^2), ncol = 1000)
indexmatrix <- matrix(1, 1000, 1000)
indexmatrix[lower.tri(indexmatrix)] <- 0
as M and indexmatrix.
EDIT2: I commented out the else branch in the Rcpp function and omitted the default NULL value in the R function, as it is not important for my question. I want to improve the speed of the Rcpp function when indexmatrix is provided. However, I want to keep the default NULL value (and create an indexmatrix when necessary).
Can you show the function extractElementsR() as well, and some example data, so that this becomes a reproducible example?
And at first blush, you are mixing Rcpp and RcppArmadillo types in order to subset with the latter. That will create lots of copies. We can now index with both Rcpp (Kevin has some answers here) and RcppArmadillo (several older answers), so you could even try two different ways.
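Following up on that comment, here is a minimal pure-Rcpp sketch (my own illustration, not code from the thread) that never converts the index matrix to an Armadillo type, so no extra copy is made when indexmatrix is supplied:
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector extractElementsRcpp2(NumericMatrix M, NumericMatrix indexmatrix) {
    std::vector<double> out;
    out.reserve(M.size());
    // both matrices are stored column-major, so linear indexing visits the
    // elements in the same order as M[which(indexmatrix > 0)] does in R
    for (R_xlen_t i = 0; i < M.size(); i++) {
        if (indexmatrix[i] > 0) out.push_back(M[i]);
    }
    return wrap(out);
}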

Using Rcpp within parallel code via snow to make a cluster

I've written a function in Rcpp and compiled it with inline. Now I want to run it in parallel on different cores, but I'm getting a strange error. Here's a minimal example: the function funCPP1 compiles and runs well by itself, but cannot be called by snow's clusterCall function. It runs fine as a single process, but gives the following error when run in parallel:
Error in checkForRemoteErrors(lapply(cl, recvResult)) :
2 nodes produced errors; first error: NULL value passed as symbol address
And here is some code:
## Load and compile
library(inline)
library(Rcpp)
library(snow)
src1 <- '
Rcpp::NumericMatrix xbem(xbe);
int nrows = xbem.nrow();
Rcpp::NumericVector gv(g);
for (int i = 1; i < nrows; i++) {
    xbem(i,_) = xbem(i-1,_) * gv[0] + xbem(i,_);
}
return xbem;
'
funCPP1 <- cxxfunction(signature(xbe = "numeric", g = "numeric"), body = src1, plugin = "Rcpp")
## Single process
A <- matrix(rnorm(400), 20,20)
funCPP1(A, 0.5)
## Parallel
cl <- makeCluster(2, type = "SOCK")
clusterExport(cl, 'funCPP1')
clusterCall(cl, funCPP1, A, 0.5)
Think it through -- what does inline do? It creates a C/C++ function for you, then compiles and links it into a dynamically-loadable shared library. Where does that one sit? In R's temp directory.
So you tried the right thing by shipping the R frontend that calls the shared library to the other processes (which have their own temp directories!), but that does not get the dll / so file there.
Hence the advice is to create a local package, install it, and have both snow processes load and call it.
(And as always: better-quality answers may be had on the rcpp-devel list, which is read by more Rcpp contributors than SO is.)
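A rough sketch of that package route (the package name myCppPkg and the file funCPP1.cpp are made up; the exact steps depend on your setup):
## one-off, on the master: put the C++ source into a small package and install it
# Rcpp::Rcpp.package.skeleton("myCppPkg", cpp_files = "funCPP1.cpp")
# install.packages("myCppPkg", repos = NULL, type = "source")

library(snow)
cl <- makeCluster(2, type = "SOCK")
clusterEvalQ(cl, library(myCppPkg))   # each worker loads its own copy of the dll / so
clusterCall(cl, function(x, g) funCPP1(x, g), A, 0.5)
stopCluster(cl)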
Old question, but I stumbled across it while looking through the top Rcpp tags so maybe this answer will be of use still.
I think Dirk's answer is proper when the code you've written is fully debugged and does what you want, but it can be a hassle to write a new package for such a small piece of code as in the example. What you can do instead is export the code block and a "helper" function that compiles the source code, then run the helper on each node. That makes the compiled function available there; another helper function then calls it by name. For instance:
# Snow must still be installed, but this functionality is now in "parallel", which ships with base R.
library(parallel)
# Keep your source as an object
src1 <- '
Rcpp::NumericMatrix xbem(xbe);
int nrows = xbem.nrow();
Rcpp::NumericVector gv(g);
for (int i = 1; i < nrows; i++) {
    xbem(i,_) = xbem(i-1,_) * gv[0] + xbem(i,_);
}
return xbem;
'
# Save the signature
sig <- signature(xbe = "numeric", g="numeric")
# make a function that compiles the source, then assigns the compiled function
# to the global environment
c.inline <- function(name, sig, src){
    library(Rcpp)
    funCXX <- inline::cxxfunction(sig = sig, body = src, plugin = "Rcpp")
    assign(name, funCXX, envir = .GlobalEnv)
}
# and the function which retrieves and calls this newly-compiled function
c.namecall <- function(name, ...){
    funCXX <- get(name)
    funCXX(...)
}
# Keep your example matrix
A <- matrix(rnorm(400), 20,20)
# What are we calling the compiled function?
fxname <- "TestCXX"
## Parallel
cl <- makeCluster(2, type = "PSOCK")
# Export all the pieces
clusterExport(cl, c("src1","c.inline","A","fxname"))
# Call the compiler function
clusterCall(cl, c.inline, name=fxname, sig=sig, src=src1)
# Notice how the function now named "TestCXX" is available in the environment
# of every node?
clusterCall(cl, ls, envir=.GlobalEnv)
# Call the function through our wrapper
clusterCall(cl, c.namecall, name=fxname, A, 0.5)
# Works with my testing
I've written a package ctools (shameless self-promotion) which wraps up a lot of the functionality that is in the parallel and Rhpc packages for cluster computing, both with PSOCK and MPI. I already have a function called "c.sourceCpp" which calls "Rcpp::sourceCpp" on every node in much the same way as above. I'm going to add in a "c.inlineCpp" which does the above now that I see the usefulness of it.
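For reference, the same idea with Rcpp::sourceCpp() directly (a sketch of my own; "my_funs.cpp" is a placeholder, and the workers must be able to see the file, e.g. on a local PSOCK cluster):
clusterCall(cl, function(f) {
    Rcpp::sourceCpp(f, env = globalenv())  # each worker compiles into its own temp dir
    ls(globalenv())                        # show what is now defined on the node
}, "my_funs.cpp")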
Edit:
In light of Coatless' comments, the Rcpp::cppFunction() in fact negates the need for the c.inline helper here, though the c.namecall is still needed.
src2 <- '
NumericMatrix TestCpp(NumericMatrix xbe, NumericVector g){
    NumericMatrix xbem(xbe);
    int nrows = xbem.nrow();
    NumericVector gv(g);
    for (int i = 1; i < nrows; i++) {
        xbem(i,_) = xbem(i-1,_) * gv[0] + xbem(i,_);
    }
    return xbem;
}
'
clusterCall(cl, Rcpp::cppFunction, code=src2, env=.GlobalEnv)
# Call the function through our wrapper
clusterCall(cl, c.namecall, name="TestCpp", A, 0.5)
I resolved it by sourcing, on each cluster node, an R file with the wanted inline C function:
clusterEvalQ(cl, {
    library(inline)
    invisible(source("your_C_func.R"))
})
And your file your_C_func.R should contain the C function definition:
c_func <- cfunction(...)
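For example (my own hedged illustration, reusing the body of src1 from above rather than the answerer's actual file), your_C_func.R could look like this, so that each node compiles its own local copy when it sources the file:
# your_C_func.R
library(inline)
src1 <- '
Rcpp::NumericMatrix xbem(xbe);
int nrows = xbem.nrow();
Rcpp::NumericVector gv(g);
for (int i = 1; i < nrows; i++) {
    xbem(i,_) = xbem(i-1,_) * gv[0] + xbem(i,_);
}
return xbem;
'
funCPP1 <- cxxfunction(signature(xbe = "numeric", g = "numeric"),
                       body = src1, plugin = "Rcpp")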
