I'm working with large arrays of about 2500x2500x50 (lon x lat x time). The array contains only 1s and 0s. For each timestep, I need the sum of each element's 5x5 neighborhood, i.e. the element and its 24 surrounding elements. So far I have done it roughly this way:
xdim <- 2500
ydim <- 2500
tdim <- 50
a <- array(0:1,dim=c(xdim,ydim,tdim))
res <- array(0:1,dim=c(xdim,ydim,tdim))
for (t in 1:tdim){
for (x in 3:(xdim-2)){
for (y in 3:(ydim-2)){
res[x,y,t] <- sum(a[(x-2):(x+2),(y-2):(y+2),t])
}
}
}
This works, but it is much too slow for my needs. Does anybody have advice on how to speed it up?
Intro
I have to say, a lot is hidden in just the setup of the arrays. The remainder of the problem is trivial, though. As a result, there are really two ways to go about it:
Brute force, as suggested by @Alex (written in C++)
Observing replication patterns
Bruteforce with OpenMP
If we want to 'brute force' it, then we can use the suggestion given by @Alex and employ OpenMP with Armadillo:
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// Add a flag to enable OpenMP at compile time
// [[Rcpp::plugins(openmp)]]
// Protect against compilers without OpenMP
#ifdef _OPENMP
#include <omp.h>
#endif
// [[Rcpp::export]]
arma::cube cube_parallel(arma::cube a, arma::cube res, int cores = 1) {
// Extract the different dimensions
unsigned int tdim = res.n_slices;
unsigned int xdim = res.n_rows;
unsigned int ydim = res.n_cols;
// Same calculation loop
#pragma omp parallel for num_threads(cores)
for (unsigned int t = 0; t < tdim; t++){
// pop the T
arma::mat temp_mat = a.slice(t);
// Subset the rows
for (unsigned int x = 2; x < xdim-2; x++){
arma::mat temp_row_sub = temp_mat.rows(x-2, x+2);
// Iterate over the columns with unit accumulative sum
for (unsigned int y = 2; y < ydim-2; y++){
res(x,y,t) = accu(temp_row_sub.cols(y-2,y+2));
}
}
}
return res;
}
Replication Patterns
However, the smarter approach is understanding how the array(0:1, dims) is being constructed.
Most notably:
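Case 1: If xdim is even, then only the rows alternate: every column, and therefore every slice, is identical.
Case 2: If xdim is odd and ydim is odd, then the entries alternate along both rows and columns (a checkerboard), and consecutive slices alternate as well.
Case 3: If xdim is odd and ydim is even, then the entries form the same checkerboard, but every slice is identical.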
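The three cases follow from how R recycles 0:1 while filling the array in column-major order. Here is a minimal standalone C++ sketch of that rule (the helper name recycled_value is made up for illustration and is not part of the answer's code): counting from zero, element (x, y, t) sits at linear offset x + y*xdim + t*xdim*ydim, and its value is that offset modulo 2.

```cpp
#include <cassert>
#include <cstddef>

// Value that array(0:1, dim = c(xdim, ydim, tdim)) stores at the
// zero-based position (x, y, t). R fills arrays in column-major order,
// so the linear offset is x + y * xdim + t * xdim * ydim, and recycling
// 0:1 means the stored value is that offset modulo 2.
// Hypothetical helper, for illustration only.
int recycled_value(std::size_t x, std::size_t y, std::size_t t,
                   std::size_t xdim, std::size_t ydim) {
    assert(x < xdim && y < ydim);
    std::size_t offset = x + y * xdim + t * xdim * ydim;
    return static_cast<int>(offset % 2);
}
```

When xdim is even, the y and t terms are both even, so the value is just x modulo 2. When xdim is odd and ydim is even, only the slice stride xdim*ydim is even, so slices repeat. When both are odd, the slice stride is odd and consecutive slices flip.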
Examples
Let's see the cases in action to observe the patterns.
Case 1:
xdim <- 2
ydim <- 3
tdim <- 2
a <- array(0:1,dim=c(xdim,ydim,tdim))
Output:
, , 1
[,1] [,2] [,3]
[1,] 0 0 0
[2,] 1 1 1
, , 2
[,1] [,2] [,3]
[1,] 0 0 0
[2,] 1 1 1
Case 2:
xdim <- 3
ydim <- 3
tdim <- 3
a <- array(0:1,dim=c(xdim,ydim,tdim))
Output:
, , 1
[,1] [,2] [,3]
[1,] 0 1 0
[2,] 1 0 1
[3,] 0 1 0
, , 2
[,1] [,2] [,3]
[1,] 1 0 1
[2,] 0 1 0
[3,] 1 0 1
, , 3
[,1] [,2] [,3]
[1,] 0 1 0
[2,] 1 0 1
[3,] 0 1 0
Case 3:
xdim <- 3
ydim <- 4
tdim <- 2
a <- array(0:1,dim=c(xdim,ydim,tdim))
Output:
, , 1
[,1] [,2] [,3] [,4]
[1,] 0 1 0 1
[2,] 1 0 1 0
[3,] 0 1 0 1
, , 2
[,1] [,2] [,3] [,4]
[1,] 0 1 0 1
[2,] 1 0 1 0
[3,] 0 1 0 1
Pattern Hacking
Alrighty, based on the above discussion, we opt to write a bit of code that exploits this unique pattern.
Creating Alternating Vectors
An alternating vector in this case switches between two different values.
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// ------- Make Alternating Vectors
arma::vec odd_vec(unsigned int xdim){
// make a temporary vector to create alternating 0-1 effect by row.
arma::vec temp_vec(xdim);
// Alternating vector (anyone have a better solution? )
for (unsigned int i = 0; i < xdim; i++) {
temp_vec(i) = (i % 2 ? 0 : 1);
}
return temp_vec;
}
arma::vec even_vec(unsigned int xdim){
// make a temporary vector to create alternating 0-1 effect by row.
arma::vec temp_vec(xdim);
// Alternating vector (anyone have a better solution? )
for (unsigned int i = 0; i < xdim; i++) {
temp_vec(i) = (i % 2 ? 1 : 0); // changed
}
return temp_vec;
}
Creating the three cases of matrix
As mentioned above, there are three cases of matrix. The even, first odd, and second odd cases.
// --- Handle the different cases
// [[Rcpp::export]]
arma::mat make_even_matrix(unsigned int xdim, unsigned int ydim){
arma::mat temp_mat(xdim,ydim);
temp_mat.each_col() = even_vec(xdim);
return temp_mat;
}
// xdim is odd and ydim is even
// [[Rcpp::export]]
arma::mat make_odd_matrix_case1(unsigned int xdim, unsigned int ydim){
arma::mat temp_mat(xdim,ydim);
arma::vec e_vec = even_vec(xdim);
arma::vec o_vec = odd_vec(xdim);
// Alternating column
for (unsigned int i = 0; i < ydim; i++) {
temp_mat.col(i) = (i % 2 ? o_vec : e_vec);
}
return temp_mat;
}
// xdim is odd and ydim is odd
// [[Rcpp::export]]
arma::mat make_odd_matrix_case2(unsigned int xdim, unsigned int ydim){
arma::mat temp_mat(xdim,ydim);
arma::vec e_vec = even_vec(xdim);
arma::vec o_vec = odd_vec(xdim);
// Alternating column
for (unsigned int i = 0; i < ydim; i++) {
temp_mat.col(i) = (i % 2 ? e_vec : o_vec); // slight change
}
return temp_mat;
}
Calculation Engine
Same as the previous solution, just without the t as we no longer need to repeat calculations.
// --- Calculation engine
// [[Rcpp::export]]
arma::mat calc_matrix(arma::mat temp_mat){
unsigned int xdim = temp_mat.n_rows;
unsigned int ydim = temp_mat.n_cols;
arma::mat res = temp_mat;
// Subset the rows
for (unsigned int x = 2; x < xdim-2; x++){
arma::mat temp_row_sub = temp_mat.rows(x-2, x+2);
// Iterate over the columns with unit accumulative sum
for (unsigned int y = 2; y < ydim-2; y++){
res(x,y) = accu(temp_row_sub.cols(y-2,y+2));
}
}
return res;
}
Call Main Function
Here is the core function that pieces everything together. This gives us the desired arrays of neighborhood sums.
// --- Main Engine
// Create the desired cube information
// [[Rcpp::export]]
arma::cube dim_to_cube(unsigned int xdim = 4, unsigned int ydim = 4, unsigned int tdim = 3) {
// Initialize values in A
arma::cube res(xdim,ydim,tdim);
if(xdim % 2 == 0){
res.each_slice() = calc_matrix(make_even_matrix(xdim, ydim));
}else{
if(ydim % 2 == 0){
res.each_slice() = calc_matrix(make_odd_matrix_case1(xdim, ydim));
}else{
arma::mat first_odd_mat = calc_matrix(make_odd_matrix_case1(xdim, ydim));
arma::mat sec_odd_mat = calc_matrix(make_odd_matrix_case2(xdim, ydim));
for(unsigned int t = 0; t < tdim; t++){
res.slice(t) = (t % 2 ? sec_odd_mat : first_odd_mat);
}
}
}
return res;
}
Timing
Now, the real test is how well this performs:
Unit: microseconds
expr min lq mean median uq max neval
r_1core 3538.022 3825.8105 4301.84107 3957.3765 4043.0085 16856.865 100
alex_1core 2790.515 2984.7180 3461.11021 3076.9265 3189.7890 15371.406 100
cpp_1core 174.508 180.7190 197.29728 194.1480 204.8875 338.510 100
cpp_2core 111.960 116.0040 126.34508 122.7375 136.2285 162.279 100
cpp_3core 81.619 88.4485 104.54602 94.8735 108.5515 204.979 100
cpp_cache 40.637 44.3440 55.08915 52.1030 60.2290 302.306 100
Script used for timing:
# Set up the inputs before they are used
xdim <- 20
ydim <- 20
tdim <- 5
a <- array(0:1,dim=c(xdim,ydim,tdim))
res <- array(0:1,dim=c(xdim,ydim,tdim))
# Check that all implementations agree
cpp_parallel = cube_parallel(a,res, 1)
alex_1core = alex(a,res,xdim,ydim,tdim)
cpp_cache = dim_to_cube(xdim,ydim,tdim)
op_answer = cube_r(a,res,xdim,ydim,tdim)
all.equal(cpp_parallel, op_answer)
all.equal(cpp_cache, op_answer)
all.equal(alex_1core, op_answer)
ga = microbenchmark::microbenchmark(r_1core = cube_r(a,res,xdim,ydim,tdim),
alex_1core = alex(a,res,xdim,ydim,tdim),
cpp_1core = cube_parallel(a,res, 1),
cpp_2core = cube_parallel(a,res, 2),
cpp_3core = cube_parallel(a,res, 3),
cpp_cache = dim_to_cube(xdim,ydim,tdim))
Here's one solution that's fast for a large array:
res <- apply(a, 3, function(a) t(filter(t(filter(a, rep(1, 5), circular=TRUE)), rep(1, 5), circular=TRUE)))
dim(res) <- c(xdim, ydim, tdim)
I filtered the array using rep(1,5) as the weights (i.e. sum values within a neighborhood of 2) along each dimension. I then modified the dim attribute since it initially comes out as a matrix.
Note that this wraps the sum around at the edges of the array (which might make sense since you're looking at latitude and longitude; if not, I can modify my answer).
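The reason two one-dimensional filters are enough is that a 5x5 box sum is separable: summing over rows first and then over columns gives the same result as summing the whole neighborhood at once. A minimal standalone C++ sketch of that idea with wrap-around edges follows; the function names are made up for illustration and this is not how stats::filter is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// Circular moving sum of width 2 * r + 1 down each column (over rows).
Matrix circular_sum_over_rows(const Matrix& m, std::size_t r) {
    assert(!m.empty() && r <= m.size());
    const std::size_t nr = m.size(), nc = m[0].size();
    Matrix out(nr, std::vector<int>(nc, 0));
    for (std::size_t i = 0; i < nr; ++i)
        for (std::size_t j = 0; j < nc; ++j)
            for (std::size_t d = 0; d <= 2 * r; ++d)
                out[i][j] += m[(i + nr + d - r) % nr][j];  // wrap at the edges
    return out;
}

// Circular moving sum of width 2 * r + 1 along each row (over columns).
Matrix circular_sum_over_cols(const Matrix& m, std::size_t r) {
    assert(!m.empty() && r <= m[0].size());
    const std::size_t nr = m.size(), nc = m[0].size();
    Matrix out(nr, std::vector<int>(nc, 0));
    for (std::size_t i = 0; i < nr; ++i)
        for (std::size_t j = 0; j < nc; ++j)
            for (std::size_t d = 0; d <= 2 * r; ++d)
                out[i][j] += m[i][(j + nc + d - r) % nc];  // wrap at the edges
    return out;
}

// Two 1-D passes give the full (2r+1) x (2r+1) neighbourhood sum.
Matrix circular_box_sum(const Matrix& m, std::size_t r) {
    return circular_sum_over_cols(circular_sum_over_rows(m, r), r);
}
```

Each pass costs 2r+1 additions per cell, so the separable version does 10 additions per cell for r = 2 instead of 25.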
For a concrete example:
xdim <- 500
ydim <- 500
tdim <- 15
a <- array(0:1,dim=c(xdim,ydim,tdim))
and here's what you're currently using (with NAs at the edges) and how long this example takes on my laptop:
f1 <- function(a, xdim, ydim, tdim){
res <- array(NA_integer_,dim=c(xdim,ydim,tdim))
for (t in 1:tdim){
for (x in 3:(xdim-2)){
for (y in 3:(ydim-2)){
res[x,y,t] <- sum(a[(x-2):(x+2),(y-2):(y+2),t])
}
}
}
return(res)
}
system.time(res1 <- f1(a, xdim, ydim, tdim))
# user system elapsed
# 14.813 0.005 14.819
And here's a comparison with the version I described:
f2 <- function(a, xdim, ydim, tdim){
res <- apply(a, 3, function(a) t(filter(t(filter(a, rep(1, 5), circular=TRUE)), rep(1, 5), circular=TRUE)))
dim(res) <- c(xdim, ydim, tdim)
return(res)
}
system.time(res2 <- f2(a, xdim, ydim, tdim))
# user system elapsed
# 1.188 0.047 1.236
You can see there's a significant speed boost (for large arrays). And to check that it's giving the correct solution (note that I'm adding NAs so both results match, since the one I gave filters in a circular manner):
## Match NAs
res2NA <- ifelse(is.na(res1), NA, res2)
all.equal(res2NA, res1)
# [1] TRUE
I'll add that your full array (2500x2500x50) took just under a minute (about 55 seconds), although it did use a lot of memory in the process, FYI.
Your current code has a lot of overhead from redundant subsetting and calculation. Clean this up if you want better speed.
At xdim <- ydim <- 20; tdim <- 5, I see a 23% speedup on my machine.
At xdim <- ydim <- 200; tdim <- 10, I see a 25% speedup.
This comes at a small cost of additional memory, as is clear from the code below.
xdim <- ydim <- 20; tdim <- 5
a <- array(0:1,dim=c(xdim,ydim,tdim))
res <- array(0:1,dim=c(xdim,ydim,tdim))
microbenchmark(op= {
for (t in 1:tdim){
for (x in 3:(xdim-2)){
for (y in 3:(ydim-2)){
res[x,y,t] <- sum(a[(x-2):(x+2),(y-2):(y+2),t])
}
}
}
},
alex= {
for (t in 1:tdim){
temp <- a[,,t]
for (x in 3:(xdim-2)){
temp2 <- temp[(x-2):(x+2),]
for (y in 3:(ydim-2)){
res[x,y,t] <- sum(temp2[,(y-2):(y+2)])
}
}
}
}, times = 50)
Unit: milliseconds
expr min lq mean median uq max neval cld
op 4.855827 5.134845 5.474327 5.321681 5.626738 7.463923 50 b
alex 3.720368 3.915756 4.213355 4.012120 4.348729 6.320481 50 a
Further improvements:
If you write this in C++, my guess is that recognizing res[x,y,t] = res[x,y-1,t] - sum(a[(x-2):(x+2),y-3,t]) + sum(a[(x-2):(x+2),y+2,t]) will save you additional time (moving the window one step in y drops column y-3 and adds column y+2). In R, it did not in my timing tests.
This problem is also embarrassingly parallel. There's no reason you couldn't split the t dimension to make more use of a multi-core architecture.
Both of these are left to the reader / OP.
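For what it's worth, here is a minimal sketch of the sliding-window idea in standalone C++ (plain std::vector rather than Rcpp or Armadillo; the names window_sums and band_column_sum are made up). Border cells are simply left at 0, which is an assumption rather than what the original loop does.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// Sum of a[x-2..x+2][y]: one column of the 5-row band centred on row x.
int band_column_sum(const Matrix& a, std::size_t x, std::size_t y) {
    int s = 0;
    for (std::size_t i = x - 2; i <= x + 2; ++i) s += a[i][y];
    return s;
}

// 5x5 neighbourhood sums for interior cells; border cells stay 0.
Matrix window_sums(const Matrix& a) {
    assert(a.size() >= 5 && a[0].size() >= 5);
    const std::size_t nr = a.size(), nc = a[0].size();
    Matrix res(nr, std::vector<int>(nc, 0));
    for (std::size_t x = 2; x + 2 < nr; ++x) {
        // First window of the row (columns 0..4) is computed directly.
        int s = 0;
        for (std::size_t y = 0; y < 5; ++y) s += band_column_sum(a, x, y);
        res[x][2] = s;
        // Slide right: add the entering column y+2, drop the leaving column y-3.
        for (std::size_t y = 3; y + 2 < nc; ++y) {
            s += band_column_sum(a, x, y + 2) - band_column_sum(a, x, y - 3);
            res[x][y] = s;
        }
    }
    return res;
}
```

Each step touches 10 elements instead of 25. Whether that wins in practice depends on memory layout and the compiler, so it is worth timing.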
Related
I am trying to create a matrix with random numbers where the rowSums should exactly be 1.
I already have a condition which checks if the rowSums is not 1 and tries to correct it.
When I print out the result it looks correct but if I test if all values are 1 it gives me some FALSE values.
How can I correct that?
library(Rcpp)
cppFunction('
NumericMatrix imembrandc(int n, int k) {
NumericMatrix u( n , k );
IntegerVector sequ = seq(1,100);
NumericVector sampled;
for (int i=0; i < k; ++i) {
sampled = sample(sequ, n);
u(_,i) = sampled / sum(sampled);
}
if (is_true(any(rowSums(u) != 1))) {
u(_,1) = u(_,1) + (1 - rowSums(u));
}
return(u);
}')
When I print out the rowSums of the result it looks correct:
res = imembrandc(n = 10, k = 5)
rowSums(res)
[1] 1 1 1 1 1 1 1 1 1 1
But checking it gives some FALSEs:
rowSums(res) == 1
[1] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE TRUE
The canonical way to generate n random numbers that sum to 1 is to generate n - 1 values from [0,1), add 0 and 1 to the list and take the difference of the sorted list. Of course, this depends on the distribution you want for the random numbers. This can be expressed in R as
set.seed(42)
v <- diff(sort(c(0, runif(5), 1)))
v
#> [1] 0.28613953 0.35560598 0.18870211 0.08435842 0.02226937 0.06292459
sum(v)
#> [1] 1
Created on 2019-05-24 by the reprex package (v0.2.1)
In your case in C++:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericMatrix imembrandc(int n, int k) {
NumericMatrix u(n, k);
for (int i = 0; i < n; ++i) {
NumericVector row = runif(k - 1);
row.push_back(0.0);
row.push_back(1.0);
u(i, _) = diff(row.sort());
}
return u;
}
/*** R
set.seed(42)
res = imembrandc(n = 10, k = 5)
rowSums(res)
rowSums(res) == 1
all.equal(rowSums(res),rep(1, nrow(res)))
*/
Note that I am generating rows to begin with, while you were generating columns and then tried to correct the rowSum. Output:
> set.seed(42)
> res = imembrandc(n = 10, k = 5)
> rowSums(res)
[1] 1 1 1 1 1 1 1 1 1 1
> rowSums(res) == 1
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
> all.equal(rowSums(res),rep(1, nrow(res)))
[1] TRUE
BTW, all.equal gives TRUE also for your matrix, since the difference is really small. But I find it better to avoid the problem from the beginning.
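If you do end up with sums that are only approximately 1, the usual approach is to compare with a tolerance instead of ==. A minimal standalone C++ sketch follows; the helper sums_to_one is made up, and the default tolerance of 1.5e-8 is chosen to match all.equal's default, which is an assumption about what is appropriate for your data.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// True when the elements of v sum to 1 within an absolute tolerance.
// Hypothetical helper; the default tolerance mirrors all.equal's 1.5e-8.
bool sums_to_one(const std::vector<double>& v, double tol = 1.5e-8) {
    assert(tol >= 0.0);
    double s = 0.0;
    for (double x : v) s += x;
    return std::fabs(s - 1.0) <= tol;
}
```

For example, ten copies of 0.1 may not add up to exactly 1.0 in double precision, but they pass this check.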
Using Rcpp/Armadillo, how can I efficiently extract/replace off-diagonal values of a square matrix? In R it can be achieved using: old_values = A[row(A) == (col(A) - k)]; A[row(A) == (col(A) - k)] = new_values. Using Armadillo, a for loop (see below) can be used to achieve this goal. But is there a simpler way to write the code? As I need to do this operation for all k of a big matrix (>10000 rows, >10000 columns), efficiency matters. Here is a reproducible example:
A = matrix(1:25, 5, 5)
A[row(A) == (col(A) - 3)] # extract the 3rd off-diagonal values
A[row(A) == (col(A) - 2)] = -5 # replace the 2nd off-diagonal values with -5
The cpp code using a for loop:
arma::vec retrieve_off_diag_values( arma::mat A, unsigned k )
{
unsigned n_cols = A.n_cols;
arma::vec off_diag_values(n_cols - k);
for( unsigned i=0; i <(n_cols - k); i++ )
{
off_diag_values(i) = A(i, i+k);
}
return off_diag_values;
}
To extract values from a specified diagonal into a vector, where k < 0 indicates subdiagonals, k = 0 indicates the main diagonal, and k > 0 indicates superdiagonals:
#include<RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
arma::vec diag_get(const arma::mat& X, int k) // note the 'const' and '&'
{
return X.diag(k);
}
To set values on a diagonal to a specific value:
// [[Rcpp::export]]
void diag_fill(arma::mat& X, int k, double value) // note the '&' character
{
X.diag(k).fill(value);
}
To change instances of a specific value on a diagonal with another value:
// [[Rcpp::export]]
void diag_change(arma::mat& X, int k, double old_value, double new_value)
{
X.diag(k).replace(old_value, new_value);
}
Retrieving off-diagonal
You can use Armadillo's .diag() member function with an index k to retrieve an off-diagonal.
Cases:
if k == 0 (default), then the main diagonal.
else if k < 0, then a subdiagonal (below the main diagonal).
else a superdiagonal (above the main diagonal).
Example:
#include<RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
arma::vec offdiag_extract(arma::mat& A, int k) {
return A.diag(k);
}
Test:
A = matrix(1:25, 5, 5)
offdiag_extract(A, 3)
# [,1]
# [1,] 16
# [2,] 22
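To see which elements .diag(k) refers to, it can help to spell out the index arithmetic by hand. The following is a minimal standalone C++ sketch (the function diag_k is made up and is not Armadillo's implementation) that pulls the k-th diagonal out of a matrix stored in column-major order, the layout that both R and Armadillo use.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Extract the k-th diagonal of an n_rows x n_cols matrix stored in
// column-major order. k > 0: superdiagonal, k < 0: subdiagonal,
// k == 0: main diagonal. Hypothetical helper, for illustration only.
std::vector<double> diag_k(const std::vector<double>& a,
                           std::size_t n_rows, std::size_t n_cols, int k) {
    assert(a.size() == n_rows * n_cols);
    const std::size_t row_off = k < 0 ? static_cast<std::size_t>(-k) : 0;
    const std::size_t col_off = k > 0 ? static_cast<std::size_t>(k) : 0;
    assert(row_off < n_rows && col_off < n_cols);
    const std::size_t len = std::min(n_rows - row_off, n_cols - col_off);
    std::vector<double> out(len);
    for (std::size_t i = 0; i < len; ++i) {
        // Element (row, col) lives at offset row + col * n_rows.
        out[i] = a[(i + row_off) + (i + col_off) * n_rows];
    }
    return out;
}
```

With the 5x5 matrix holding 1 to 25 in column-major order, k = 3 gives the two values 16 and 22, matching the test above.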
Replacing off-diagonal
Edit: This section has been updated thanks to @mtall's point regarding behavior provided by other member functions.
However, .diag()= can only be used to save into the main diagonal. To make the replacement work for all diagonals, you would need to chain the .diag() member function with .fill(value), e.g.
#include<RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
arma::mat offdiag_fill_arma(arma::mat& A, int k, double replace_value) {
A.diag(k).fill(replace_value);
return A;
}
Test:
offdiag_fill_arma(A, 2, 4)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 1 6 4 16 21
# [2,] 2 7 12 4 22
# [3,] 3 8 13 18 4
# [4,] 4 9 14 19 24
# [5,] 5 10 15 20 25
Implementing off-diagonal replacement
In short, one could implement off-diagonal replacement using a single for loop with the appropriate k offset.
#include<RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
arma::mat offdiag_replace(arma::mat& A, int k, double replace_val = -5) {
// Determine whether to go over upper or lower diagonal
unsigned int row_offset = (k < 0) ? -k : 0;
unsigned int col_offset = (k > 0) ? k : 0;
// Compute total number of elements
unsigned int N = std::min(A.n_rows - row_offset, A.n_cols - col_offset);
// Loop over diagonal
for(unsigned int i = 0; i < N; ++i) {
unsigned int row = i + row_offset;
unsigned int col = i + col_offset;
// Disregard bounds checks with .at()
A.at(row,col) = replace_val;
}
return A;
}
Test:
offdiag_replace(A, 2, 4)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 1 6 4 16 21
# [2,] 2 7 12 4 22
# [3,] 3 8 13 18 4
# [4,] 4 9 14 19 24
# [5,] 5 10 15 20 25
Is there a more effective way to match matrix rows when using large matrices?
I have a vector with values that correspond to a matrix of 2^N rows. N is typically large e.g., >20. Each row is a unique combination of N={0,1} values and represents a 'position' on a decision space. I.e., for N=3 the rows would be
0 0 0,
0 0 1,
0 1 0,
1 0 0,
...,
1 1 1
I need to determine whether a position is a local maximum, i.e., whether the N neighboring positions are of lower values. For example, to the position 0 0 0, the neighboring positions are 1 0 0, 0 1 0, and 0 0 1, accordingly.
I have coded the following solution that does the job but very slowly for large N.
library(prodlim) #for row.match command
set.seed(1234)
N=10
space = as.matrix(expand.grid(rep(list(0:1), N))) #creates all combinations of 0-1 along N-dimensions
performance = replicate(2^N, runif(1, min=0, max=1)) #corresponding values for each space-row (position)
#determine whether a space position is a local maxima, that is, the N neighboring positions are smaller in performance value
system.time({
local_peaks_pos = matrix(NA,nrow=2^N, ncol=1)
for(v in 1:2^N)
{
for(q in 1:N)
{
temp_local_pos = space[v,1:N]
temp_local_pos[q] = abs(temp_local_pos[q]-1)
if(performance[row.match(temp_local_pos[1:N], space[,1:N])] > performance[v])
{
local_peaks_pos[v,1] = 0
break
}
}
}
local_peaks_pos[is.na(local_peaks_pos)] = 1
})
user system elapsed
9.94 0.05 10.06
As Gabe mentioned in his comment, you can exploit the fact that your decision space can be interpreted as single integers:
set.seed(1234L)
N <- 10L
performance <- runif(2^N)
powers_of_two <- as.integer(rev(2L ^ (0L:(N - 1L))))
is_local_max <- sapply(0L:(2^N - 1), function(i) {
multipliers <- as.integer(rev(intToBits(i)[1L:N])) * -1L
multipliers[multipliers == 0L] <- 1L
neighbors <- i + powers_of_two * multipliers
# compensate that R vectors are 1-indexed
!any(performance[neighbors + 1L] > performance[i + 1L])
})
# compensate again
local_peaks_int <- which(is_local_max) - 1L
local_peaks_binary <- t(sapply(local_peaks_int, function(int) {
as.integer(rev(intToBits(int)[1L:N]))
}))
> head(local_peaks_binary)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 0 0 0 0 0 0 0 1 0 0
[2,] 0 0 0 0 1 0 0 1 1 0
[3,] 0 0 0 0 1 1 1 1 0 0
[4,] 0 0 0 1 0 0 0 1 1 1
[5,] 0 0 0 1 0 1 0 1 0 1
[6,] 0 0 0 1 1 0 1 1 1 0
In decimal, multipliers contains the sign of the powers_of_two so that, when added to the current integer, it represents a bit flip in binary. For example, if the original binary was 0 0 and we flip one bit to get 1 0, it's as if we added 2^1 in decimal, but if it was originally 1 0 and we flip one bit to get 0 0, then we subtracted 2^1 in decimal.
Each row in local_peaks_binary is a binary from your decision space, where the least significant bit is on the right. So, for example, the first local peak is a decimal 4.
See this question for the mapping of integers to binary.
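The add-or-subtract bookkeeping is equivalent to a bitwise XOR with a single-bit mask. A minimal standalone C++ sketch of enumerating the neighbors that way (the function name neighbors is made up for illustration):

```cpp
#include <cassert>
#include <vector>

// All N neighbours of position i: each differs from i in exactly one bit.
// Hypothetical helper, for illustration only.
std::vector<unsigned int> neighbors(unsigned int i, int N) {
    assert(N > 0 && N < 32);
    std::vector<unsigned int> out;
    out.reserve(static_cast<std::size_t>(N));
    for (int q = 0; q < N; ++q) {
        const unsigned int bit = 1u << q;
        // XOR flips bit q: this adds 2^q when the bit was 0
        // and subtracts 2^q when the bit was 1.
        out.push_back(i ^ bit);
    }
    return out;
}
```

For position 2 (binary 1 0) with N = 2, flipping bit 0 gives 3 and flipping bit 1 gives 0, which is the "subtract 2^1" case described above.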
EDIT: and if you want to do it in parallel:
library(doParallel)
set.seed(1234L)
N <- 20L
performance <- runif(2^N)
powers_of_two <- as.integer(rev(2 ^ (0:(N - 1))))
num_cores <- detectCores()
workers <- makeCluster(num_cores)
registerDoParallel(workers)
chunks <- splitIndices(length(performance), num_cores)
chunks <- lapply(chunks, "-", 1L)
local_peaks_int <- foreach(chunk=chunks, .combine=c, .multicombine=TRUE) %dopar% {
is_local_max <- sapply(chunk, function(i) {
multipliers <- as.integer(rev(intToBits(i)[1L:N])) * -1L
multipliers[multipliers == 0L] <- 1L
neighbors <- i + powers_of_two * multipliers
# compensate that R vectors are 1-indexed
!any(performance[neighbors + 1L] > performance[i + 1L])
})
# return
chunk[is_local_max]
}
local_peaks_binary <- t(sapply(local_peaks_int, function(int) {
as.integer(rev(intToBits(int)[1L:N]))
}))
stopCluster(workers); registerDoSEQ()
The above completes in ~2.5 seconds in my system, which has 4 CPU cores.
Here is a C++ version that uses multi-threading but, at least in my system with 4 threads, it doesn't seem faster than Gabe's Fortran version. However, when I try to run Gabe's Fortran code in a new session, I get the following error with N <- 29L: cannot allocate vector of size 4.0 Gb.
EDIT: Apparently I changed something important along the way, because after testing again, the C++ version actually seems faster.
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::depends(RcppParallel)]]
#include <cstddef> // size_t
#include <vector>
#include <Rcpp.h>
#include <RcppParallel.h>
using namespace std;
using namespace Rcpp;
using namespace RcppParallel;
class PeakFinder : public Worker
{
public:
PeakFinder(const NumericVector& performance, vector<int>& peaks, const int N)
: performance_(performance)
, peaks_(peaks)
, N_(N)
{ }
void operator()(size_t begin, size_t end) {
vector<int> peaks;
for (size_t i = begin; i < end; i++) {
bool is_local_peak = true;
unsigned int mask = 1;
for (int exponent = 0; exponent < N_; exponent++) {
unsigned int neighbor = static_cast<unsigned int>(i) ^ mask; // bitwise XOR
if (performance_[i] < performance_[neighbor]) {
is_local_peak = false;
break;
}
mask <<= 1;
}
if (is_local_peak)
peaks.push_back(static_cast<int>(i));
}
mutex_.lock();
peaks_.insert(peaks_.end(), peaks.begin(), peaks.end());
mutex_.unlock();
}
private:
const RVector<double> performance_;
vector<int>& peaks_;
const int N_;
tthread::mutex mutex_;
};
// [[Rcpp::export]]
IntegerVector local_peaks(const NumericVector& performance, const int N) {
vector<int> peaks;
PeakFinder peak_finder(performance, peaks, N);
// each thread call will check at least 1024 values
parallelFor(0, performance.length(), peak_finder, 1024);
IntegerVector result(peaks.size());
int i = 0;
for (int peak : peaks) {
result[i++] = peak;
}
return result;
}
After saving the C++ code in local-peaks.cpp,
this code:
library(Rcpp)
library(RcppParallel)
sourceCpp("local-peaks.cpp")
set.seed(1234L)
N <- 29L
performance <- runif(2^N)
system.time({
local_peaks_int <- local_peaks(performance, N)
})
finished in ~2 seconds
(without considering the time required to allocate performance).
If you do need the binary representation,
you can change local_peaks like this
(see this question):
// [[Rcpp::export]]
IntegerMatrix local_peaks(const NumericVector& performance, const int N) {
vector<int> peaks;
PeakFinder peak_finder(performance, peaks, N);
// each thread call will check at least 1024 values
parallelFor(0, performance.length(), peak_finder, 1024);
// in case you want the same order every time, #include <algorithm> and uncomment next line
// sort(peaks.begin(), peaks.end());
IntegerMatrix result(peaks.size(), N);
int i = 0;
for (int peak : peaks) {
for (int j = 0; j < N; j++) {
result(i, N - j - 1) = peak & 1;
peak >>= 1;
}
i++;
}
return result;
}
Here is one solution that follows the same general structure as your example code. intToBits and packBits map to and from the binary representation for each integer (subtracting one to start at zero). The inner loop flips each of the N bits to get the neighbors. On my laptop, this runs in a fraction of a second for N=10 and around a minute for N=20. The commented code stores some information from neighbors already tested so as to not redo the calculation. Uncommenting those lines makes it run in about 35 seconds for N=20.
loc_max <- rep(1, 2^N)
for (v in 1:2^N){
## if (loc_max[v] == 0) next
vbits <- intToBits(v-1)
for (q in 1:N){
tmp <- vbits
tmp[q] <- !vbits[q]
pos <- packBits(tmp, type = "integer") + 1
if (performance[pos] > performance[v]){
loc_max[v] <- 0
break
## } else {
## loc_max[pos] <- 0
}
}
}
identical(loc_max, local_peaks_pos[, 1])
## [1] TRUE
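The pruning idea behind the commented lines can be sketched in standalone C++ as well. This is a variation, not a translation: it uses a strict comparison when ruling out a neighbor so that ties are treated the same way as in the main test, and the name local_maxima is made up.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// is_max[v] == 1 when no single-bit-flip neighbour of v has a strictly
// larger performance value. Hypothetical helper, for illustration only.
std::vector<int> local_maxima(const std::vector<double>& perf, int N) {
    assert(N > 0 && N < 31);
    assert(perf.size() == (std::size_t(1) << N));
    std::vector<int> is_max(perf.size(), 1);
    for (std::size_t v = 0; v < perf.size(); ++v) {
        if (!is_max[v]) continue;  // already ruled out by a larger neighbour
        for (int q = 0; q < N; ++q) {
            const std::size_t pos = v ^ (std::size_t(1) << q);
            if (perf[pos] > perf[v]) {
                is_max[v] = 0;     // a neighbour is larger: v is not a peak
                break;
            }
            if (perf[pos] < perf[v]) {
                is_max[pos] = 0;   // v is larger: pos cannot be a peak either
            }
        }
    }
    return is_max;
}
```

Positions that were already beaten by an earlier neighbor are skipped entirely, which is where the saving comes from.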
EDIT:
It sounds like you need every bit of speed possible, so here's another suggestion that relies on compiled code to run significantly faster than my first example. A fraction of a second for N=20 and a bit under 20 seconds for N=29 (the largest example I could fit in my laptop's RAM).
This is using a single core; you could either parallelize this, or alternatively run it in a single core and parallelize your Monte Carlo simulations instead.
library(inline)
loopcode <-
" integer v, q, pos
do v = 0, (2**N)-1
do q = 0, N-1
if ( btest(v,q) ) then
pos = ibclr(v, q)
else
pos = ibset(v, q)
end if
if (performance(pos) > performance(v)) then
loc_max(v) = 0
exit
end if
end do
end do
"
loopfun <- cfunction(sig = signature(performance="numeric", loc_max="integer", n="integer"),
dim=c("(0:(2**n-1))", "(0:(2**n-1))", ""),
loopcode,
language="F95")
N <- 20
performance = runif(2^N, min=0, max=1)
system.time({
floop <- loopfun(performance, rep(1, 2^N), N)
})
## user system elapsed
## 0.049 0.003 0.052
N <- 29
performance = runif(2^N, min=0, max=1)
system.time({
floop <- loopfun(performance, rep(1, 2^N), N)
})
## user system elapsed
## 17.892 1.848 19.741
I don't think pre-computing the neighbors would help much here since I'd guess the comparisons accessing different sections of such a large array are the most time consuming part.
I have a function where I need to make a table (tab), then change one value: the value where tab.names() == k, where k is given in the function call.
Looking at http://dirk.eddelbuettel.com/code/rcpp/Rcpp-quickref.pdf, I've hoped that the following code would work (replacing "foo" with a variable name), but I guess that requires the element name to be static, and mine won't be. I've tried using which but that won't compile (invalid conversion from 'char' to 'Rcpp::traits::storage_type<16>::type {aka SEXPREC*}'), so I'm doing something wrong there.
#include <RcppArmadillo.h>
#include <algorithm>
//[[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector fun(const arma::vec& assignment, int k) {
// count number of peptides per protein
IntegerVector tab = table(as<IntegerVector>(wrap(assignment)));
CharacterVector all_proteins = tab.names();
char kc = '0' + k;
// what I need a working version of:
tab(kc) = 1; // gets ignored, as does a [] version of the same thing.
// or
tab('0' + k) = 1; // also ignored
int ki = which(all_proteins == kc); // gives me compile errors
// extra credit
// tab.names(k-1) = "-1";
return tab;
}
/*** R
set.seed(23)
x <- rpois(20, 5)
k <- 5
fun(x, k)
# same thing in R:
expected_output <- table(x)
expected_output # before modification
# x
# 3 4 5 6 7 9 10 12
# 2 4 3 3 4 2 1 1
expected_output[as.character(k)] <- 1 # this is what I need help with
expected_output
# x
# 3 4 5 6 7 9 10 12
# 2 4 1 3 4 2 1 1
# extra credit:
names(expected_output)[names(expected_output) == as.character(k)] <- -1
*/
I'm still learning rcpp, and more importantly, still learning how to read the manual pages and plug in the right search terms into google/stackoverflow. I'm sure this is basic stuff (and I'm open to better methods - I currently think like an R programmer in terms of initial approaches to problems, not a C++ programmer.)
(BTW - arma::vec is used in other parts of the code which I'm not showing for simplicity - I realize it's not useful here. I debated on switching it, but decided against it on the principle that I've tested that part, it works, and the last thing I want to do is introduce an extra bug...)
Thanks!
You can use the .findName() method to get the relevant index:
#include <RcppArmadillo.h>
#include <algorithm>
//[[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector fun(const arma::vec& assignment, int k) {
// count number of peptides per protein
IntegerVector tab = table(as<IntegerVector>(wrap(assignment)));
CharacterVector all_proteins = tab.names();
// std::to_string also handles k with more than one digit (needs C++11)
int index = tab.findName(std::to_string(k));
tab(index) = 1;
all_proteins(index) = "-1";
tab.names() = all_proteins;
return tab;
}
/*** R
set.seed(23)
x <- rpois(20, 5)
k <- 5
fun(x, k)
*/
Output:
> Rcpp::sourceCpp('table-name.cpp')
> set.seed(23)
> x <- rpois(20, 5)
> k <- 5
> fun(x, k)
3 4 -1 6 7 9 10 12
2 4 1 3 4 2 1 1
You could write your own function (use String instead of char):
int first_which_equal(const CharacterVector& x, String y) {
int n = x.size();
for (int i = 0; i < n; i++) {
if (x[i] == y) return(i);
}
return -1;
}
Also, it seems that tab(kc) is converting kc to an integer representation.
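To illustrate that last point outside of Rcpp: a char is an integer type in C++, so using it where an index is expected yields its character code, not a name lookup. A minimal standalone sketch (both helper names are made up, and the value 53 assumes an ASCII-compatible character set):

```cpp
#include <cassert>
#include <string>

// What an integer-indexed container sees when handed the char kc:
// the character code, not the digit it displays.
int as_index(char kc) { return static_cast<int>(kc); }

// A name key for k that works for any non-negative k, not just one digit.
std::string as_name(int k) { return std::to_string(k); }
```

So for k = 5, '0' + k used as an index means position 53, while a lookup by name needs the string "5".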
I'm a newbie to C++ and Rcpp. Suppose I have a vector
t1<-c(1,2,NA,NA,3,4,1,NA,5)
and I want to get the indices of the elements of t1 that are NA. I can write:
NumericVector retIdxNA(NumericVector x) {
// Step 1: get the positions of NA in the vector
LogicalVector y=is_na(x);
// Step 2: count the number of NA
int Cnt=0;
for (int i=0;i<x.size();i++) {
if (y[i]) {
Cnt++;
}
}
// Step 3: create an output vector whose length equals the number of NAs
// and return the answer
NumericVector retIdx(Cnt);
int Cnt1=0;
for (int i=0;i<x.size();i++) {
if (y[i]) {
retIdx[Cnt1]=i+1;
Cnt1++;
}
}
return retIdx;
}
then I get
retIdxNA(t1)
[1] 3 4 8
I was wondering:
(i) is there any equivalent of which in Rcpp?
(ii) is there any way to make the above function shorter/crisper? In particular, is there any easy way to combine the Step 1, 2, 3 above?
Recent versions of RcppArmadillo have functions to identify the indices of finite and non-finite values.
So this code
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
arma::uvec whichNA(arma::vec x) {
return arma::find_nonfinite(x);
}
/*** R
t1 <- c(1,2,NA,NA,3,4,1,NA,5)
whichNA(t1)
*/
yields your desired answer (modulo the off-by-one, as C/C++ indices are zero-based):
R> sourceCpp("/tmp/uday.cpp")
R> t1 <- c(1,2,NA,NA,3,4,1,NA,5)
R> whichNA(t1)
[,1]
[1,] 2
[2,] 3
[3,] 7
R>
Rcpp can do it too if you first create the sequence to subset into:
// [[Rcpp::export]]
Rcpp::IntegerVector which2(Rcpp::NumericVector x) {
Rcpp::IntegerVector v = Rcpp::seq(0, x.size()-1);
return v[Rcpp::is_na(x)];
}
Added to code above it yields:
R> which2(t1)
[1] 2 3 7
R>
The logical subsetting is also somewhat new in Rcpp.
Try this:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector which4( NumericVector x) {
int nx = x.size();
std::vector<int> y;
y.reserve(nx);
for(int i = 0; i < nx; i++) {
if (R_IsNA(x[i])) y.push_back(i+1);
}
return wrap(y);
}
which we can run like this in R:
> which4(t1)
[1] 3 4 8
Performance
Note that we have changed the above solution to reserve space for the output vector. This replaces which3 which is:
// [[Rcpp::export]]
IntegerVector which3( NumericVector x) {
int nx = x.size();
IntegerVector y;
for(int i = 0; i < nx; i++) {
// if (internal::Rcpp_IsNA(x[i])) y.push_back(i+1);
if (R_IsNA(x[i])) y.push_back(i+1);
}
return y;
}
Then the performance on a vector 9 elements long is the following with which4 the fastest:
> library(rbenchmark)
> benchmark(retIdxNA(t1), whichNA(t1), which2(t1), which3(t1), which4(t1),
+ replications = 10000, order = "relative")[1:4]
test replications elapsed relative
5 which4(t1) 10000 0.14 1.000
4 which3(t1) 10000 0.16 1.143
1 retIdxNA(t1) 10000 0.17 1.214
2 whichNA(t1) 10000 0.17 1.214
3 which2(t1) 10000 0.25 1.786
Repeating this for a vector 9000 elements long the Armadillo solution comes in quite a bit faster than the others. Here which3 (which is the same as which4 except it does not reserve space for the output vector) comes in worst while which4 comes second.
> tt <- rep(t1, 1000)
> benchmark(retIdxNA(tt), whichNA(tt), which2(tt), which3(tt), which4(tt),
+ replications = 1000, order = "relative")[1:4]
test replications elapsed relative
2 whichNA(tt) 1000 0.09 1.000
5 which4(tt) 1000 0.79 8.778
3 which2(tt) 1000 1.03 11.444
1 retIdxNA(tt) 1000 1.19 13.222
4 which3(tt) 1000 23.58 262.000
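Part of the gap between which3 and which4 comes down to how often the output has to grow. The following minimal standalone sketch uses std::vector, not Rcpp, and the helper count_reallocations is made up. It counts how many times push_back changes the capacity with and without an up-front reserve. It only illustrates the effect of reserving. which3 also appends to an Rcpp IntegerVector rather than a std::vector, which may well account for much of its slowdown.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of times the vector's storage grew while appending n elements.
// Hypothetical helper, for illustration only.
std::size_t count_reallocations(std::size_t n, bool reserve_first) {
    std::vector<int> y;
    if (reserve_first) y.reserve(n);  // one allocation up front
    std::size_t reallocs = 0;
    std::size_t cap = y.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        y.push_back(static_cast<int>(i));
        if (y.capacity() != cap) {    // storage was reallocated
            ++reallocs;
            cap = y.capacity();
        }
    }
    return reallocs;
}
```

With reserve, no reallocation happens during the loop. Without it, the vector grows several times, and every growth copies the elements stored so far.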
All of the solutions above are serial. Although not trivial, it is quite possible to take advantage of threading for implementing which. See this write up for more details. For such small sizes, though, it would do more harm than good. Like taking a plane for a short distance, you lose too much time at airport security.
R implements which by allocating memory as large as the input, doing a single pass to store the indices in this memory, and eventually copying the used part into an integer vector of the right size.
Intuitively this seems less efficient than a double pass loop, but not necessarily, as copying a data range is cheap. See more details here.
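The two strategies can be put side by side in a minimal standalone C++ sketch. Both function names are made up, and std::isnan is used as a stand-in for the NA test: it flags NaN as well as NA, unlike R_IsNA.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Single pass: write hits into a buffer as large as the input,
// then copy the used prefix into a result of the right size.
std::vector<int> which_nan_buffered(const std::vector<double>& x) {
    std::vector<int> buf(x.size());
    std::size_t n = 0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        if (std::isnan(x[i])) buf[n++] = static_cast<int>(i) + 1;  // 1-based, as in R
    }
    return std::vector<int>(buf.begin(),
                            buf.begin() + static_cast<std::ptrdiff_t>(n));
}

// Two passes: count the hits first, then fill an exactly sized result.
std::vector<int> which_nan_two_pass(const std::vector<double>& x) {
    std::size_t n = 0;
    for (double v : x) {
        if (std::isnan(v)) ++n;
    }
    std::vector<int> out;
    out.reserve(n);
    for (std::size_t i = 0; i < x.size(); ++i) {
        if (std::isnan(x[i])) out.push_back(static_cast<int>(i) + 1);
    }
    return out;
}
```

The buffered version evaluates the predicate once per element at the price of a temporary allocation. The two-pass version evaluates it twice but allocates only what it returns.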
Just write a function for yourself like:
which_1<-function(a,b){
return(which(a>b))
}
Then pass this function into rcpp.