I am trying to create a matrix with random numbers where the rowSums should exactly be 1.
I already have a condition which checks if the rowSums is not 1 and tries to correct it.
When I print out the result it looks correct but if I test if all values are 1 it gives me some FALSE values.
How can I correct that?
library(Rcpp)
cppFunction('
NumericMatrix imembrandc(int n, int k) {
NumericMatrix u( n , k );
IntegerVector sequ = seq(1,100);
NumericVector sampled;
for (int i=0; i < k; ++i) {
sampled = sample(sequ, n);
u(_,i) = sampled / sum(sampled);
}
if (is_true(any(rowSums(u) != 1))) {
u(_,1) = u(_,1) + (1 - rowSums(u));
}
return(u);
}')
When I print out the rowSums of the result it looks correct:
res = imembrandc(n = 10, k = 5)
rowSums(res)
[1] 1 1 1 1 1 1 1 1 1 1
But checking it gives some FALSEs:
rowSums(res) == 1
[1] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE TRUE
The canonical way to generate n random numbers that sum to 1 is to generate n - 1 values from [0,1), add 0 and 1 to the list and take the difference of the sorted list. Of course, this depends on the distribution you want for the random numbers. This can be expressed in R as
set.seed(42)
v <- diff(sort(c(0, runif(5), 1)))
v
#> [1] 0.28613953 0.35560598 0.18870211 0.08435842 0.02226937 0.06292459
sum(v)
#> [1] 1
Created on 2019-05-24 by the reprex package (v0.2.1)
In your case in C++:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericMatrix imembrandc(int n, int k) {
NumericMatrix u(n, k);
for (int i = 0; i < n; ++i) {
NumericVector row = runif(k - 1);
row.push_back(0.0);
row.push_back(1.0);
u(i, _) = diff(row.sort());
}
return u;
}
/*** R
set.seed(42)
res = imembrandc(n = 10, k = 5)
rowSums(res)
rowSums(res) == 1
all.equal(rowSums(res),rep(1, nrow(res)))
*/
Note that I am generating rows to begin with, while you were generating columns and then tried to correct the rowSum. Output:
> set.seed(42)
> res = imembrandc(n = 10, k = 5)
> rowSums(res)
[1] 1 1 1 1 1 1 1 1 1 1
> rowSums(res) == 1
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
> all.equal(rowSums(res),rep(1, nrow(res)))
[1] TRUE
BTW, all.equal gives TRUE also for your matrix, since the difference is really small. But I find it better to avoid the problem from the beginning.
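To see why rowSums(res) == 1 can fail while all.equal succeeds, here is a minimal standalone sketch in plain C++ (not Rcpp). The helper names nearly_equal and sum_of_tenths are made up for illustration, and the default tolerance of 1.5e-8 is an assumption that only roughly mirrors the default of all.equal.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: true when a and b differ by at most tol.
// The default tolerance is an assumption, roughly mirroring all.equal().
bool nearly_equal(double a, double b, double tol = 1.5e-8) {
    assert(tol >= 0.0);
    return std::fabs(a - b) <= tol;
}

// Adds 0.1 ten times. Mathematically the result is 1, but 0.1 has no exact
// binary representation, so rounding errors accumulate in the sum.
double sum_of_tenths() {
    double s = 0.0;
    for (int i = 0; i < 10; ++i) s += 0.1;
    return s;
}
```

With IEEE double arithmetic, sum_of_tenths() == 1.0 is false, while nearly_equal(sum_of_tenths(), 1.0) is true. That is the same situation as the FALSE entries in your check.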
Is there a more effective way to match matrix rows when using large matrices?
I have a vector with values that correspond to a matrix of 2^N rows. N is typically large e.g., >20. Each row is a unique combination of N={0,1} values and represents a 'position' on a decision space. I.e., for N=3 the rows would be
0 0 0,
0 0 1,
0 1 0,
1 0 0,
...,
1 1 1
I need to determine whether a position is a local maximum, i.e., whether the N neighboring positions are of lower values. For example, to the position 0 0 0, the neighboring positions are 1 0 0, 0 1 0, and 0 0 1, accordingly.
I have coded the following solution that does the job but very slowly for large N.
library(prodlim) #for row.match command
set.seed(1234)
N=10
space = as.matrix(expand.grid(rep(list(0:1), N))) #creates all combinations of 0-1 along N-dimensions
performance = replicate(2^N, runif(1, min=0, max=1)) #corresponding values for each space-row (position)
#determine whether a space position is a local maximum, that is, the N neighboring positions are smaller in performance value
system.time({
local_peaks_pos = matrix(NA,nrow=2^N, ncol=1)
for(v in 1:2^N)
{
for(q in 1:N)
{
temp_local_pos = space[v,1:N]
temp_local_pos[q] = abs(temp_local_pos[q]-1)
if(performance[row.match(temp_local_pos[1:N], space[,1:N])] > performance[v])
{
local_peaks_pos[v,1] = 0
break
}
}
}
local_peaks_pos[is.na(local_peaks_pos)] = 1
})
user system elapsed
9.94 0.05 10.06
As Gabe mentioned in his comment,
you can exploit the fact that your decision space can be interpreted as single integers:
set.seed(1234L)
N <- 10L
performance <- runif(2^N)
powers_of_two <- as.integer(rev(2L ^ (0L:(N - 1L))))
is_local_max <- sapply(0L:(2^N - 1), function(i) {
multipliers <- as.integer(rev(intToBits(i)[1L:N])) * -1L
multipliers[multipliers == 0L] <- 1L
neighbors <- i + powers_of_two * multipliers
# compensate that R vectors are 1-indexed
!any(performance[neighbors + 1L] > performance[i + 1L])
})
# compensate again
local_peaks_int <- which(is_local_max) - 1L
local_peaks_binary <- t(sapply(local_peaks_int, function(int) {
as.integer(rev(intToBits(int)[1L:N]))
}))
> head(local_peaks_binary)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    0    0    0    0    0    0    0    1    0     0
[2,]    0    0    0    0    1    0    0    1    1     0
[3,]    0    0    0    0    1    1    1    1    0     0
[4,]    0    0    0    1    0    0    0    1    1     1
[5,]    0    0    0    1    0    1    0    1    0     1
[6,]    0    0    0    1    1    0    1    1    1     0
In decimal,
multipliers contains the sign of each entry of powers_of_two so that,
when added to the current integer,
it represents a bit flip in binary.
For example,
if the original binary was 0 0 and we flip one bit to get 1 0,
it's as if we added 2^1 in decimal,
but if it was originally 1 0 and we flip one bit to get 0 0,
then we subtracted 2^1 in decimal.
Each row in local_peaks_binary is a binary from your decision space,
where the least significant bit is on the right.
So, for example, the first local peak is a decimal 4.
See this question for the mapping of integers to binary.
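The add-or-subtract view of a bit flip described above can be checked with a small standalone C++ sketch. The function names flip_bit and flip_bit_arithmetic are hypothetical, and the sketch assumes a 32-bit unsigned int.

```cpp
#include <cassert>

// Flip bit q of v with XOR (q = 0 is the least significant bit).
unsigned int flip_bit(unsigned int v, int q) {
    assert(q >= 0 && q < 32);  // assumes a 32-bit unsigned int
    return v ^ (1u << q);
}

// Same result via arithmetic: add 2^q if bit q is 0, subtract 2^q if it is 1.
unsigned int flip_bit_arithmetic(unsigned int v, int q) {
    assert(q >= 0 && q < 32);
    const unsigned int power = 1u << q;
    return (v & power) ? v - power : v + power;
}
```

For the example in the text, flip_bit(0, 1) gives 2 (binary 0 0 becomes 1 0) and flip_bit(2, 1) gives 0 again. Both functions agree for every input, which is why the R code can get away with signed powers of two.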
EDIT: and if you want to do it in parallel:
library(doParallel)
set.seed(1234L)
N <- 20L
performance <- runif(2^N)
powers_of_two <- as.integer(rev(2 ^ (0:(N - 1))))
num_cores <- detectCores()
workers <- makeCluster(num_cores)
registerDoParallel(workers)
chunks <- splitIndices(length(performance), num_cores)
chunks <- lapply(chunks, "-", 1L)
local_peaks_int <- foreach(chunk=chunks, .combine=c, .multicombine=TRUE) %dopar% {
is_local_max <- sapply(chunk, function(i) {
multipliers <- as.integer(rev(intToBits(i)[1L:N])) * -1L
multipliers[multipliers == 0L] <- 1L
neighbors <- i + powers_of_two * multipliers
# compensate that R vectors are 1-indexed
!any(performance[neighbors + 1L] > performance[i + 1L])
})
# return
chunk[is_local_max]
}
local_peaks_binary <- t(sapply(local_peaks_int, function(int) {
as.integer(rev(intToBits(int)[1L:N]))
}))
stopCluster(workers); registerDoSEQ()
The above completes in ~2.5 seconds on my system,
which has 4 CPU cores.
Here is a C++ version that uses multi-threading but,
at least on my system with 4 threads,
it doesn't seem faster than Gabe's Fortran version.
However, when I try to run Gabe's Fortran code in a new session,
I get the following error with N <- 29L:
cannot allocate vector of size 4.0 Gb.
EDIT: Apparently I changed something important along the way,
because after testing again,
the C++ version actually seems faster.
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::depends(RcppParallel)]]
#include <cstddef> // size_t
#include <vector>
#include <Rcpp.h>
#include <RcppParallel.h>
using namespace std;
using namespace Rcpp;
using namespace RcppParallel;
class PeakFinder : public Worker
{
public:
PeakFinder(const NumericVector& performance, vector<int>& peaks, const int N)
: performance_(performance)
, peaks_(peaks)
, N_(N)
{ }
void operator()(size_t begin, size_t end) {
vector<int> peaks;
for (size_t i = begin; i < end; i++) {
bool is_local_peak = true;
unsigned int mask = 1;
for (int exponent = 0; exponent < N_; exponent++) {
unsigned int neighbor = static_cast<unsigned int>(i) ^ mask; // bitwise XOR
if (performance_[i] < performance_[neighbor]) {
is_local_peak = false;
break;
}
mask <<= 1;
}
if (is_local_peak)
peaks.push_back(static_cast<int>(i));
}
mutex_.lock();
peaks_.insert(peaks_.end(), peaks.begin(), peaks.end());
mutex_.unlock();
}
private:
const RVector<double> performance_;
vector<int>& peaks_;
const int N_;
tthread::mutex mutex_;
};
// [[Rcpp::export]]
IntegerVector local_peaks(const NumericVector& performance, const int N) {
vector<int> peaks;
PeakFinder peak_finder(performance, peaks, N);
// each thread call will check at least 1024 values
parallelFor(0, performance.length(), peak_finder, 1024);
IntegerVector result(peaks.size());
int i = 0;
for (int peak : peaks) {
result[i++] = peak;
}
return result;
}
After saving the C++ code in local-peaks.cpp,
this code:
library(Rcpp)
library(RcppParallel)
sourceCpp("local-peaks.cpp")
set.seed(1234L)
N <- 29L
performance <- runif(2^N)
system.time({
local_peaks_int <- local_peaks(performance, N)
})
finished in ~2 seconds
(without considering the time required to allocate performance).
If you do need the binary representation,
you can change local_peaks like this
(see this question):
// [[Rcpp::export]]
IntegerMatrix local_peaks(const NumericVector& performance, const int N) {
vector<int> peaks;
PeakFinder peak_finder(performance, peaks, N);
// each thread call will check at least 1024 values
parallelFor(0, performance.length(), peak_finder, 1024);
// in case you want the same order every time, #include <algorithm> and uncomment next line
// sort(peaks.begin(), peaks.end());
IntegerMatrix result(peaks.size(), N);
int i = 0;
for (int peak : peaks) {
for (int j = 0; j < N; j++) {
result(i, N - j - 1) = peak & 1;
peak >>= 1;
}
i++;
}
return result;
}
Here is one solution that follows the same general structure as your example code. intToBits and packBits map to and from the binary representation for each integer (subtracting one to start at zero). The inner loop flips each of the N bits to get the neighbors. On my laptop, this runs in a fraction of a second for N=10 and around a minute for N=20. The commented code stores some information from neighbors already tested so as to not redo the calculation. Uncommenting those lines makes it run in about 35 seconds for N=20.
loc_max <- rep(1, 2^N)
for (v in 1:2^N){
## if (loc_max[v] == 0) next
vbits <- intToBits(v-1)
for (q in 1:N){
tmp <- vbits
tmp[q] <- !vbits[q]
pos <- packBits(tmp, type = "integer") + 1
if (performance[pos] > performance[v]){
loc_max[v] <- 0
break
## } else {
## loc_max[pos] <- 0
}
}
}
identical(loc_max, local_peaks_pos[, 1])
## [1] TRUE
EDIT:
It sounds like you need every bit of speed possible, so here's another suggestion that relies on compiled code to run significantly faster than my first example. A fraction of a second for N=20 and a bit under 20 seconds for N=29 (the largest example I could fit in my laptop's RAM).
This is using a single core; you could either parallelize this, or alternatively run it in a single core and parallelize your Monte Carlo simulations instead.
library(inline)
loopcode <-
" integer v, q, pos
do v = 0, (2**N)-1
do q = 0, N-1
if ( btest(v,q) ) then
pos = ibclr(v, q)
else
pos = ibset(v, q)
end if
if (performance(pos) > performance(v)) then
loc_max(v) = 0
exit
end if
end do
end do
"
loopfun <- cfunction(sig = signature(performance="numeric", loc_max="integer", n="integer"),
dim=c("(0:(2**n-1))", "(0:(2**n-1))", ""),
loopcode,
language="F95")
N <- 20
performance = runif(2^N, min=0, max=1)
system.time({
floop <- loopfun(performance, rep(1, 2^N), N)
})
## user system elapsed
## 0.049 0.003 0.052
N <- 29
performance = runif(2^N, min=0, max=1)
system.time({
floop <- loopfun(performance, rep(1, 2^N), N)
})
## user system elapsed
## 17.892 1.848 19.741
I don't think pre-computing the neighbors would help much here since I'd guess the comparisons accessing different sections of such a large array are the most time consuming part.
I have a function where I need to make a table (tab), then change one value: the one where tab.names() == k, where k is given in the function call.
Looking at http://dirk.eddelbuettel.com/code/rcpp/Rcpp-quickref.pdf, I hoped that the following code would work (replacing "foo" with a variable name), but I guess that requires the element name to be static, and mine won't be. I've tried using which, but that won't compile (invalid conversion from 'char' to 'Rcpp::traits::storage_type<16>::type {aka SEXPREC*}'), so I'm doing something wrong there.
#include <RcppArmadillo.h>
#include <algorithm>
//[[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector fun(const arma::vec& assignment, int k) {
// count number of peptides per protein
IntegerVector tab = table(as<IntegerVector>(wrap(assignment)));
CharacterVector all_proteins = tab.names();
char kc = '0' + k;
// what I need a working version of:
tab(kc) = 1; // gets ignored, as does a [] version of the same thing.
// or
tab('0' + k) = 1; // also ignored
int ki = which(all_proteins == kc); // gives me compile errors
// extra credit
// tab.names(k-1) = "-1";
return tab;
}
/*** R
set.seed(23)
x <- rpois(20, 5)
k <- 5
fun(x, k)
# same thing in R:
expected_output <- table(x)
expected_output # before modification
# x
# 3 4 5 6 7 9 10 12
# 2 4 3 3 4 2 1 1
expected_output[as.character(k)] <- 1 # this is what I need help with
expected_output
# x
# 3 4 5 6 7 9 10 12
# 2 4 1 3 4 2 1 1
# extra credit:
names(expected_output)[names(expected_output) == as.character(k)] <- "-1"
*/
I'm still learning rcpp, and more importantly, still learning how to read the manual pages and plug in the right search terms into google/stackoverflow. I'm sure this is basic stuff (and I'm open to better methods - I currently think like an R programmer in terms of initial approaches to problems, not a C++ programmer.)
(BTW - arma::vec is used in other parts of the code which I'm not showing for simplicity - I realize it's not useful here. I debated on switching it, but decided against it on the principle that I've tested that part, it works, and the last thing I want to do is introduce an extra bug...)
Thanks!
You can use the .findName() method to get the relevant index:
#include <RcppArmadillo.h>
#include <algorithm>
//[[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector fun(const arma::vec& assignment, int k) {
// count number of peptides per protein
IntegerVector tab = table(as<IntegerVector>(wrap(assignment)));
CharacterVector all_proteins = tab.names();
// std::to_string (C++11) also handles multi-digit k (e.g. 10 or 12), unlike '0' + k
int index = tab.findName(std::to_string(k));
tab(index) = 1;
all_proteins(index) = "-1";
tab.names() = all_proteins;
return tab;
}
/*** R
set.seed(23)
x <- rpois(20, 5)
k <- 5
fun(x, k)
*/
Output:
> Rcpp::sourceCpp('table-name.cpp')
> set.seed(23)
> x <- rpois(20, 5)
> k <- 5
> fun(x, k)
3 4 -1 6 7 9 10 12
2 4 1 3 4 2 1 1
You could write your own function (use String instead of char):
int first_which_equal(const CharacterVector& x, String y) {
int n = x.size();
for (int i = 0; i < n; i++) {
if (x[i] == y) return(i);
}
return -1;
}
Also, it seems that tab(kc) is converting kc to an integer representation.
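To make the char-versus-name issue concrete, here is a small standalone C++ sketch (plain C++, no Rcpp). The function names are hypothetical, and the numeric values assume an ASCII-compatible character set.

```cpp
#include <cassert>
#include <string>

// Integer value carried by '0' + k; in ASCII this is 53 for k = 5,
// which is why an index built this way is not the name "5".
int char_code(int k) {
    return '0' + k;
}

// Builds a one-character label; only meaningful for 0 <= k <= 9.
std::string label_from_char(int k) {
    return std::string(1, static_cast<char>('0' + k));
}

// std::to_string (C++11) formats any int as decimal text.
std::string label_from_int(int k) {
    return std::to_string(k);
}
```

For example, assert(label_from_int(12) == "12") holds, while label_from_char(12) yields "<" on ASCII systems, so a name lookup based on '0' + k only works for single-digit values.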
I'm working with large matrices of about 2500x2500x50 (lonxlatxtime). The matrix contains only 1 and 0. I need to know for each timestep the sum of the 24 surrounding elements. So far I did it about this way:
xdim <- 2500
ydim <- 2500
tdim <- 50
a <- array(0:1,dim=c(xdim,ydim,tdim))
res <- array(0:1,dim=c(xdim,ydim,tdim))
for (t in 1:tdim){
for (x in 3:(xdim-2)){
for (y in 3:(ydim-2)){
res[x,y,t] <- sum(a[(x-2):(x+2),(y-2):(y+2),t])
}
}
}
This works, but it is much too slow for my needs. Does anybody have advice on how to speed it up?
Intro
I have to say, there are so many hidden things behind just the setup of the arrays. The remainder of the problem is trivial though. As a result, there are two ways to go about it really:
Bruteforce given by @Alex (written in C++)
Observing replication patterns
Bruteforce with OpenMP
If we want to 'brute force' it, then we can use the suggestion given by @Alex to employ OpenMP with Armadillo
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// Add a flag to enable OpenMP at compile time
// [[Rcpp::plugins(openmp)]]
// Protect against compilers without OpenMP
#ifdef _OPENMP
#include <omp.h>
#endif
// [[Rcpp::export]]
arma::cube cube_parallel(arma::cube a, arma::cube res, int cores = 1) {
// Extract the different dimensions
unsigned int tdim = res.n_slices;
unsigned int xdim = res.n_rows;
unsigned int ydim = res.n_cols;
// Same calculation loop
#pragma omp parallel for num_threads(cores)
for (unsigned int t = 0; t < tdim; t++){
// pop the T
arma::mat temp_mat = a.slice(t);
// Subset the rows
for (unsigned int x = 2; x < xdim-2; x++){
arma::mat temp_row_sub = temp_mat.rows(x-2, x+2);
// Iterate over the columns with unit accumulative sum
for (unsigned int y = 2; y < ydim-2; y++){
res(x,y,t) = accu(temp_row_sub.cols(y-2,y+2));
}
}
}
return res;
}
Replication Patterns
However, the smarter approach is understanding how the array(0:1, dims) is being constructed.
Most notably:
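Case 1: If xdim is even, then only the rows of each matrix alternate (every column is identical), and all matrices are the same.
Case 2: If xdim is odd and ydim is odd, then rows and columns alternate within each matrix (a checkerboard), and the matrices themselves alternate as well.
Case 3: If xdim is odd and ydim is even, then rows and columns alternate within each matrix (a checkerboard), but all matrices are the same.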
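All three cases fall out of one rule: R fills arrays in column-major order and recycles 0:1, so each entry is the parity of its linear index. A minimal standalone C++ sketch of that rule (recycled_value is a hypothetical helper, using 0-based indices):

```cpp
#include <cassert>
#include <cstddef>

// Value that array(0:1, dim = c(xdim, ydim, tdim)) stores at 0-based (x, y, t).
// The linear index runs fastest over x, then y, then t (column-major order),
// and recycling 0:1 makes the stored value the parity of that index.
int recycled_value(std::size_t x, std::size_t y, std::size_t t,
                   std::size_t xdim, std::size_t ydim) {
    assert(x < xdim && y < ydim);
    const std::size_t linear = x + xdim * (y + ydim * t);
    return static_cast<int>(linear % 2);
}
```

With xdim = 2 each column starts at an even linear index, so only rows alternate. With xdim = 3 and ydim = 3 each slice holds 9 entries, so the parity also flips from one slice to the next.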
Examples
Let's see the cases in action to observe the patterns.
Case 1:
xdim <- 2
ydim <- 3
tdim <- 2
a <- array(0:1,dim=c(xdim,ydim,tdim))
Output:
, , 1
[,1] [,2] [,3]
[1,] 0 0 0
[2,] 1 1 1
, , 2
[,1] [,2] [,3]
[1,] 0 0 0
[2,] 1 1 1
Case 2:
xdim <- 3
ydim <- 3
tdim <- 3
a <- array(0:1,dim=c(xdim,ydim,tdim))
Output:
, , 1
[,1] [,2] [,3]
[1,] 0 1 0
[2,] 1 0 1
[3,] 0 1 0
, , 2
[,1] [,2] [,3]
[1,] 1 0 1
[2,] 0 1 0
[3,] 1 0 1
, , 3
[,1] [,2] [,3]
[1,] 0 1 0
[2,] 1 0 1
[3,] 0 1 0
Case 3:
xdim <- 3
ydim <- 4
tdim <- 2
a <- array(0:1,dim=c(xdim,ydim,tdim))
Output:
, , 1
[,1] [,2] [,3] [,4]
[1,] 0 1 0 1
[2,] 1 0 1 0
[3,] 0 1 0 1
, , 2
[,1] [,2] [,3] [,4]
[1,] 0 1 0 1
[2,] 1 0 1 0
[3,] 0 1 0 1
Pattern Hacking
Alrighty, based on the above discussion, we opt to make a bit of code that exploits this unique pattern.
Creating Alternating Vectors
An alternating vector in this case switches between two different values.
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// ------- Make Alternating Vectors
arma::vec odd_vec(unsigned int xdim){
// make a temporary vector to create alternating 0-1 effect by row.
arma::vec temp_vec(xdim);
// Alternating vector (anyone have a better solution? )
for (unsigned int i = 0; i < xdim; i++) {
temp_vec(i) = (i % 2 ? 0 : 1);
}
return temp_vec;
}
arma::vec even_vec(unsigned int xdim){
// make a temporary vector to create alternating 0-1 effect by row.
arma::vec temp_vec(xdim);
// Alternating vector (anyone have a better solution? )
for (unsigned int i = 0; i < xdim; i++) {
temp_vec(i) = (i % 2 ? 1 : 0); // changed
}
return temp_vec;
}
Creating the three cases of matrix
As mentioned above, there are three cases of matrix. The even, first odd, and second odd cases.
// --- Handle the different cases
// [[Rcpp::export]]
arma::mat make_even_matrix(unsigned int xdim, unsigned int ydim){
arma::mat temp_mat(xdim,ydim);
temp_mat.each_col() = even_vec(xdim);
return temp_mat;
}
// xdim is odd: checkerboard with 0 in the top-left corner
// [[Rcpp::export]]
arma::mat make_odd_matrix_case1(unsigned int xdim, unsigned int ydim){
arma::mat temp_mat(xdim,ydim);
arma::vec e_vec = even_vec(xdim);
arma::vec o_vec = odd_vec(xdim);
// Alternating column
for (unsigned int i = 0; i < ydim; i++) {
temp_mat.col(i) = (i % 2 ? o_vec : e_vec);
}
return temp_mat;
}
// xdim is odd and ydim is odd: flipped checkerboard (1 in the top-left corner) for every second slice
// [[Rcpp::export]]
arma::mat make_odd_matrix_case2(unsigned int xdim, unsigned int ydim){
arma::mat temp_mat(xdim,ydim);
arma::vec e_vec = even_vec(xdim);
arma::vec o_vec = odd_vec(xdim);
// Alternating column
for (unsigned int i = 0; i < ydim; i++) {
temp_mat.col(i) = (i % 2 ? e_vec : o_vec); // slight change
}
return temp_mat;
}
Calculation Engine
Same as the previous solution, just without the t as we no longer need to repeat calculations.
// --- Calculation engine
// [[Rcpp::export]]
arma::mat calc_matrix(arma::mat temp_mat){
unsigned int xdim = temp_mat.n_rows;
unsigned int ydim = temp_mat.n_cols;
arma::mat res = temp_mat;
// Subset the rows
for (unsigned int x = 2; x < xdim-2; x++){
arma::mat temp_row_sub = temp_mat.rows(x-2, x+2);
// Iterate over the columns with unit accumulative sum
for (unsigned int y = 2; y < ydim-2; y++){
res(x,y) = accu(temp_row_sub.cols(y-2,y+2));
}
}
return res;
}
Call Main Function
Here is the core function that pieces everything together. This gives us the desired array of neighborhood sums.
// --- Main Engine
// Create the desired cube information
// [[Rcpp::export]]
arma::cube dim_to_cube(unsigned int xdim = 4, unsigned int ydim = 4, unsigned int tdim = 3) {
// Initialize values in A
arma::cube res(xdim,ydim,tdim);
if(xdim % 2 == 0){
res.each_slice() = calc_matrix(make_even_matrix(xdim, ydim));
}else{
if(ydim % 2 == 0){
res.each_slice() = calc_matrix(make_odd_matrix_case1(xdim, ydim));
}else{
arma::mat first_odd_mat = calc_matrix(make_odd_matrix_case1(xdim, ydim));
arma::mat sec_odd_mat = calc_matrix(make_odd_matrix_case2(xdim, ydim));
for(unsigned int t = 0; t < tdim; t++){
res.slice(t) = (t % 2 ? sec_odd_mat : first_odd_mat);
}
}
}
return res;
}
Timing
Now, the real question is how well this performs:
Unit: microseconds
       expr      min        lq       mean    median        uq       max neval
    r_1core 3538.022 3825.8105 4301.84107 3957.3765 4043.0085 16856.865   100
 alex_1core 2790.515 2984.7180 3461.11021 3076.9265 3189.7890 15371.406   100
  cpp_1core  174.508  180.7190  197.29728  194.1480  204.8875   338.510   100
  cpp_2core  111.960  116.0040  126.34508  122.7375  136.2285   162.279   100
  cpp_3core   81.619   88.4485  104.54602   94.8735  108.5515   204.979   100
  cpp_cache   40.637   44.3440   55.08915   52.1030   60.2290   302.306   100
Script used for timing:
xdim <- 20
ydim <- 20
tdim <- 5
a <- array(0:1,dim=c(xdim,ydim,tdim))
res <- array(0:1,dim=c(xdim,ydim,tdim))
# cube_r() and alex() are assumed to be R functions wrapping the OP's loop and Alex's loop (definitions not shown)
cpp_parallel = cube_parallel(a,res, 1)
alex_1core = alex(a,res,xdim,ydim,tdim)
cpp_cache = dim_to_cube(xdim,ydim,tdim)
op_answer = cube_r(a,res,xdim,ydim,tdim)
all.equal(cpp_parallel, op_answer)
all.equal(cpp_cache, op_answer)
all.equal(alex_1core, op_answer)
ga = microbenchmark::microbenchmark(r_1core = cube_r(a,res,xdim,ydim,tdim),
alex_1core = alex(a,res,xdim,ydim,tdim),
cpp_1core = cube_parallel(a,res, 1),
cpp_2core = cube_parallel(a,res, 2),
cpp_3core = cube_parallel(a,res, 3),
cpp_cache = dim_to_cube(xdim,ydim,tdim))
Here's one solution that's fast for a large array:
res <- apply(a, 3, function(a) t(filter(t(filter(a, rep(1, 5), circular=TRUE)), rep(1, 5), circular=TRUE)))
dim(res) <- c(xdim, ydim, tdim)
I filtered the array using rep(1,5) as the weights (i.e. sum values within a neighborhood of 2) along each dimension. I then modified the dim attribute since it initially comes out as a matrix.
Note that this wraps the sum around at the edges of the array (which might make sense since you're looking at latitude and longitude; if not, I can modify my answer).
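The idea behind the two nested filter calls is that a 5x5 box sum is separable: a 1-D pass along rows followed by a 1-D pass along columns gives the same result. Here is a minimal standalone C++ sketch of that idea with wrap-around at the edges. The names Matrix, circular_pass and box_sum_5x5 are hypothetical, and this is not how stats::filter is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// Circular sum over a (2 * radius + 1)-wide window along one dimension.
// along_rows == true slides over the row index, otherwise over the column index.
Matrix circular_pass(const Matrix& m, int radius, bool along_rows) {
    assert(!m.empty() && !m[0].empty() && radius >= 0);
    const int nr = static_cast<int>(m.size());
    const int nc = static_cast<int>(m[0].size());
    Matrix out(nr, std::vector<int>(nc, 0));
    for (int r = 0; r < nr; ++r)
        for (int c = 0; c < nc; ++c)
            for (int d = -radius; d <= radius; ++d) {
                // wrap indices around the edges, like circular = TRUE
                const int rr = along_rows ? ((r + d) % nr + nr) % nr : r;
                const int cc = along_rows ? c : ((c + d) % nc + nc) % nc;
                out[r][c] += m[rr][cc];
            }
    return out;
}

// 5 x 5 neighborhood sum (center included) as two 1-D passes.
Matrix box_sum_5x5(const Matrix& m) {
    return circular_pass(circular_pass(m, 2, true), 2, false);
}
```

Each pass touches 5 values per cell instead of 25, which is part of why the filtered version scales better than summing every 5x5 block from scratch.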
For a concrete example:
xdim <- 500
ydim <- 500
tdim <- 15
a <- array(0:1,dim=c(xdim,ydim,tdim))
and here's what you're currently using (with NAs at the edges) and how long this example takes on my laptop:
f1 <- function(a, xdim, ydim, tdim){
res <- array(NA_integer_,dim=c(xdim,ydim,tdim))
for (t in 1:tdim){
for (x in 3:(xdim-2)){
for (y in 3:(ydim-2)){
res[x,y,t] <- sum(a[(x-2):(x+2),(y-2):(y+2),t])
}
}
}
return(res)
}
system.time(res1 <- f1(a, xdim, ydim, tdim))
# user system elapsed
# 14.813 0.005 14.819
And here's a comparison with the version I described:
f2 <- function(a, xdim, ydim, tdim){
res <- apply(a, 3, function(a) t(filter(t(filter(a, rep(1, 5), circular=TRUE)), rep(1, 5), circular=TRUE)))
dim(res) <- c(xdim, ydim, tdim)
return(res)
}
system.time(res2 <- f2(a, xdim, ydim, tdim))
# user system elapsed
# 1.188 0.047 1.236
You can see there's a significant speed boost (for large arrays). And to check that it's giving the correct solution (note that I'm adding NAs so both results match, since the one I gave filters in a circular manner):
## Match NAs
res2NA <- ifelse(is.na(res1), NA, res2)
all.equal(res2NA, res1)
# [1] TRUE
I'll add that your full array (2500x2500x50) took just under a minute (about 55 seconds), although it did use a lot of memory in the process, FYI.
Your current code has a lot of overhead from redundant subsetting and calculation. Clean this up if you want better speed.
At xdim <- ydim <- 20; tdim <- 5, I see a 23% speedup on my machine.
At xdim <- ydim <- 200; tdim <- 10, I see a 25% speedup.
This comes at a small cost of additional memory, which is obvious from the code below.
library(microbenchmark)
xdim <- ydim <- 20; tdim <- 5
a <- array(0:1,dim=c(xdim,ydim,tdim))
res <- array(0:1,dim=c(xdim,ydim,tdim))
microbenchmark(op= {
for (t in 1:tdim){
for (x in 3:(xdim-2)){
for (y in 3:(ydim-2)){
res[x,y,t] <- sum(a[(x-2):(x+2),(y-2):(y+2),t])
}
}
}
},
alex= {
for (t in 1:tdim){
temp <- a[,,t]
for (x in 3:(xdim-2)){
temp2 <- temp[(x-2):(x+2),]
for (y in 3:(ydim-2)){
res[x,y,t] <- sum(temp2[,(y-2):(y+2)])
}
}
}
}, times = 50)
Unit: milliseconds
 expr      min       lq     mean   median       uq      max neval cld
   op 4.855827 5.134845 5.474327 5.321681 5.626738 7.463923    50   b
 alex 3.720368 3.915756 4.213355 4.012120 4.348729 6.320481    50   a
Further improvements:
If you write this in C++, my guess is that recognizing res[x,y,t] = res[x,y-1,t] - sum(a[...,y-3,...]) + sum(a[...,y+2,...]) will save you additional time (moving the window one step to the right drops column y-3 and adds column y+2). In R, it did not in my timing tests.
This problem is also embarrassingly parallel. There's no reason you couldn't split the t dimension to make more use of a multi-core architecture.
Both of these are left to the reader / OP.
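For what it's worth, here is a rough standalone C++ sketch of the sliding-window idea for a single row of a single time slice. I have not timed it. The names strip_sum and row_window_sums are hypothetical, and leaving border cells at 0 is an arbitrary choice for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum of rows x-2..x+2 in column y of m (a 5-cell vertical strip).
int strip_sum(const std::vector<std::vector<int>>& m, std::size_t x, std::size_t y) {
    int s = 0;
    for (std::size_t r = x - 2; r <= x + 2; ++r) s += m[r][y];
    return s;
}

// 5 x 5 window sums for the interior cells of row x, updated incrementally:
// moving one column to the right drops strip y-3 and adds strip y+2.
std::vector<int> row_window_sums(const std::vector<std::vector<int>>& m, std::size_t x) {
    assert(m.size() >= 5 && m[0].size() >= 5 && x >= 2 && x + 2 < m.size());
    const std::size_t ncol = m[0].size();
    std::vector<int> out(ncol, 0);  // border cells stay 0 in this sketch
    int window = 0;
    for (std::size_t y = 0; y < 5; ++y) window += strip_sum(m, x, y);
    out[2] = window;  // first full window covers columns 0..4
    for (std::size_t y = 3; y + 2 < ncol; ++y) {
        window += strip_sum(m, x, y + 2) - strip_sum(m, x, y - 3);
        out[y] = window;
    }
    return out;
}
```

Each interior cell then costs two strip sums (10 reads) instead of 25, and the strips themselves could be updated the same way when moving from row x to row x+1.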