Negative subscripts in matrix indexing - r

In Rcpp/RcppArmadillo I want to do the following: from an n x n matrix A, extract the submatrix A[-j, -j], where j is a vector of indices. In R this can be done like so:
A = matrix(1:16, 4, 4)
j = c(2, 3)
A[-j, -j]
Seems that this functionality is not available in Rcpp or RcppArmadillo - sorry if I have overlooked something. One approach in R is
pos = setdiff(1:nrow(A), j)
A[pos, pos]
That will carry over to RcppArmadillo, but it seems awkward to have to create the vector pos as the complement of j, and I am not sure how to do it efficiently.
Does anyone have an idea for an efficient implementation / or a piece of code to share?

The Armadillo docs list the member functions .shed_rows() and .shed_cols(), each of which can take an argument that "... contains the indices of rows/columns/slices to remove". From my reading, removing both rows and columns takes two calls, one to each.
Using your example
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
// [[Rcpp::export]]
arma::mat fun(arma::mat X, arma::uvec row, arma::uvec col) {
X.shed_cols(col); // remove columns
X.shed_rows(row); // remove rows
return(X);
}
/***R
A = matrix(1:16, 4, 4)
j = c(2, 3)
A[-j, -j]
# minus one for zero indexing
fun(A, j-1, j-1)
*/
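If building the complement index vector pos is preferred over shedding, it can be done in O(n) with a boolean mask rather than a set difference. The following is a minimal sketch in plain C++ (no Rcpp or Armadillo types); the function name complement_indices is made up for illustration, and indices are assumed to be zero-based.

```cpp
#include <cstddef>
#include <vector>

// Return all indices in [0, n) that do NOT appear in `drop`.
// Hypothetical helper: zero-based indices, out-of-range entries are ignored.
std::vector<std::size_t> complement_indices(std::size_t n,
                                            const std::vector<std::size_t>& drop) {
    std::vector<bool> removed(n, false);
    for (std::size_t idx : drop) {
        if (idx < n) removed[idx] = true;    // mark each index to exclude
    }
    std::vector<std::size_t> keep;
    keep.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        if (!removed[i]) keep.push_back(i);  // result stays sorted
    }
    return keep;
}
```

For the 4 x 4 example with j = c(2, 3), i.e. zero-based {1, 2}, this keeps indices 0 and 3. In RcppArmadillo the result could presumably be copied into an arma::uvec and passed to X.submat(pos, pos); I have not benchmarked that against the two shed calls.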

Related

Compute product of large 3-D arrays in R

I am working on an optimization problem, and to supply the analytic gradient to the routine, I need to compute the gradient of large 3D arrays with respect to parameters. The largest of these arrays, s, has dimensions [L,N,J], where L,J ~ 2000 and N = 15. L and N index nodes over which the arrays are then aggregated with fixed weights w into vectors of length J. Computing the gradient naively generates an [L,N,J,J] array x whose elements are x(l,n,j,k) = -s(l,n,j)s(l,n,k) if j ≠ k and x(l,n,j,j) = s(l,n,j)(1-s(l,n,j)).
Several functions in the procedure would use x as input, but as of right now I cannot keep x in memory due to its size. My approach so far has been to compute and directly aggregate up x over L and N to only ever store JxJ matrices, but the downside is that I cannot reuse x in other functions. This is what the following code does:
arma::mat agg_dsnode_ddelta_v3(arma::cube s_lnj,
arma::mat w_ln,
arma::vec w_l){
// Normal Matrix dimensions
unsigned int L = s_lnj.n_rows;
unsigned int N = s_lnj.n_cols;
unsigned int J = s_lnj.n_slices;
//resulting matrix
arma::mat ds_ddelta_jj = arma::mat(J,J, arma::fill::zeros);
for (unsigned int l = 0; l < L; l++) {
for (unsigned int n = 0; n < N; n++) {
arma::vec s_j = s_lnj.subcube(arma::span(l), arma::span(n), arma::span());
ds_ddelta_jj += - arma::kron(w_l(l) * w_ln(l,n) * s_j, s_j.as_row()) + arma::diagmat(w_l(l) * w_ln(l,n) * s_j);
}
}
return ds_ddelta_jj;
}
Alternatively, the 4-D array x could for instance be computed with sparseMatrix, but this approach does not scale as L and J increase:
library(Matrix)
L = 2
N = 3
J = 4
s_lnj <- array(rnorm(L*N*J), dim=c(L,N,J))
## create sparse matrix with s(l,n,:) vertically on the diagonal
As_lnj = A = sparseMatrix(i=c(1:(L*N*J)),j=rep(1:(L*N), each=J),x= as.vector(aperm(s_lnj, c(3, 1, 2))))
## create sparse matrix with s(l,n,:) horizontally on the diagonal
Bs_lnj = sparseMatrix(i=rep(1:(L*N), each=J),j=c(1:(L*N*J)),x= as.vector(aperm(s_lnj, c(3, 1, 2))))
## create sparse matrix with s(l,n,:) diagonally
Cs_lnj = sparseMatrix(i=c(1:(L*N*J)),j=c(1:(L*N*J)),x= as.vector(aperm(s_lnj, c(3, 1, 2))))
## compute 4-D array with sparseMatrix product
x = -(As_lnj %*% Bs_lnj) + Cs_lnj
I was wondering if you knew of a faster way to implement the first code, or alternatively of an approach that would make the second one scalable.
Thank you in advance

How can I speed up my Rcpp code, which only carries out simple operations?

I'm trying to write a function that takes in a matrix and computes a value for every pair of columns. The matrix always has 2000 rows, but can potentially have a very large number of columns (up to 100,000 or so). The R code I started with is as follows:
x_dist <- data.frame(array(0,dim=c(ncol(x),ncol(x))))
cs <- colSums(x)
for (i in 1:ncol(x)) {
p_i <- x[,i]
for (j in 1:ncol(x)) {
p_j <- x[,j]
s <- p_i+p_j
fac <- cs[i]/(cs[i]+cs[j])
N1 <- fac*s
N2 <- (1-fac)*s
d1 <- (p_i+1)/(N1+1)
d2 <- (p_j+1)/(N2+1)
x_dist[i,j] <- sum(N1+N2-N1*d1-N2*d2+p_i*log(d1)+p_j*log(d2))
}
}
This function is quite slow. When there are only 400 columns in the matrix x, it takes about 32 seconds, and obviously grows quadratically in the number of columns.
Since I've heard Rcpp is good for speeding up for loops and matrix operations, I decided to give that a try. I am completely new to it, but ended up putting together the following function:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericMatrix wdist(NumericMatrix x) {
int nrow = x.nrow(),ncol=x.ncol();
NumericMatrix m = no_init_matrix(ncol,ncol);
NumericVector v1 = no_init_vector(nrow);
NumericVector v2 = no_init_vector(nrow);
NumericVector s = no_init_vector(nrow);
NumericVector N1 = no_init_vector(nrow);
NumericVector N2 = no_init_vector(nrow);
NumericVector d1 = no_init_vector(nrow);
NumericVector d2 = no_init_vector(nrow);
for(int i=0; i<ncol; ++i){
v1 = x(_,i);
for(int j=0; j<i; ++j){
v2 = x(_,j);
s = v1+v2;
N1 = sum(v1)*s/(sum(v1)+sum(v2));
N2 = s-N1;
d1 = (v1+1)/(N1+1);
d2 = (v2+1)/(N2+1);
m(i,j) = sum(N1+N2-N1*d1-N2*d2+v1*log(d1)+v2*log(d2));
}
}
return m;
}
This certainly makes a big difference. Now with 400 columns, this takes about 8 seconds. I am pleased by the improvement, but this is still intractably slow for my current test case of interest, which is 32,000 columns. I feel like I am doing some relatively simple operations, so it's confusing to me why my code is still this slow. I've tried to do some reading on writing efficient Rcpp code, but haven't found anything that helps address my issue. Please let me know if there is anything I'm doing wrong or any improvements I can look into to make my code faster (or even the R code itself, if that can be made faster than the Rcpp code!)
Some example data could be:
set.seed(121220)
x <- array(rpois(2000*400,3),dim=c(2000,400))
I refactored your base R code, which I hope speeds it up somewhat:
f <- function(...) {
p <- x[, t(...)]
N <- matrix(rowSums(p), ncol = 1) %*% colSums(p) / sum(p)
d <- (p + 1) / (N + 1)
sum(N - N * d + p * log(d))
}
x_dist <- diag(0, ncol(x))
x_dist[lower.tri(x_dist)] <- combn(ncol(x), 2, FUN = f)
x_dist <- pmax(x_dist, t(x_dist))
To speed up your Rcpp code, you can try the following nested for loops after initializing your matrix m as an all-zero matrix:
for(int i=0; i<ncol-1; ++i){
v1 = x(_,i);
for(int j=i+1; j<ncol; ++j){
v2 = x(_,j);
s = v1+v2;
N1 = sum(v1)*s/sum(s);
N2 = s-N1;
d1 = (v1+1)/(N1+1);
d2 = (v2+1)/(N2+1);
double val = sum(N1+N2-N1*d1-N2*d2+v1*log(d1)+v2*log(d2));
m(i,j) = val;
m(j,i) = val;
}
}
This exploits the fact that the matrix is symmetric, so each pair of columns is evaluated only once, which halves the computation compared with looping over all (i, j) pairs.
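Two further things in the Rcpp loop may cost time: sum(v1) and sum(v2) are recomputed for every pair, and each vectorised expression allocates a temporary vector. The following is a sketch in plain C++ (no Rcpp types) of the per-pair computation done in a single pass, with the column sums assumed to be precomputed once per column. The names pair_value, cs_i and cs_j are illustrative, and whether this helps enough for the 32,000-column case has not been measured here.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Value for one pair of columns, computed in a single pass with no temporaries.
// cs_i and cs_j are the column sums, assumed precomputed once per column.
double pair_value(const std::vector<double>& p_i,
                  const std::vector<double>& p_j,
                  double cs_i, double cs_j) {
    const double fac = cs_i / (cs_i + cs_j);
    double total = 0.0;
    for (std::size_t r = 0; r < p_i.size(); ++r) {
        const double s  = p_i[r] + p_j[r];   // N1 + N2 equals s
        const double n1 = fac * s;
        const double n2 = s - n1;
        const double d1 = (p_i[r] + 1.0) / (n1 + 1.0);
        const double d2 = (p_j[r] + 1.0) / (n2 + 1.0);
        total += s - n1 * d1 - n2 * d2
               + p_i[r] * std::log(d1) + p_j[r] * std::log(d2);
    }
    return total;
}
```

Inside the double loop this would be called once per pair with i < j and the result written to both m(i,j) and m(j,i).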

Apply a function rowwise in Rcpp/RcppArmadillo

I have a function which takes a vector as input and outputs a scalar, and I want to apply this function to a number of observations. The data is structured in a matrix (rows are the observations and columns the variables) and the function is:
// [[Rcpp::export]]
double gaussianweight(arma::vec x, arma::mat H) {
double c = std::pow(2 * arma::datum::pi, -0.5 * x.n_rows);
double s = std::pow(arma::det(H), -1);
arma::mat Hinv = arma::inv(H);
return(c * s * std::exp(-0.5 * arma::dot(Hinv * x, Hinv * x)));
}
I want to apply it to every row vector of an arma::mat X. How would I do that efficiently? With a loop over the rows of X, or are there better solutions? I use R most of the time and have got used to avoiding loops whenever possible. I tried the .each_row() operations but had no luck...
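A plain loop over the rows is not slow in compiled C++ the way it is in R. What matters more here is that det(H) and inv(H) do not depend on x, so they can be computed once rather than once per row. The following is a sketch of that idea in plain C++ (no Armadillo), assuming the caller supplies the inverse and the determinant; the name gaussian_weights and the row-major layout are illustrative assumptions.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Evaluate the weight for every row of X (n x d, row-major).
// Hinv (d x d, row-major) and detH are assumed to be computed once by the caller.
std::vector<double> gaussian_weights(const std::vector<double>& X,
                                     std::size_t n, std::size_t d,
                                     const std::vector<double>& Hinv,
                                     double detH) {
    const double pi = std::acos(-1.0);
    // Constant factor shared by all rows: (2*pi)^(-d/2) / det(H).
    const double scale = std::pow(2.0 * pi, -0.5 * static_cast<double>(d)) / detH;
    std::vector<double> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        double sq = 0.0;                       // squared norm of Hinv * x_i
        for (std::size_t r = 0; r < d; ++r) {
            double y = 0.0;
            for (std::size_t c = 0; c < d; ++c)
                y += Hinv[r * d + c] * X[i * d + c];
            sq += y * y;
        }
        out[i] = scale * std::exp(-0.5 * sq);
    }
    return out;
}
```

With Armadillo the same structure would apply: compute the inverse and determinant before the loop, then evaluate each row inside it.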

Rcpp and R: pass by reference

Working with Rcpp and R I observed the following behaviour, which I do not understand at the moment. Consider the following simple function written in Rcpp
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericMatrix hadamard_product(NumericMatrix & X, NumericMatrix & Y){
unsigned int ncol = X.ncol();
unsigned int nrow = X.nrow();
int counter = 0;
for (unsigned int j=0; j<ncol; j++) {
for (unsigned int i=0; i<nrow; i++) {
X[counter++] *= Y(i, j);
}
}
return X;
}
This simply returns the component-wise product of two matrices. Now I know that the arguments to this function are passed by reference, i.e., calling
M <- matrix(rnorm(4), ncol = 2)
N <- matrix(rnorm(4), ncol = 2)
M_copy <- M
hadamard_product(M, N)
will overwrite the original M. However, it also overwrites M_copy, which I do not understand. I thought that M_copy <- M makes a copy of the object M and saves it somewhere in the memory and not that this assignment points M_copy to M, which would be the behaviour when executing
x <- 1
y <- x
x <- 2
for example. This does not change y but only x.
So why does the behaviour above occur?
No, R does not make a copy immediately, only if it is necessary, i.e., copy-on-modify:
x <- 1
tracemem(x)
#[1] "<0000000009A57D78>"
y <- x
tracemem(x)
#[1] "<0000000009A57D78>"
x <- 2
tracemem(x)
#[1] "<00000000099E9900>"
Since you modify M by reference outside R, R can't know that a copy is necessary. If you want to ensure a copy is made, you can use data.table::copy. Or avoid the side effect in your C++ code, e.g., make a deep copy there (by using clone).
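As a rough analogy in plain C++ (not Rcpp, and not a description of R's internals): after M_copy <- M, the two names behave like two handles to one buffer, so an in-place change through one handle is visible through the other. Making a deep copy before modifying avoids that. The names Buffer, scale_in_place and scale_copy are invented for this sketch.

```cpp
#include <memory>
#include <vector>

using Buffer = std::shared_ptr<std::vector<double>>;

// Modifies the shared buffer in place: every handle to it sees the change.
void scale_in_place(const Buffer& x, double factor) {
    for (double& v : *x) v *= factor;
}

// Works on a deep copy first, so the caller's buffer is left untouched.
Buffer scale_copy(const Buffer& x, double factor) {
    Buffer out = std::make_shared<std::vector<double>>(*x);  // deep copy
    for (double& v : *out) v *= factor;
    return out;
}
```

In the Rcpp function the analogous step would be to clone X at the top and modify and return the clone, as suggested above.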

Dynamic programming problems using iteration

I have spent a lot of time learning to implement and visualize dynamic programming problems using iteration, but I find it very hard to understand. I can implement the same problems using recursion with memoization, but that is slow compared to iteration.
Can someone explain the iterative approach with an example of a hard problem, or with some basic concepts? For example matrix chain multiplication, longest palindromic subsequence, and others. I understand the recursive process and how to memoize the overlapping sub-problems for efficiency, but I can't see how to do the same using iteration.
Thanks!
Dynamic programming is all about solving the sub-problems in order to solve the bigger one. The difference between the recursive approach and the iterative approach is that the former is top-down and the latter is bottom-up. In other words, using recursion, you start from the big problem you are trying to solve and chop it down into slightly smaller sub-problems, on which you repeat the process until you reach a sub-problem small enough to solve directly. This has the advantage that you only have to solve the sub-problems that are actually needed, using memoization to remember the results as you go. The bottom-up approach first solves all the sub-problems, using tabulation to remember the results. As long as that does not mean extra work on sub-problems that are never needed, it is the better approach.
For a simpler example, let's look at the Fibonacci sequence. Say we'd like to compute F(101). When doing it recursively, we will start with our big problem - F(101). For that, we notice that we need to compute F(99) and F(100). Then, for F(99) we need F(97) and F(98). We continue until we reach the smallest solvable sub-problem, which is F(1), and memoize the results. When doing it iteratively, we start from the smallest sub-problem, F(1) and continue all the way up, keeping the results in a table (so essentially it's just a simple for loop from 1 to 101 in this case).
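The two Fibonacci strategies can be sketched side by side. This is an illustrative sketch with made-up names (fib_memo, fib_table), assuming F(1) = F(2) = 1 and n >= 1. Note that F(101) does not fit in a 64-bit integer, so the code is only meaningful for n up to 93.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Top-down: recursion with memoization, starting from the big problem.
std::uint64_t fib_memo(int n, std::unordered_map<int, std::uint64_t>& memo) {
    if (n <= 2) return 1;                       // F(1) = F(2) = 1
    auto it = memo.find(n);
    if (it != memo.end()) return it->second;    // already solved
    std::uint64_t value = fib_memo(n - 1, memo) + fib_memo(n - 2, memo);
    memo[n] = value;
    return value;
}

// Bottom-up: fill a table from the smallest sub-problem upwards.
std::uint64_t fib_table(int n) {
    if (n <= 2) return 1;
    std::vector<std::uint64_t> f(n + 1, 0);
    f[1] = f[2] = 1;
    for (int i = 3; i <= n; ++i) f[i] = f[i - 1] + f[i - 2];
    return f[n];
}
```

Both functions compute the same values; the difference is only the order in which the sub-problems are visited.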
Let's take a look at the matrix chain multiplication problem, which you requested. We'll start with a naive recursive implementation, then recursive DP, and finally iterative DP. It's going to be implemented in a C/C++ soup, but you should be able to follow along even if you are not very familiar with them.
/* Solve the problem recursively (naive)
p - matrix dimensions
n - size of p
i..j - state (sub-problem): range of parenthesis */
int solve_rn(int p[], int n, int i, int j) {
// A chain consisting of a single matrix needs no multiplications
if (i == j) return 0;
// A minimal solution for this sub-problem, we
// initialize it with the maximal possible value
int min = std::numeric_limits<int>::max();
// Recursively solve all the sub-problems
for (int k = i; k < j; ++k) {
int tmp = solve_rn(p, n, i, k) + solve_rn(p, n, k + 1, j) + p[i - 1] * p[k] * p[j];
if (tmp < min) min = tmp;
}
// Return solution for this sub-problem
return min;
}
To compute the result, we start with the big problem:
solve_rn(p, n, 1, n - 1)
The key of DP is to remember all the solutions to the sub-problems instead of forgetting them, so we don't need to recompute them. It's trivial to make a few adjustments to the above code in order to achieve that:
/* Solve the problem recursively (DP)
p - matrix dimensions
n - size of p
i..j - state (sub-problem): range of parenthesis */
int solve_r(int p[], int n, int i, int j) {
/* We need to remember the results for state i..j.
This can be done in a matrix, which we call dp,
such that dp[i][j] is the best solution for the
state i..j. We initialize everything to 0 first.
static keyword here is just a C/C++ thing for keeping
the matrix between function calls, you can also either
make it global or pass it as a parameter each time.
MAXN is here too because the array size when doing it like
this has to be a constant in C/C++. I set it to 100 here.
But you can do it some other way if you don't like it. */
static int dp[MAXN][MAXN] = {{0}};
/* A chain consisting of a single matrix needs 0 operations, so we
can just return 0. Also, if we already computed the result
for this state, just return that. */
if (i == j) return 0;
else if (dp[i][j] != 0) return dp[i][j];
// A minimal solution for this sub-problem, we
// initialize it with the maximal possible value
dp[i][j] = std::numeric_limits<int>::max();
// Recursively solve all the sub-problems
for (int k = i; k < j; ++k) {
int tmp = solve_r(p, n, i, k) + solve_r(p, n, k + 1, j) + p[i - 1] * p[k] * p[j];
if (tmp < dp[i][j]) dp[i][j] = tmp;
}
// Return solution for this sub-problem
return dp[i][j];
}
We start with the big problem as well:
solve_r(p, n, 1, n - 1)
The iterative solution just iterates over all the states from the bottom up, instead of starting from the top:
/* Solve the problem iteratively
p - matrix dimensions
n - size of p
We don't need to pass state, because we iterate the states. */
int solve_i(int p[], int n) {
// But we do need our table, just like before
static int dp[MAXN][MAXN];
// A chain consisting of a single matrix needs no multiplications
for (int i = 1; i < n; ++i)
dp[i][i] = 0;
// L represents the length of the chain. We go from smallest, to
// biggest. Made L capital to distinguish letter l from number 1
for (int L = 2; L < n; ++L) {
// This double loop goes through all the states in the current
// chain length. The last valid start is n - L, so that j = i + L - 1
// never exceeds n - 1, the last index of p.
for (int i = 1; i <= n - L; ++i) {
int j = i + L - 1;
dp[i][j] = std::numeric_limits<int>::max();
for (int k = i; k <= j - 1; ++k) {
int tmp = dp[i][k] + dp[k+1][j] + p[i-1] * p[k] * p[j];
if (tmp < dp[i][j])
dp[i][j] = tmp;
}
}
}
// Return the result of the biggest problem
return dp[1][n-1];
}
To compute the result, just call it:
solve_i(p, n)
Explanation of the loop counters in the last example:
Let's say we need to optimize the multiplication of 4 matrices: A B C D. We are doing an iterative approach, so we will first compute the chains of length two: (A B) C D, A (B C) D, and A B (C D). Then the chains of length three: (A B C) D and A (B C D). Finally the full chain of length four, which gives the answer. That is what L, i and j are for.
L represents the chain length. It goes from 2 up to the number of matrices, which is n - 1 because p holds one more entry than there are matrices (here p has n = 5 entries, so L goes up to 4).
i and j represent the starting and ending position of the chain. In case L = 2, i goes from 1 to 3, and j goes from 2 to 4:
(A B) C D     A (B C) D     A B (C D)
 ^ ^             ^ ^             ^ ^
 i j             i j             i j
In case L = 3, i goes from 1 to 2, and j goes from 3 to 4:
(A B C) D     A (B C D)
 ^   ^           ^   ^
 i   j           i   j
So generally, with m = n - 1 matrices in the chain, i goes from 1 to m - L + 1, and j is i + L - 1.
Now, let's continue with the algorithm assuming that we are at the step where we have (A B C) D. We now need to take into account the sub-problems (which are already calculated): ((A B) C) D and (A (B C)) D. That is what k is for. It goes through all the positions between i and j and computes the sub problems.
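The same bottom-up table can also be written without the fixed-size static array. The following is a sketch rather than the code above: it uses std::vector for the table and long long for the costs, and the name matrix_chain_cost is illustrative. As before, p holds the dimensions, so a chain of m matrices has m + 1 entries in p.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Minimum number of scalar multiplications for a matrix chain.
// Matrix t has dimensions p[t-1] x p[t], so there are p.size() - 1 matrices.
long long matrix_chain_cost(const std::vector<long long>& p) {
    const std::size_t m = p.size() < 2 ? 0 : p.size() - 1;  // number of matrices
    if (m < 2) return 0;                                    // nothing to multiply
    // dp[i][j] = best cost for the chain i..j (1-based); length-1 chains cost 0.
    std::vector<std::vector<long long>> dp(m + 1, std::vector<long long>(m + 1, 0));
    for (std::size_t len = 2; len <= m; ++len) {            // chain length
        for (std::size_t i = 1; i + len - 1 <= m; ++i) {    // chain start
            const std::size_t j = i + len - 1;              // chain end
            dp[i][j] = std::numeric_limits<long long>::max();
            for (std::size_t k = i; k < j; ++k) {           // split point
                const long long cost =
                    dp[i][k] + dp[k + 1][j] + p[i - 1] * p[k] * p[j];
                if (cost < dp[i][j]) dp[i][j] = cost;
            }
        }
    }
    return dp[1][m];
}
```

For dimensions {1, 2, 3, 4}, i.e. matrices of size 1x2, 2x3 and 3x4, the two parenthesizations cost 18 and 32, and the function returns 18.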
I hope I helped.
The problem with recursion is the high number of stack frames that need to be pushed/popped. This can quickly become the bottle-neck.
The Fibonacci series can be calculated with iterative DP or with recursion plus memoization. If we calculate F(100) with DP, all we need is an array of length 100, e.g. int[100], and that accounts for essentially all the memory we use. We fill in the array after pre-filling f[0] and f[1], which are defined to be 1, and each later value depends only on the previous two.
If we use a recursive solution we start at fib(100) and work down. Every method call from 100 down to 0 is pushed onto the stack, AND checked if it's memoized. These operations add up and iteration doesn't suffer from either of these. In iteration (bottom-up) we already know all of the previous answers are valid. The bigger impact is probably the stack frames; and given a larger input you may get a StackOverflowException for what was otherwise trivial with an iterative DP approach.

Resources