Rewriting R's cummin() function using Rcpp and allowing for NAs - r

I'm learning Rcpp. In this example, I'm attempting to roll my own cummin() function like base R's cummin(), except I'd like my version to have a na.rm argument. This is my attempt
cummin.cpp
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector cummin_cpp(NumericVector x, bool narm = false){
// Given a numeric vector x, returns a vector of the
// same length representing the cumulative minimum value
// if narm = true, NAs will be ignored (The result may
// contain NAs if the first values of x are NA.)
// if narm = false, the resulting vector will return the
// cumulative min until the 1st NA value is encountered
// at which point all subsequent entries will be NA
if(narm){
// Ignore NAs
for(int i = 1; i < x.size(); i++){
if(NumericVector::is_na(x[i]) | (x[i-1] < x[i])) x[i] = x[i-1];
}
} else{
// Don't ignore NAs
for(int i = 1; i < x.size(); i++){
if(NumericVector::is_na(x[i-1]) | NumericVector::is_na(x[i])){
x[i] = NA_REAL;
} else if(x[i-1] < x[i]){
x[i] = x[i-1];
}
}
}
return x;
}
foo.R
library(Rcpp)
sourceCpp("cummin.cpp")
x <- c(3L, 1L, 2L)
cummin(x) # 3 1 1
cummin_cpp(x) # 3 1 1
class(cummin(x)) # integer
class(cummin_cpp(x)) # numeric
I have a few questions..
R's standard variable name is na.rm, not narm as I've done. However, it seems I can't use a dot in the c++ variable name. Is there a way around this so I can be consistent with R's convention?
I don't know ahead of time if the user's input is going to be a numeric vector or an integer vector, so I've used Rcpp's NumericVector type. Unfortunately, if the input is integer, the output is cast to numeric unlike base R's cummin() behavior. How do people usually deal with this issue?
The line if(NumericVector::is_na(x[i]) | (x[i-1] < x[i])) x[i] = x[i-1]; seems silly, but I don't know a better way to do this. Suggestions here?

I would use this:
template<typename T, int RTYPE>
Vector<RTYPE> cummin_cpp2(Vector<RTYPE> x, bool narm){
Vector<RTYPE> res = clone(x);
int i = 1, n = res.size();
T na;
if(narm){
// Ignore NAs
for(; i < n; i++){
if(ISNAN(res[i]) || (res[i-1] < res[i])) res[i] = res[i-1];
}
} else{
// Do not ignore NAs
for(; i < n; i++){
if(ISNAN(res[i-1])) {
na = res[i-1];
break;
} else if(res[i-1] < res[i]){
res[i] = res[i-1];
}
}
for(; i < n; i++){
res[i] = na;
}
}
return res;
}
// [[Rcpp::export]]
SEXP cummin_cpp2(SEXP x, bool narm = false) {
switch (TYPEOF(x)) {
case INTSXP: return cummin_cpp2<int, INTSXP>(x, narm);
case REALSXP: return cummin_cpp2<double, REALSXP>(x, narm);
default: Rcpp::stop("SEXP Type Not Supported.");
}
}
Try this on:
x <- c(NA, 7, 5, 4, NA, 2, 4)
x2 <- as.integer(x)
cummin_cpp(x, narm = TRUE)
x
cummin_cpp(x2)
x2
x <- c(NA, 7, 5, 4, NA, 2, 4)
x2 <- as.integer(x)
x3 <- replace(x, is.na(x), NaN)
cummin_cpp2(x, narm = TRUE)
x
cummin_cpp2(x2)
x2
cummin_cpp2(x3)
x3
Explanation:
Joran's advice is good, just wrap that in an R function
I use a dispatcher as Joseph Wood suggested
Beware that x is passed by reference and is modified if of the same type of what you declared (see these 2 slides)
You need to handle NA as well as NaN
You can use || instead of | to evaluate only the first condition if it is true.

Related

C way to detect and 'separate' a binary expression

I have a bottleneck in my code in expressions like any(x >= b | x == y) for a large x.
I'd like to avoid the allocation x >= b | x == y. I've found that it's easy to write a function for particular cases.
SEXP eval_any_or2(SEXP x, SEXP b, SEXP y) {
R_xlen_t N = xlength(x);
if (xlength(y) != N || xlength(b) != 1) {
error("Wrong lengths.");
}
const int *xp = INTEGER(x);
const int *yp = INTEGER(y);
const int *bp = INTEGER(b);
bool o = false;
for (R_xlen_t i = 0; i < N; ++i) {
if (xp[i] >= bp[0] || xp[i] == yp[i]) {
o = true;
break;
}
}
SEXP ans = PROTECT(allocVector(LGLSXP, 1));
LOGICAL(ans)[0] = o ? TRUE : FALSE;
UNPROTECT(1);
return ans;
}
However, for clarity I'd like to keep as much of the natural syntax as possible, like any_or(x >= b, x == y). So I'd like to be able to detect whether a call is of the form <vector> <operator> <vector> when <operator> is one of the standard binary operators, and each <vector> is of equal length vectors length 1. Something like this:
any_or2 <- function(expr1, expr2) {
sexp1 <- substitute(expr1)
sexp2 <- substitute(expr2)
if (!is_binary_sexp(sexp1) || !is_binary_sexp(sexp2) {
# fall through to just basic R
return(any(expr1 | expr2))
}
# In C
eval_any_or2(...) # either the substituted expression or x,y,b
}
I've attempted the following C function which detects whether a substituted expression/call is a binary expression, but (a) I'm having trouble detecting whether the operator is a binary operator and (b) getting the vectors from the expression (x, y, b in the example) to use later (either in the same C function or as passed to a C function like the one above).
#define return_false SEXP ans = PROTECT(allocVector(LGLSXP, 1)); \
LOGICAL(ans)[0] = FALSE; \
UNPROTECT(1); \
return ans; \
SEXP is_binary_sexp(SEXP sx) {
if (TYPEOF(sx) != LANGSXP) {
return_false
}
// does it have three elements?
int len = 0;
SEXP el, nxt;
for (nxt = sx; nxt != R_NilValue || len > 4; el = CAR(nxt), nxt = CDR(nxt)) {
len++;
}
if (len != 3) {
return_false;
}
if (TYPEOF(CAR(sx)) != SYMSXP) {
return_false;
}
SEXP ans = PROTECT(allocVector(LGLSXP, 1));
LOGICAL(ans)[0] = TRUE;
UNPROTECT(1);
return ans;
}
In R I would write something like:
is_binary_sexp_R <- function(sexprA) {
# sexprA is the result of substitute()
is.call(sexprA) &&
length(sexprA) == 3L &&
match(as.character(sexprA[[1]]), c("!=", "==", "<=", ">=", "<", ">"), nomatch = 0L) &&
is.name(lhs <- sexprA[[2L]])
}
but I'd like to do as much as possible in C.

Allocate Rcpp List of n NumericMatrix

Is there a way to allocate an Rcpp List of length n, where each element of the List will be filled with a NumericMatrix, but the size of each NumericMatrix can change?
I have an idea for doing this using std::list and push_back(), but the size of the list may be quite large and I want to avoid the overhead of creating an extra copy of the list when I return from the function.
The below R code gives an idea of what I hope to do:
myvec = function(n) {
x = vector("list", n)
for (i in seq_len(n)) {
nc = sample(1:3, 1)
nr = sample(1:3, 1)
x[[i]] = matrix(rbinom(nc * nr, size = 1, prob = 0.5),
nrow = nr, ncol = nc)
}
x
}
This could result in something like:
> myvec(2)
[[1]]
[,1]
[1,] 0
[2,] 1
[[2]]
[,1] [,2] [,3]
[1,] 0 1 0
[2,] 0 1 1
Update: based on the comments of #Dirk and #Ralf, I created functions based on Rcpp::List and std::list with a wrap at the end. Speed comparisons don't seem to favor one version over the other, but perhaps there's an inefficiency I'm not aware of.
src = '
#include <Rcpp.h>
// [[Rcpp::export]]
Rcpp::List myvec(int n) {
Rcpp::RNGScope rngScope;
Rcpp::List x(n);
// Rcpp::IntegerVector choices = {1, 2 ,3};
Rcpp::IntegerVector choices = Rcpp::seq_len(50);
for (int i = 0; i < n; ++i) {
int nc = Rcpp::sample(choices, 1).at(0);
int nr = Rcpp::sample(choices, 1).at(0);
Rcpp::NumericVector entries = Rcpp::rbinom(nc * nr, 1, 0.5);
x(i) = Rcpp::NumericMatrix(nc, nr, entries.begin());
}
return x;
}
// [[Rcpp::export]]
Rcpp::List myvec2(int n) {
Rcpp::RNGScope scope;
std::list< Rcpp::NumericMatrix > x;
// Rcpp::IntegerVector choices = {1, 2 ,3};
Rcpp::IntegerVector choices = Rcpp::seq_len(50);
for (int i = 0; i < n; ++i) {
int nc = Rcpp::sample(choices, 1).at(0);
int nr = Rcpp::sample(choices, 1).at(0);
Rcpp::NumericVector entries = Rcpp::rbinom(nc * nr, 1, 0.5);
x.push_back( Rcpp::NumericMatrix(nc, nr, entries.begin()));
}
return Rcpp::wrap(x);
}
'
sourceCpp(code = src)
Resulting benchmarks on my computer are:
> library(microbenchmark)
> rcpp_list = function() {
+ set.seed(10);myvec(105)
+ }
> std_list = function() {
+ set.seed(10);myvec2(105)
+ }
> microbenchmark(rcpp_list(), std_list(), times = 1000)
Unit: milliseconds
expr min lq mean median uq
rcpp_list() 1.8901 1.92535 2.205286 1.96640 2.22380
std_list() 1.9164 1.95570 2.224941 2.00555 2.32315
max neval cld
7.1569 1000 a
7.1194 1000 a
The fundamental issue that Rcpp objects are R objects governed my R's memory management where resizing is expensive: full copies.
So when I have tasks similar to yours where sizes may change, or are unknown, I often work with different data structures -- the STL gives us plenty -- and only convert to R(cpp) at the return step at the end.
The devil in the detail here (as always). Profile, experiment, ...
Edit: And in the narrower sense of "can we return a List of NumericMatrix objects with varying sizes" the answer is of course we can because that is what List objects do. You can also insert other types.
As Dirk said, it is of course possible to create a list with matrices of different size. To make it a bit more concrete, here a translation of your R function:
#include <Rcpp.h>
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::export]]
Rcpp::List myvec(int n) {
Rcpp::List x(n);
Rcpp::IntegerVector choices = {1, 2 ,3};
for (int i = 0; i < n; ++i) {
int nc = Rcpp::sample(choices, 1).at(0);
int nr = Rcpp::sample(choices, 1).at(0);
Rcpp::NumericVector entries = Rcpp::rbinom(nc * nr, 1, 0.5);
x(i) = Rcpp::NumericMatrix(nc, nr, entries.begin());
}
return x;
}
/***R
myvec(2)
*/
The main difference to the R code are the explicitly named vectors choices and entries, which are only implicit in the R code.

Interleaving results from many objects in Rcpp

I need to write to a file row by row of matrices and sparse matrices that appears in a list and I am doing something like this:
#include <RcppArmadillo.h>
// [[Rcpp::export]]
bool write_rows (Rcpp::List data, Rcpp::CharacterVector clss, int n) {
int len = data.length();
for(int i = 0; i<n; i++) {
for(int j=0; j<len; j++) {
if (clss[j] == "matrix") {
Rcpp::NumericMatrix x = data[j];
auto row = x.row(i);
// do something with row i
} else if (clss[j] == "dgCMatrix") {
arma::sp_mat x = data[j];
auto row = x.row(i);
// do something different with row i
}
}
}
return true;
}
This function can be called in R with:
data <- list(
x = Matrix::rsparsematrix(nrow = 1000, ncol = 1000, density = 0.3),
y = matrix(1:10000, nrow = 1000, ncol = 10)
)
clss <- c("dgCMatrix", "matrix")
write_rows(data, clss, 1000)
The function receives a list of matrices or sparse matrices with the same number of rows and writes those matrices row by row, ie. first writes first rows of all elements in data then the second row of all elements and etc.
My problem is that it seems that this line arma::sp_mat x = data[i]; seems to have a huge impact in performance since it seems that I am implicitly casting the list element data[j] to an Armadillo Sparse Matrix n times.
My question is: is there anyway I could avoid this? Is there a more efficient solution? I tried to find a solution by looking into readr's source code, since they also write list elements row by row, but they also do a cast for each row (in this line for example, but maybe this doesn't impact the performance because they deal with SEXPS?
With the clarification, it seems that the result should interleave the rows from each matrix. You can still do this while avoiding multiple conversions.
This is the original code, modified to generate some actual output:
// [[Rcpp::export]]
arma::mat write_rows(Rcpp::List data, Rcpp::CharacterVector clss, int nrows, int ncols) {
int len = data.length();
arma::mat result(nrows*len, ncols);
for (int i = 0, k = 0; i < nrows; i++) {
for (int j = 0; j < len; j++) {
arma::rowvec r;
if (clss[j] == "matrix") {
Rcpp::NumericMatrix x = data[j];
r = x.row(i);
}
else {
arma::sp_mat x = data[j];
r = x.row(i);
}
result.row(k++) = r;
}
}
return result;
}
The following code creates a vector of converted objects, and then extracts the rows from each object as required. The conversion is only done once per matrix. I use a struct containing a dense and sparse mat because it's a lot simpler than dealing with unions; and I don't want to drag in boost::variant or require C++17. Since there's only 2 classes we want to deal with, the overhead is minimal.
struct Matrix_types {
arma::mat m;
arma::sp_mat M;
};
// [[Rcpp::export]]
arma::mat write_rows2(Rcpp::List data, Rcpp::CharacterVector clss, int nrows, int ncols) {
const int len = data.length();
std::vector<Matrix_types> matr(len);
std::vector<bool> is_dense(len);
arma::mat result(nrows*len, ncols);
// populate the structs
for (int j = 0; j < len; j++) {
is_dense[j] = (clss[j] == "matrix");
if (is_dense[j]) {
matr[j].m = Rcpp::as<arma::mat>(data[j]);
}
else {
matr[j].M = Rcpp::as<arma::sp_mat>(data[j]);
}
}
// populate the result
for (int i = 0, k = 0; i < nrows; i++) {
for (int j = 0; j < len; j++, k++) {
if (is_dense[j]) {
result.row(k) = matr[j].m.row(i);
}
else {
arma::rowvec r(matr[j].M.row(i));
result.row(k) = r;
}
}
}
return result;
}
Running on some test data:
data <- list(
a=Matrix(1.0, 1000, 1000, sparse=TRUE),
b=matrix(2.0, 1000, 1000),
c=Matrix(3.0, 1000, 1000, sparse=TRUE),
d=matrix(4.0, 1000, 1000)
)
system.time(z <- write_rows(data, sapply(data, class), 1000, 1000))
# user system elapsed
# 185.75 35.04 221.38
system.time(z2 <- write_rows2(data, sapply(data, class), 1000, 1000))
# user system elapsed
# 4.21 0.05 4.25
identical(z, z2)
# [1] TRUE

Rcpp function for adding elements of a vector

I have a very long vector of parameters (approximately 4^10 elements) and a vector of indices. My aim is to add together all of the values of the parameters that are indexed in the indices vector.
For instance, if I had paras = [1,2,3,4,5,5,5] and indices = [3,3,1,6] then I would want to find the cumulative sum of the third value (3) twice, the first value (1) and the sixth (5), to get 12. There is additionally the option of warping the parameter values according to their location.
I am trying to speed up an R implementation, as I am calling it millions of times.
My current code always returns NA, and I can't see where it is going wrong
Here's the Rcpp function:
double dot_prod_c(NumericVector indices, NumericVector paras,
NumericVector warp = NA_REAL) {
int len = indices.size();
LogicalVector indices_ok;
for (int i = 0; i < len; i++){
indices_ok.push_back(R_IsNA(indices[i]));
}
if(is_true(any(indices_ok))){
return NA_REAL;
}
double counter = 0;
if(NumericVector::is_na(warp[1])){
for (int i = 0; i < len; i++){
counter += paras[indices[i]];
}
} else {
for (int i = 0; i < len; i++){
counter += paras[indices[i]] * warp[i];
}
}
return counter;
}
And here is the working R version:
dot_prod <- function(indices, paras, warp = NA){
if(is.na(warp[1])){
return(sum(sapply(indices, function(ind) paras[ind + 1])))
} else {
return(sum(sapply(1:length(indices), function(i){
ind <- indices[i]
paras[ind + 1] * warp[i]
})))
}
}
Here is some code for testing, and benchmarking using the microbenchmark package:
# testing
library(Rcpp)
library(microbenchmark)
parameters <- list()
indices <- list()
indices_trad <- list()
set.seed(2)
for (i in 4:12){
size <- 4^i
window_size <- 100
parameters[[i-3]] <- runif(size)
indices[[i-3]] <- floor(runif(window_size)*size)
temp <- rep(0, size)
for (j in 1:window_size){
temp[indices[[i-3]][j] + 1] <- temp[indices[[i-3]][j] + 1] + 1
}
indices_trad[[i-3]] <- temp
}
microbenchmark(
x <- sapply(1:9, function(i) dot_prod(indices[[i]], parameters[[i]])),
x_c <- sapply(1:9, function(i) dot_prod_c(indices[[i]], parameters[[i]])),
x_base <- sapply(1:9, function(i) indices_trad[[i]] %*% parameters[[i]])
)
all.equal(x, x_base) # is true, does work
all.equal(x_c, x_base) # not true - C++ version returns only NAs
I was having a little trouble trying to interpret your overall goal through your code, so I'm just going to go with this explanation
For instance, if I had paras = [1,2,3,4,5,5,5] and indices = [3,3,1,6]
then I would want to find the cumulative sum of the third value (3)
twice, the first value (1) and the sixth (5), to get 12. There is
additionally the option of warping the parameter values according to
their location.
since it was most clear to me.
There are some issues with your C++ code. To start, instead of doing this - NumericVector warp = NA_REAL - use the Rcpp::Nullable<> template (shown below). This will solve a few problems:
It's more readable. If you're not familiar with the Nullable class, it's pretty much exactly what it sounds like - an object that may or may not be null.
You won't have to make any awkward initializations, such as NumericVector warp = NA_REAL. Frankly I was surprised that the compiler accepted this.
You won't have to worry about accidentally forgetting that C++ uses zero-based indexing, unlike R, as in this line: if(NumericVector::is_na(warp[1])){. That has undefined behavior written all over it.
Here's a revised version, going off of your quoted description of the problem above:
#include <Rcpp.h>
typedef Rcpp::Nullable<Rcpp::NumericVector> nullable_t;
// [[Rcpp::export]]
double DotProd(Rcpp::NumericVector indices, Rcpp::NumericVector params, nullable_t warp_ = R_NilValue) {
R_xlen_t i = 0, n = indices.size();
double result = 0.0;
if (warp_.isNull()) {
for ( ; i < n; i++) {
result += params[indices[i]];
}
} else {
Rcpp::NumericVector warp(warp_);
for ( ; i < n; i++) {
result += params[indices[i]] * warp[i];
}
}
return result;
}
You had some elaborate code to generate sample data. I didn't take the time to go through this because it wasn't necessary, nor was the benchmarking. You stated yourself that the C++ version wasn't producing the correct results. Your first priority should be to get your code working on simple data. Then feed it some more complex data. Then benchmark. The revised version above works on simple data:
args <- list(
indices = c(3, 3, 1, 6),
params = c(1, 2, 3, 4, 5, 5, 5),
warp = c(.25, .75, 1.25, 1.75)
)
all.equal(
DotProd(args[[1]], args[[2]]),
dot_prod(args[[1]], args[[2]]))
#[1] TRUE
all.equal(
DotProd(args[[1]], args[[2]], args[[3]]),
dot_prod(args[[1]], args[[2]], args[[3]]))
#[1] TRUE
It's also faster than the R version on this sample data. I have no reason to believe it wouldn't be for larger, more complex data either - there's nothing magical or particularly efficient about the *apply functions; they are just more idiomatic / readable R.
microbenchmark::microbenchmark(
"Rcpp" = DotProd(args[[1]], args[[2]]),
"R" = dot_prod(args[[1]], args[[2]]))
#Unit: microseconds
#expr min lq mean median uq max neval
#Rcpp 2.463 2.8815 3.52907 3.3265 3.8445 18.823 100
#R 18.869 20.0285 21.60490 20.4400 21.0745 66.531 100
#
microbenchmark::microbenchmark(
"Rcpp" = DotProd(args[[1]], args[[2]], args[[3]]),
"R" = dot_prod(args[[1]], args[[2]], args[[3]]))
#Unit: microseconds
#expr min lq mean median uq max neval
#Rcpp 2.680 3.0430 3.84796 3.701 4.1360 12.304 100
#R 21.587 22.6855 23.79194 23.342 23.8565 68.473 100
I omitted the NA checks from the example above, but that too can be revised into something more idiomatic by using a little Rcpp sugar. Previously, you were doing this:
LogicalVector indices_ok;
for (int i = 0; i < len; i++){
indices_ok.push_back(R_IsNA(indices[i]));
}
if(is_true(any(indices_ok))){
return NA_REAL;
}
It's a little aggressive - you are testing a whole vector of values (with R_IsNA), and then applying is_true(any(indices_ok)) - when you could just break prematurely and return NA_REAL on the first instance of R_IsNA(indices[i]) resulting in true. Also, the use of push_back will slow down your function quite a bit - you would have been better off initializing indices_ok to the known size and filling it by index access in your loop. Nevertheless, here's one way to condense the operation:
if (Rcpp::na_omit(indices).size() != indices.size()) return NA_REAL;
For completeness, here's a fully sugar-ized version which allows you to avoid loops entirely:
#include <Rcpp.h>
typedef Rcpp::Nullable<Rcpp::NumericVector> nullable_t;
// [[Rcpp::export]]
double DotProd3(Rcpp::NumericVector indices, Rcpp::NumericVector params, nullable_t warp_ = R_NilValue) {
if (Rcpp::na_omit(indices).size() != indices.size()) return NA_REAL;
if (warp_.isNull()) {
Rcpp::NumericVector tmp = params[indices];
return Rcpp::sum(tmp);
} else {
Rcpp::NumericVector warp(warp_), tmp = params[indices];
return Rcpp::sum(tmp * warp);
}
}
/*** R
all.equal(
DotProd3(args[[1]], args[[2]]),
dot_prod(args[[1]], args[[2]]))
#[1] TRUE
all.equal(
DotProd3(args[[1]], args[[2]], args[[3]]),
dot_prod(args[[1]], args[[2]], args[[3]]))
#[1] TRUE
*/

Removing an element from a list in Rcpp

Suppose I have the following list:
x <- list(a = c(1, 2), b = c("a", "c"), c = 1:10)
In R, I can remove the first element using the following two methods:
x[-1]
x[1] <- NULL
I'm trying to do same thing in Rcpp, but I can't figure it out. Following code just assigns NULL to the first element.
// [[Rcpp::export]]
Rcpp::List removeElement(Rcpp::List x)
{
x[0] = R_NilValue;
return(x);
}
Any ideas?
What about
// [[Rcpp::export]]
Rcpp::List removeElement(Rcpp::List x, int j)
{
IntegerVector idx = seq_len(x.length());
return(x[idx != j]);
}
Or if you want the indices to start from 0 use
IntegerVector idx = seq_len(x.length()) - 1;

Resources