Converting String Versions of "Infinity" to Numeric in Rcpp - r

I have some JSON response that encodes Inf/-Inf/NaN as strings, so the JSON array it returns will look like [1.0, "Infinity", 2.0]. I parse this using a JSON library and end up with a list that looks like list(1.0, "Infinity", 2.0) and I want to convert it to be list(1.0, Inf, 2.0), for performance reasons I need this to use Rcpp. Here is the code I tried doing but I can't seem to get Rcpp to not yell at me about
library(Rcpp)
cppFunction('
NumericVector convertThings(List data) {
const size_t num_rows = data.size();
NumericVector rv(num_rows);
for (size_t i = 0; i < num_rows; ++i) {
if (as<String>(data[i]) == "Infinity") {
rv[i] = R_PosInf;
} else {
rv[i] = as<double>(data[i]);
}
}
return rv;
}
')
convertThings(list('Infinity', 1.0))
# expected output c(Inf, 1.0)
The error I am seeing is Error: not compatible with requested type. Help is much appreciated!

That is a basic C++ problem: how to convert text to numbers reliably.
One possibly answer is provided by the Boost.Lexical_Cast library and illustrated in this Rcpp Gallery post. Just using the first example:
R> library(Rcpp)
R> sourceCpp("/tmp/boostLexicalCastExample.cpp") # from post
R> lexcicalCast(c("Inf", "inf", "Infinity", "NA", 42))
[1] Inf Inf Inf NA 42
R>
As you can see, it matches at least three different ways of spelling infinity in text.

Related

Setting colnames in R's cpp11

Provided that cpp11 does not provide any "sugar", we need to use attributes.
I am trying to set colnames in a C++ function, as in the next MWE
#include "cpp11.hpp"
using namespace cpp11;
// THIS WORKS
[[cpp11::register]]
doubles cpp_names(writable::doubles X) {
X.attr("names") = {"a","b"};
return X;
}
// THIS WON'T
[[cpp11::register]]
doubles_matrix<> cpp_colnames(writable::doubles_matrix<> X) {
X.attr("dimnames") = list(NULL, {"A","B"});
return X;
}
How can I pass a list with two vectors, one NULL and the other ("a","b"), that can be correctly converted to a SEXP?
In Rcpp, one would do colnames(X) = {"a","b"}.
My approach can be wrong.

How is noNA used in Rcpp?

In his "Advanced R" book, Hadley Wickham says "noNA(x) asserts that the vector x does not contain any missing values." However I still don't know how to use it. I can't do
if (noNA(x))
do this
so how am I supposed to use it?
http://adv-r.had.co.nz/Rcpp.html#rcpp-sugar
Many of the Rcpp sugar expressions are implemented through template classes which have specializations for cases when the input object is known to be free of missing values, thereby allowing the underlying algorithm to avoid having to perform the extra work of dealing with NA values (e.g. calls to is_na). This is only possible because the VectorBase class has a boolean parameter indicating whether the underlying object can (can, not that it necessarily does) have NA values, or not.
noNA returns (when called on a VectorBase object) an instance of the Nona template class. Note that Nona itself derives from
Rcpp::VectorBase<RTYPE, false, Nona<RTYPE,NA,VECTOR>>
// ^^^^^
meaning that the returned object gets encoded with information that essentially says "you can assume that this data is free of NA values".
As an example, Rcpp::sum is implemented via the Sum class in the Rcpp::sugar namespace. In the default case, we see that there is extra work to manage the possibility of missing values:
STORAGE get() const {
STORAGE result = 0 ;
R_xlen_t n = object.size() ;
STORAGE current ;
for( R_xlen_t i=0; i<n; i++){
current = object[i] ;
if( Rcpp::traits::is_na<RTYPE>(current) ) // here
return Rcpp::traits::get_na<RTYPE>() ; // here
result += current ;
}
return result ;
}
On the other hand, there is also a specialization for cases when the input does not have missing values, in which the algorithm does less work:
STORAGE get() const {
STORAGE result = 0 ;
R_xlen_t n = object.size() ;
for( R_xlen_t i=0; i<n; i++){
result += object[i] ;
}
return result ;
}
To answer your question of "how do I apply this in practice?", here is an example:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
int Sum(IntegerVector x) {
return sum(x);
}
// [[Rcpp::export]]
int SumNoNA(IntegerVector x) {
return sum(noNA(x));
}
Benchmarking these two functions,
set.seed(123)
x <- as.integer(rpois(1e6, 25))
all.equal(Sum(x), SumNoNA(x))
# [1] TRUE
microbenchmark::microbenchmark(
Sum(x),
SumNoNA(x),
times = 500L
)
# Unit: microseconds
# expr min lq mean median uq max neval
# Sum(x) 577.386 664.620 701.2422 677.1640 731.7090 1214.447 500
# SumNoNA(x) 454.990 517.709 556.5783 535.1935 582.7065 1138.426 500
the noNA version is indeed faster.

memory efficient method to calculate distance matrix [duplicate]

I have an object of class big.matrix in R with dimension 778844 x 2. The values are all integers (kilometres). My objective is to calculate the Euclidean distance matrix using the big.matrix and have as a result an object of class big.matrix. I would like to know if there is an optimal way of doing that.
The reason for my choice of using the class big.matrix is memory limitation. I could transform my big.matrix to an object of class matrix and calculate the Euclidean distance matrix using dist(). However, dist() would return an object of size that would not be allocated in the memory.
Edit
The following answer was given by John W. Emerson, author and maintainer of the bigmemory package:
You could use big algebra I expect, but this would also be a very nice use case for Rcpp via sourceCpp(), and very short and easy. But in short, we don't even attempt to provide high-level features (other than the basics which we implemented as proof-of-concept). No single algorithm could cover all use cases once you start talking out-of-memory big.
Here is a way using RcppArmadillo. Much of this is very similar to the RcppGallery example. This will return a big.matrix with the associated pairwise (by row) euclidean distances. I like to wrap my big.matrix functions in a wrapper function to create a cleaner syntax (i.e. avoid the #address and other initializations.
Note - as we are using bigmemory (and therefore concerned with RAM usage) I have this example returned the N-1 x N-1 matrix of only lower triangular elements. You could modify this but this is what I threw together.
euc_dist.cpp
// To enable the functionality provided by Armadillo's various macros,
// simply include them before you include the RcppArmadillo headers.
#define ARMA_NO_DEBUG
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo, BH, bigmemory)]]
using namespace Rcpp;
using namespace arma;
// The following header file provides the definitions for the BigMatrix
// object
#include <bigmemory/BigMatrix.h>
// C++11 plugin
// [[Rcpp::plugins(cpp11)]]
template <typename T>
void BigArmaEuclidean(const Mat<T>& inBigMat, Mat<T> outBigMat) {
int W = inBigMat.n_rows;
for(int i = 0; i < W - 1; i++){
for(int j=i+1; j < W; j++){
outBigMat(j-1,i) = sqrt(sum(pow((inBigMat.row(i) - inBigMat.row(j)),2)));
}
}
}
// [[Rcpp::export]]
void BigArmaEuc(SEXP pInBigMat, SEXP pOutBigMat) {
// First we tell Rcpp that the object we've been given is an external
// pointer.
XPtr<BigMatrix> xpMat(pInBigMat);
XPtr<BigMatrix> xpOutMat(pOutBigMat);
int type = xpMat->matrix_type();
switch(type) {
case 1:
BigArmaEuclidean(
arma::Mat<char>((char *)xpMat->matrix(), xpMat->nrow(), xpMat->ncol(), false),
arma::Mat<char>((char *)xpOutMat->matrix(), xpOutMat->nrow(), xpOutMat->ncol(), false)
);
return;
case 2:
BigArmaEuclidean(
arma::Mat<short>((short *)xpMat->matrix(), xpMat->nrow(), xpMat->ncol(), false),
arma::Mat<short>((short *)xpOutMat->matrix(), xpOutMat->nrow(), xpOutMat->ncol(), false)
);
return;
case 4:
BigArmaEuclidean(
arma::Mat<int>((int *)xpMat->matrix(), xpMat->nrow(), xpMat->ncol(), false),
arma::Mat<int>((int *)xpOutMat->matrix(), xpOutMat->nrow(), xpOutMat->ncol(), false)
);
return;
case 8:
BigArmaEuclidean(
arma::Mat<double>((double *)xpMat->matrix(), xpMat->nrow(), xpMat->ncol(), false),
arma::Mat<double>((double *)xpOutMat->matrix(), xpOutMat->nrow(), xpOutMat->ncol(), false)
);
return;
default:
// We should never get here, but it resolves compiler warnings.
throw Rcpp::exception("Undefined type for provided big.matrix");
}
}
My little wrapper
bigMatrixEuc <- function(bigMat){
zeros <- big.matrix(nrow = nrow(bigMat)-1,
ncol = nrow(bigMat)-1,
init = 0,
type = typeof(bigMat))
BigArmaEuc(bigMat#address, zeros#address)
return(zeros)
}
The test
library(Rcpp)
sourceCpp("euc_dist.cpp")
library(bigmemory)
set.seed(123)
mat <- matrix(rnorm(16), 4)
bm <- as.big.matrix(mat)
# Call new euclidean function
bm_out <- bigMatrixEuc(bm)[]
# pull out the matrix elements for out purposes
distMat <- as.matrix(dist(mat))
distMat[upper.tri(distMat, diag=TRUE)] <- 0
distMat <- distMat[2:4, 1:3]
# check if identical
all.equal(bm_out, distMat, check.attributes = FALSE)
[1] TRUE

Remove NA values efficiently

I need to remove NA values efficiently from vectors inside a function which is implemented with RcppEigen. I can of course do it using a for loop, but I wonder if there is a more efficient way.
Here is an example:
library(RcppEigen)
library(inline)
incl <- '
using Eigen::Map;
using Eigen::VectorXd;
typedef Map<VectorXd> MapVecd;
'
body <- '
const MapVecd x(as<MapVecd>(xx)), y(as<MapVecd>(yy));
VectorXd x1(x), y1(y);
int k(0);
for (int i = 0; i < x.rows(); ++i) {
if (x.coeff(i)==x.coeff(i) && y.coeff(i)==y.coeff(i)) {
x1(k) = x.coeff(i);
y1(k) = y.coeff(i);
k++;
};
};
x1.conservativeResize(k);
y1.conservativeResize(k);
return Rcpp::List::create(Rcpp::Named("x") = x1,
Rcpp::Named("y") = y1);
'
na.omit.cpp <- cxxfunction(signature(xx = "Vector", yy= "Vector"),
body, "RcppEigen", incl)
na.omit.cpp(c(1.5, NaN, 7, NA), c(7.0, 1, NA, 3))
#$x
#[1] 1.5
#
#$y
#[1] 7
In my use case I need to do this about one million times in a loop (inside the Rcpp function) and the vectors could be quite long (let's assume 1000 elements).
PS: I've also investigated the route to find all NA/NaN values using x.array()==x.array(), but was unable to find a way to use the result for subsetting with Eigen.
Perhaps I am not understanding the question correctly, but within Rcpp, I don't see how you could possibly do this more efficiently than a for loop. for loops are generally inefficient in R only because iterating through a loop in R requires a lot of heavy interpreted machinery. But this is not the case once you are down at the C++ level. Even natively vectorized R functions ultimately are implemented with for loops in C. So the only way I can think to make this more efficient is to try to do it in parallel.
For example, here's a simple na.omit.cpp function that omits NA values from a single vector:
rcppfun<-"
Rcpp::NumericVector naomit(Rcpp::NumericVector x){
std::vector<double> r(x.size());
int k=0;
for (int i = 0; i < x.size(); ++i) {
if (x[i]==x[i]) {
r[k] = x[i];
k++;
}
}
r.resize(k);
return Rcpp::wrap(r);
}"
na.omit.cpp<-cppFunction(rcppfun)
This runs even more quickly than R's built in na.omit:
> set.seed(123)
> x<-1:10000
> x[sample(10000,1000)]<-NA
> y1<-na.omit(x)
> y2<-na.omit.cpp(x)
> all(y1==y2)
[1] TRUE
> require(microbenchmark)
> microbenchmark(na.omit(x),na.omit.cpp(x))
Unit: microseconds
expr min lq median uq max neval
na.omit(x) 290.157 363.9935 376.4400 401.750 6547.447 100
na.omit.cpp(x) 107.524 168.1955 173.6035 210.524 222.564 100
I do not know if I understand the problem correctly or not but you can use the following arguments:
a = c(1.5, NaN, 7, NA)
a[-which(is.na(a))]
[1] 1.5 7.0
It might be useful to use `rinside' if you want to use it in C++.

Passing a `data.table` to c++ functions using `Rcpp` and/or `RcppArmadillo`

Is there a way to pass a data.table objects to c++ functions using Rcpp and/or RcppArmadillo without manually transforming to data.table to a data.frame? In the example below test_rcpp(X2) and test_arma(X2) both fail with c++ exception (unknown reason).
R code
X=data.frame(c(1:100),c(1:100))
X2=data.table(X)
test_rcpp(X)
test_rcpp(X2)
test_arma(X)
test_arma(X2)
c++ functions
NumericMatrix test_rcpp(NumericMatrix X) {
return(X);
}
mat test_arma(mat X) {
return(X);
}
Building on top of other answers, here is some example code:
#include <Rcpp.h>
using namespace Rcpp ;
// [[Rcpp::export]]
double do_stuff_with_a_data_table(DataFrame df){
CharacterVector x = df["x"] ;
NumericVector y = df["y"] ;
IntegerVector z = df["v"] ;
/* do whatever with x, y, v */
double res = sum(y) ;
return res ;
}
So, as Matthew says, this treats the data.table as a data.frame (aka a Rcpp::DataFrame in Rcpp).
require(data.table)
DT <- data.table(
x=rep(c("a","b","c"),each=3),
y=c(1,3,6),
v=1:9)
do_stuff_with_a_data_table( DT )
# [1] 30
This completely ignores the internals of the data.table.
Try passing the data.table as a DataFrame rather than NumericMatrix. It is a data.frame anyway, with the same structure, so you shouldn't need to convert it.
Rcpp sits on top of native R types encoded as SEXP. This includes eg data.frame or matrix.
data.table is not native, it is an add-on. So someone who wants this (you?) has to write a converter, or provide funding for someone else to write one.
For reference, I think the good thing is to output a list from rcpp as data.table allow update via lists.
Here is a dummy example:
cCode <-
'
DataFrame DT(DTi);
NumericVector x = DT["x"];
int N = x.size();
LogicalVector b(N);
NumericVector d(N);
for(int i=0; i<N; i++){
b[i] = x[i]<=4;
d[i] = x[i]+1.;
}
return Rcpp::List::create(Rcpp::Named("b") = b, Rcpp::Named("d") = d);
';
require("data.table");
require("rcpp");
require("inline");
DT <- data.table(x=1:9,y=sample(letters,9)) #declare a data.table
modDataTable <- cxxfunction(signature(DTi="data.frame"), plugin="Rcpp", body=cCode)
DT_add <- modDataTable(DT) #here we get the list
DT[, names(DT_add):=DT_add] #here we update by reference the data.table

Resources