R: Fast hash search in lists (environment)

I want to have a very quick search and it seems that using hashes (via environments) is the best way. I got an example running with environments, but it does not return what I need.
Here is an example:
a <- data.table::data.table(a=c(1, 3, 5), b=c(2, 4, 6), time=c(10, 20, 30))
my_env <- list2env(a)
x <- a[2, .(a, b)] # x=c(3,4)
found <- get("x", envir = my_env)
I would expect found = c(3, 4, 20) but receive found = c(3, 4)
(I want the whole row to be returned instead of the unknown row subset)
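(A minimal sketch of the behaviour I am after - keying an environment by the concatenated a and b values and storing the whole row; the key format is my own choice:)
row_env <- new.env(hash = TRUE)
assign(paste(3, 4, sep = "|"), a[2], envir = row_env)  # store the complete second row under key "3|4"
get(paste(3, 4, sep = "|"), envir = row_env, inherits = FALSE)  # a=3, b=4, time=20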
Background: I have a huge list containing source and destination of routes calculated with osrm, e.g.
latitude1, longitude1, latitude2, longitude2, travel-time
46.12, 8.32, 47.87, 9.92, 1036
...
In a first example, the list contains about 100,000 rows. Using binary search in a data.table sped up my code by a factor of 100, but one search still takes about 1 ms. As I have to search for many routes during a simulation (about 2e5 searches), I would like to get even faster.
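(For reference, the keyed data.table lookup I currently use looks roughly like this; the one-row table is made up for illustration:)
routes <- data.table::data.table(latitude1 = 46.12, longitude1 = 8.32,
                                 latitude2 = 47.87, longitude2 = 9.92, travel_time = 1036)
data.table::setkey(routes, latitude1, longitude1, latitude2, longitude2)
routes[.(46.12, 8.32, 47.87, 9.92), travel_time]  # binary search on the key columns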
@Gregor: I am a beginner in R, but I don't think my question is a duplicate:
I knew the second link, which is an abstract overview for experts listing possibilities. Furthermore, it is 4 years old.
I didn't know the first link, but from those answers I can't see whether I should switch to environments or how an implementation would work at all. There is also no discussion about searching a part of a huge list.
Summary (Thanks to DigEmAll for his running example below):
Using Rcpp on integers, the search uses less memory without any loss of quality. Furthermore, it is about a factor of 3 faster.
Do not use hashed environments when you want to look up doubles (which have to be converted to strings).
Implementation in existing code should be easy.

Here's an example using an environment and a data.table; the code is pretty self-explanatory:
library(data.table)
# create a big random example (160k rows)
set.seed(123)
fromTo <- expand.grid(1:400,1:400)
colnames(fromTo) <- c('a','b')
DF <- as.data.frame(cbind(fromTo,time=as.integer(runif(nrow(fromTo), min = 1, max=500))))
# setup the environment to use it as hashtable:
# we simply put the times inside an environment using
# a|b (concatenation of a with b) as key
timesList <- as.list(DF$time)
names(timesList) <- paste(DF$a,DF$b,sep='|')
timesEnv <- list2env(timesList)
# setup the data.table to use it as hashtable
DT <- setDT(DF,key=c('a','b'))
# create search functions
searchUsingEnv <- function(a,b){
time <- get(paste(a,b,sep='|'),envir=timesEnv,inherits=FALSE)
return(time)
}
searchUsingDataTable <- function(from,to){
time <- DT[.(from,to),time]
return(time)
}
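For illustration, a single lookup works the same way with either function (added usage example; the pair 5|20 exists in the generated grid):
searchUsingEnv(5, 20)        # time for the route a=5, b=20, via the environment
searchUsingDataTable(5, 20)  # the same value, via the keyed data.table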
Benchmark:
# benchmark functions
# i.e. we try to search ~16K rows in our two kinds of hashtables
benchEnv <- function(){
n <- nrow(fromTo)
s <- as.integer(n * 0.9)
for(i in s:n){
searchUsingEnv(fromTo[i,'a'],fromTo[i,'b'])
}
}
benchDT <- function(){
n <- nrow(fromTo)
s <- as.integer(n * 0.9)
for(i in s:n){
searchUsingDataTable(fromTo[i,'a'],fromTo[i,'b'])
}
}
# let's measure the performances
> system.time(benchEnv(), gcFirst = TRUE)
user system elapsed
2.26 0.00 2.30
> system.time(benchDT(), gcFirst = TRUE)
user system elapsed
42.34 0.00 42.56
Conclusions:
The environment seems much faster than data.table for repeated single-key access, so you can try to use it.
EDIT:
Environments have fast access, but they can only have string keys, which occupy more memory than doubles. So, I've added an example using Rcpp and a std::map<> with a multi-value key:
(note: if you are on Windows you need to install RTools in order to make Rcpp work)
library(data.table)
library(Rcpp)
library(inline)
nRows <- 1e7
############# create data.table "DT" containing coordinates and times
generate_routes_dt <- function(nmax) {
set.seed(123)
routes <- data.table(lat1 = numeric(nmax),
lng1 = numeric(nmax),
lat2 = numeric(nmax),
lng2 = numeric(nmax),
time = numeric(nmax))
tmp <- sample(seq(46, 49, length.out = nmax), nmax)
routes$lat1 <- tmp
tmp <- sample(seq(8, 10, length.out = nmax), nmax)
routes$lng1 <- tmp
tmp <- sample(seq(46, 49, length.out = nmax), nmax)
routes$lat2 <- tmp
tmp <- sample(seq(8, 10, length.out = nmax), nmax)
routes$lng2 <- tmp
tmp <- sample(seq(0, 1e7, length.out = nmax), nmax)
routes$time <- as.integer(tmp)
data.table::setkey(routes, lat1, lng1, lat2, lng2)
return(routes)
}
DT <- generate_routes_dt(nRows)
############# create data.table search function
searchUsingDataTable <- function(lat_1,lng_1,lat_2,lng_2){
time <- DT[.(lat_1,lng_1,lat_2,lng_2),time]
return(time)
}
#############
############# create Rcpp search function
# the following code creates 2 functions: createMap and getTime
# usage:
# map <- createMap(lat1Vec,lng1Vec,lat2Vec,lng2Vec,timesVec)
# t <- getTime(map,lat1,lng1,lat2,lng2)
sourceCpp(code=
'
#include <Rcpp.h>
class MultiKey {
public:
double lat1;
double lng1;
double lat2;
double lng2;
MultiKey(double la1, double ln1, double la2, double ln2)
: lat1(la1), lng1(ln1), lat2(la2), lng2(ln2) {}
bool operator<(const MultiKey &right) const
{
if ( lat1 == right.lat1 ) {
if ( lng1 == right.lng1 ) {
if ( lat2 == right.lat2 ) {
return lng2 < right.lng2;
}
else {
return lat2 < right.lat2;
}
}
else {
return lng1 < right.lng1;
}
}
else {
return lat1 < right.lat1;
}
}
};
// [[Rcpp::export]]
SEXP createMap(Rcpp::NumericVector lat1,
Rcpp::NumericVector lng1,
Rcpp::NumericVector lat2,
Rcpp::NumericVector lng2,
Rcpp::NumericVector times){
std::map<MultiKey, double>* map = new std::map<MultiKey, double>;
int n1 = lat1.size();
int n2 = lng1.size();
int n3 = lat2.size();
int n4 = lng2.size();
int n5 = times.size();
if(!(n1 == n2 && n2 == n3 && n3 == n4 && n4 == n5)){
throw std::range_error("input vectors lengths are different");
}
for(int i = 0; i < n1; i++){
MultiKey key(lat1[i],lng1[i],lat2[i],lng2[i]);
map->insert(std::pair<MultiKey, double>(key, times[i]));
}
Rcpp::XPtr< std::map<MultiKey, double> > p(map, true);
return( p );
}
// [[Rcpp::export]]
Rcpp::NumericVector getTime(SEXP mapPtr,
double lat1,
double lng1,
double lat2,
double lng2){
Rcpp::XPtr< std::map<MultiKey, double> > ptr(mapPtr);
MultiKey key(lat1,lng1,lat2,lng2);
std::map<MultiKey,double>::iterator it = ptr->find(key);
if(it == ptr->end())
return R_NilValue;
return Rcpp::wrap(it->second);
}
')
map <- createMap(DT$lat1,DT$lng1,DT$lat2,DT$lng2,DT$time)
searchUsingRcpp <- function(lat_1,lng_1,lat_2,lng_2){
time <- getTime(map,lat_1,lng_1,lat_2,lng_2)
return(time)
}
#############
############# benchmark
set.seed(1234)
rowsToSearchOneByOne <- DT[sample.int(nrow(DT),size=nrow(DT),replace=FALSE),]
bench <- function(searchFun2Use){
for(i in nrow(rowsToSearchOneByOne)){
key <- rowsToSearchOneByOne[i,]
searchFun2Use(key$lat1,key$lng1,key$lat2,key$lng2)
}
}
microbenchmark::microbenchmark(
bench(searchUsingRcpp),
bench(searchUsingDataTable),
times=100)
#############
Benchmark result :
Unit: microseconds
expr min lq mean median uq max neval
bench(searchUsingRcpp) 360.959 381.7585 400.4466 391.999 403.9985 665.597 100
bench(searchUsingDataTable) 1103.034 1138.0740 1214.3008 1163.514 1224.9530 2035.828 100
Note:
I really don't think that using doubles as keys is a good idea... floating-point values should be searched with a certain tolerance or within a range, not looked up for an exact match inside a map.
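One workaround in that spirit (an illustrative sketch added here, not part of the original benchmark): round the coordinates to a fixed precision and key on integers instead of doubles. The factor 1e5 (roughly metre precision) is an assumption; pick whatever matches your data. Rounding also gives you the tolerance mentioned above, since coordinates within 1e-5 degrees collapse onto the same key.
to_key <- function(coord) as.integer(round(coord * 1e5))
DT[, c("lat1_i", "lng1_i", "lat2_i", "lng2_i") :=
     .(to_key(lat1), to_key(lng1), to_key(lat2), to_key(lng2))]
data.table::setkey(DT, lat1_i, lng1_i, lat2_i, lng2_i)
r <- DT[1]  # take an existing route to look up
DT[.(r$lat1_i, r$lng1_i, r$lat2_i, r$lng2_i), time]  # exact match on integer keys is safe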

Related

Fast random sampling from matrix of cumulative probability mass functions in R

I have a matrix (mat_cdf) representing the cumulative probability that an individual in census tract i moves to census tract j on a given day. Given a vector of agents who decide not to "stay home", I have a function, GetCTMove below, to randomly sample from this matrix to determine which census tract they will spend time in.
# Random generation
cts <- 500
i <- rgamma(cts, 50, 1)
prop <- 1:cts
# Matrix where rows correspond to probability mass of column integer
mat <- do.call(rbind, lapply(i, function(i){dpois(prop, i)}))
# Convert to cumulative probability mass
mat_cdf <- matrix(NA, cts, cts)
for(i in 1:cts){
# Create cdf for row i
mat_cdf[i,] <- sapply(1:cts, function(j) sum(mat[i,1:j]))
}
GetCTMove <- function(agent_cts, ct_mat_cdf){
# Expand such that every agent has its own row corresponding to CDF of movement from their home ct i to j
mat_expand <- ct_mat_cdf[agent_cts,]
# Probabilistically sample column index for every row by generating random number and then determining corresponding closest column
s <- runif(length(agent_cts))
fin_col <- max.col(s < mat_expand, "first")
return(fin_col)
}
# Sample of 500,000 agents' residence ct
agents <- sample(1:cts, size = 500000, replace = T)
# Run function
system.time(GetCTMove(agents, mat_cdf))
user system elapsed
3.09 1.19 4.30
Working with 1 million agents, each sample takes ~10 seconds to run; multiplied by many time steps, this leads to hours for each simulation, and this function is by far the rate-limiting factor of the model. I'm wondering if anyone has advice on a faster implementation of this kind of random sampling. I've used the dqrng package to speed up random number generation, but that's relatively minuscule compared to the matrix expansion (mat_expand) and max.col calls, which take the longest to run.
The first thing that you can optimise is the following code:
max.col(s < mat_expand, "first")
Since s < mat_expand returns a logical matrix, applying the max.col function is the same as getting the first TRUE in each row. In this case, using which will be much more efficient. Also, as shown below, you store all your CDFs in a matrix.
mat <- do.call(rbind, lapply(i, function(i){dpois(prop, i)}))
mat_cdf <- matrix(NA, cts, cts)
for(i in 1:cts){
mat_cdf[i,] <- sapply(1:cts, function(j) sum(mat[i,1:j]))
}
This structure may not be optimal. A list structure is better for applying functions like which. It is also faster to run as you do not have to go through a do.call(rbind, ...).
# using a list structure to speed up the creation of cdfs
ls_cdf <- lapply(i, function(x) cumsum(dpois(prop, x)))
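To see the which trick in isolation, here is a toy single lookup (the numbers are made up for illustration and are not part of the original answer):
cdf_i <- c(0.2, 0.5, 0.9, 1.0)  # one CDF, e.g. an element of ls_cdf
s_i <- 0.6                      # one uniform draw
which(s_i < cdf_i)[1]                            # first column whose CDF exceeds s_i -> 3
max.col(matrix(s_i < cdf_i, nrow = 1), "first")  # the max.col equivalent -> 3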
Below is your implementation:
# Implementation 1
GetCTMove <- function(agent_cts, ct_mat_cdf){
mat_expand <- ct_mat_cdf[agent_cts,]
s <- runif(length(agent_cts))
fin_col <- max.col(s < mat_expand, "first")
return(fin_col)
}
On my desktop, it takes about 2.68s to run.
> system.time(GetCTMove(agents, mat_cdf))
user system elapsed
2.25 0.41 2.68
With a list structure and a which function, the run time can be reduced by about 1s.
# Implementation 2
GetCTMove2 <- function(agent_cts, ls_cdf){
n <- length(agent_cts)
s <- runif(n)
out <- integer(n)
i <- 1L
while (i <= n) {
out[[i]] <- which(s[[i]] < ls_cdf[[agent_cts[[i]]]])[[1L]]
i <- i + 1L
}
out
}
> system.time(GetCTMove2(agents, ls_cdf))
user system elapsed
1.59 0.02 1.64
To my knowledge, with R alone there is no other way to further speed up the code. However, you can indeed improve the performance by rewriting the key function GetCTMove in C++. With the Rcpp package, you can do something like the following:
# Implementation 3
Rcpp::cppFunction('NumericVector fast_GetCTMove(NumericVector agents, NumericVector s, List cdfs) {
int n = agents.size();
NumericVector out(n);
for (int i = 0; i < n; ++i) {
NumericVector cdf = as<NumericVector>(cdfs[agents[i] - 1]);
int m = cdf.size();
for (int j = 0; j < m; ++j) {
if (s[i] < cdf[j]) {
out[i] = j + 1;
break;
}
}
}
return out;
}')
GetCTMove3 <- function(agent_cts, ls_cdf){
s <- runif(length(agent_cts))
fast_GetCTMove(agent_cts, s, ls_cdf)
}
This implementation is lightning fast, which should fulfil all your needs.
> system.time(GetCTMove3(agents, ls_cdf))
user system elapsed
0.07 0.00 0.06
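Before the full script, a quick sanity check (my addition, using the objects defined in the script below): all three implementations pick the first column whose CDF exceeds the same uniform draw, so reseeding before each call should give matching results.
set.seed(1); r1 <- GetCTMove(agents, mat_cdf)
set.seed(1); r2 <- GetCTMove2(agents, ls_cdf)
set.seed(1); r3 <- GetCTMove3(agents, ls_cdf)
all(r1 == r2) && all(r2 == r3)  # expected TRUE, up to floating-point differences in the CDFs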
The full script is attached as follows:
# Random generation
cts <- 500
i <- rgamma(cts, 50, 1)
prop <- 1:cts
agents <- sample(1:cts, size = 500000, replace = T)
# using a list structure to speed up the creation of cdfs
ls_cdf <- lapply(i, function(x) cumsum(dpois(prop, x)))
# below is your code
mat <- do.call(rbind, lapply(i, function(i){dpois(prop, i)}))
mat_cdf <- matrix(NA, cts, cts)
for(i in 1:cts){
mat_cdf[i,] <- sapply(1:cts, function(j) sum(mat[i,1:j]))
}
# Implementation 1
GetCTMove <- function(agent_cts, ct_mat_cdf){
mat_expand <- ct_mat_cdf[agent_cts,]
s <- runif(length(agent_cts))
fin_col <- max.col(s < mat_expand, "first")
return(fin_col)
}
# Implementation 2
GetCTMove2 <- function(agent_cts, ls_cdf){
n <- length(agent_cts)
s <- runif(n)
out <- integer(n)
i <- 1L
while (i <= n) {
out[[i]] <- which(s[[i]] < ls_cdf[[agent_cts[[i]]]])[[1L]]
i <- i + 1L
}
out
}
# Implementation 3
Rcpp::cppFunction('NumericVector fast_GetCTMove(NumericVector agents, NumericVector s, List cdfs) {
int n = agents.size();
NumericVector out(n);
for (int i = 0; i < n; ++i) {
NumericVector cdf = as<NumericVector>(cdfs[agents[i] - 1]);
int m = cdf.size();
for (int j = 0; j < m; ++j) {
if (s[i] < cdf[j]) {
out[i] = j + 1;
break;
}
}
}
return out;
}')
GetCTMove3 <- function(agent_cts, ls_cdf){
s <- runif(length(agent_cts))
fast_GetCTMove(agent_cts, s, ls_cdf)
}
system.time(GetCTMove(agents, mat_cdf))
system.time(GetCTMove2(agents, ls_cdf))
system.time(GetCTMove3(agents, ls_cdf))

How to improve processing time for euclidean distance calculation

I'm trying to calculate the weighted Euclidean distance (squared) between two data frames that have the same number of columns (variables) and different numbers of rows (observations).
The calculation follows the formula:
DIST[m,i] <- sum(((DATA1[m,] - DATA2[i,]) ^ 2) * lambda[1,])
I specifically need to multiply each term of the summation by a specific weight (lambda).
The code provided below runs correctly, but if I use it in hundreds of iterations it takes a lot of processing time. Yesterday it took 18 hours to create a graphic using multiple iterations of a function that contains this calculation. Using library(profvis); profvis({ my code }), I saw that this specific part of the code takes up about 80% of the processing time.
I read a lot about reducing processing time using parallel and vectorized operations, but I don't know how to implement them in this particular case because of the weight lambda.
Can someone help me reduce the processing time of this code?
More information about the code and the structure of the data can be found as comments in the code provided below.
# Data frames used to calculate the euclidean distances between each observation
# from DATA1 and each observation from DATA2.
# The euclidean distance is between a [600x50] and a [8X50] dataframes, resulting
# in a [600X8] dataframe.
DATA1 <- matrix(rexp(30000, rate=.1), ncol=50) #[600x50]
DATA2 <- matrix(rexp(400, rate=.1), ncol=50) #[8X50]
# Weights used for each of the 50 variables to calculate the weighted
# euclidean distance.
# Can be a vector of different weights or a scalar of the same weight
# for all variables.
lambda <- runif(n=50, min=0, max=10) ## length(lambda) > 1
# lambda=1 ## length(lambda) == 1
if (length(lambda) > 1) {
as.numeric(unlist(lambda))
lambda <- as.matrix(lambda)
lambda <- t(lambda)
}
nrows1 <- nrow(DATA1)
nrows2 <- nrow(DATA2)
# Euclidean Distance calculation
DIST <- matrix(NA, nrow=nrows1, ncol=nrows2 )
for (m in 1:nrows1) {
for (i in 1:nrows2) {
if (length(lambda) == 1) {
DIST[m, i] <- sum((DATA1[m, ] - DATA2[i, ])^2)
}
if (length(lambda) > 1){
DIST[m, i] <- sum(((DATA1[m, ] - DATA2[i, ])^2) * lambda[1, ])
}
next
}
next
}
After all the suggestions, combining the answers from @MDWITT (for length(lambda) > 1) and @F. Privé (for length(lambda) == 1), the final solution took only one minute to run, whilst the original one took an hour and a half, within a bigger code that contains this calculation. The final code for this problem, for those interested, is:
#Data frames used to calculate the euclidean distances between each observation from DATA1 and each observation from DATA2.
#The euclidean distance is between a [600x50] and a [8X50] dataframes, resulting in a [600X8] dataframe.
DATA1 <- matrix(rexp(30000, rate=.1), ncol=50) #[600x50]
DATA2 <- matrix(rexp(400, rate=.1), ncol=50) #[8X50]
#Weights used for each of the 50 variables to calculate the weighted euclidean distance.
#Can be a vector of different weights or a scalar of the same weight for all variables.
#lambda <- runif(n = 50, min = 0, max = 10) ##length(lambda) > 1
lambda = 1 ##length(lambda) == 1
nrows1 <- nrow(DATA1)
nrows2 <- nrow(DATA2)
#Euclidean Distance calculation
DIST <- matrix(NA, nrow = nrows1, ncol = nrows2)
if (length(lambda) > 1){
as.numeric(unlist(lambda))
lambda <- as.matrix(lambda)
lambda <- t(lambda)
library(Rcpp)
cppFunction('NumericMatrix weighted_distance (NumericMatrix x, NumericMatrix y, NumericVector lambda){
int n_x = x.nrow();
int n_y = y.nrow();
NumericMatrix DIST(n_x, n_y);
//begin the loop
for (int i = 0 ; i < n_x; i++){
for (int j = 0 ; j < n_y ; j ++) {
double d = sum(pow(x.row(i) - y.row(j), 2)*lambda);
DIST(i,j) = d;
}
}
return (DIST) ;
}')
DIST <- weighted_distance(DATA1, DATA2, lambda = lambda)}
if (length(lambda) == 1) {
DIST <- outer(rowSums(DATA1^2), rowSums(DATA2^2), '+') - tcrossprod(DATA1, 2 * DATA2)
}
Rewrite the problem to use linear algebra and vectorization, which is much faster than loops.
If you don't have lambda, this is just
outer(rowSums(DATA1^2), rowSums(DATA2^2), '+') - tcrossprod(DATA1, 2 * DATA2)
With lambda, it becomes
outer(drop(DATA1^2 %*% lambda), drop(DATA2^2 %*% lambda), '+') -
tcrossprod(DATA1, sweep(DATA2, 2, 2 * lambda, '*'))
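A quick way to convince yourself that the vectorized formulas match the original double loop (added check, reusing DATA1, DATA2 and lambda as generated in the question):
DIST_loop <- matrix(NA, nrow(DATA1), nrow(DATA2))
for (m in 1:nrow(DATA1))
  for (i in 1:nrow(DATA2))
    DIST_loop[m, i] <- sum(((DATA1[m, ] - DATA2[i, ])^2) * lambda)
DIST_vec <- outer(drop(DATA1^2 %*% lambda), drop(DATA2^2 %*% lambda), '+') -
  tcrossprod(DATA1, sweep(DATA2, 2, 2 * lambda, '*'))
all.equal(DIST_loop, DIST_vec)  # should be TRUE up to floating-point error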
Here is an alternate way using Rcpp, just to have this concept documented. In a file called euclidean.cpp I have
#include <Rcpp.h>
#include <cmath>
using namespace Rcpp;
// [[Rcpp::export]]
NumericMatrix weighted_distance (NumericMatrix x, NumericMatrix y, NumericVector lambda){
int n_x = x.nrow();
int n_y = y.nrow();
NumericMatrix out(n_x, n_y);
//begin the loop
for (int i = 0 ; i < n_x; i++){
for (int j = 0 ; j < n_y ; j ++) {
double d = sum(pow(x.row(i) - y.row(j), 2)*lambda);
out(i,j) = d;
}
}
return (out) ;
}
In R, then I have
library(Rcpp)
sourceCpp("libs/euclidean.cpp")
# Generate Data
DATA1 <- matrix(rexp(30000, rate=.1), ncol=50) #[600x50]
DATA2 <- matrix(rexp(400, rate=.1), ncol=50) #[8X50]
lambda <- runif(n=50, min=0, max=10)
# Run the program
out <- weighted_distance(DATA1, DATA2, lambda = lambda)
When I test the speed using:
microbenchmark(
Rcpp_way = weighted_distance(DATA1, DATA2, lambda = lambda),
other = {DIST <- matrix(NA, nrow=nrows1, ncol=nrows2)
for (m in 1:nrows1) {
for (i in 1:nrows2) {
if (length(lambda) == 1) {
DIST[m, i] <- sum((DATA1[m, ] - DATA2[i, ])^2)
}
if (length(lambda) > 1){
DIST[m, i] <- sum(((DATA1[m, ] - DATA2[i, ])^2) * lambda[1, ])
}
next
}
next
}}, times = 100)
You can see that it is a good clip faster:
Unit: microseconds
expr min lq mean median uq max neval
Rcpp_way 446.769 492.308 656.9849 562.667 846.9745 1169.231 100
other 24688.821 30681.641 44153.5264 37511.385 50878.3585 200843.898 100

How to reverse a number in R

I want to write a function to reverse the digits of any number. Here is what I have, but it doesn't work. Please help me!
n=123
rev_number=function(n){
m=strsplit(as.character(n),"")
if (m==rev(m)) print("reversed number")
}
The desired output is n=321
I feel like reversing an integer should stay in the integer world instead of getting into the string-manipulation world. It seems there isn't a built-in function for such a task in R, so we can create one, using the Rcpp package for instance. Here's an example
library(Rcpp)
cppFunction('int Reverse_CPP(int x) {
int reverse = 0;
while(x != 0) {
int remainder = x%10;
reverse = reverse*10 + remainder;
x/= 10;
}
return reverse ;
}')
Reverse_CPP(1234)
# [1] 4321
And here's a vectorized version
cppFunction('IntegerVector Reverse_CPP2(IntegerVector x) {
int n = x.size();
IntegerVector out(n);
IntegerVector xx = clone(x); // Will need this if you do not want to modify x in place
for (int i = 0; i < n; ++i){
int reverse = 0;
while(xx[i] != 0) {
int remainder = xx[i]%10;
reverse = reverse*10 + remainder;
xx[i]/= 10;
}
out[i] = reverse;
}
return out;
}')
Reverse_CPP2(c(12345, 21331, 4324234, 4243))
# [1] 54321 13312 4324234 3424
Note that I had to add IntegerVector xx = clone(x); and hence slow the function drastically (see @alexis_laz's comment), as Rcpp would otherwise modify the original x by reference. You don't need that if you are passing a bare vector or if you don't care whether the original vector is modified.
Some benchmarks against other vectorized string manipulation functions
Stringi <- function(x) as.integer(stringi::stri_reverse(x))
Base <- function(x) {
as.integer(vapply(lapply(strsplit(as.character(x), "", fixed = TRUE), rev),
paste, collapse = "", FUN.VALUE = character(1L)))
}
library(microbenchmark)
set.seed(123)
x <- sample(1e3L:1e7L, 1e5, replace = TRUE)
microbenchmark(
Base(x),
Stringi(x),
Reverse_CPP2(x)
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# Base(x) 855.985729 913.602215 994.60640 976.836206 1025.482170 1867.448511 100 c
# Stringi(x) 86.244426 94.882566 105.58049 102.962924 110.334702 179.918461 100 b
# Reverse_CPP2(x) 1.842699 1.865594 2.06674 1.947703 2.076983 6.546552 100 a
Here's my attempt, extending your code:
n=-123
rev_number=function(n){
if(n>0)
return(as.integer(paste0(rev(unlist(strsplit(as.character(n), ""))), collapse = "")))
else{
return(-as.integer(paste0(rev(unlist(strsplit(as.character(abs(n)), ""))), collapse = "")))
}
}
rev_number(n)
The if/else handles negative numbers. First the integer is converted to characters, then strsplit splits the number into its digits. unlist converts the resulting list to a vector, rev reverses the vector, and paste collapses it back into a single string. Finally, as.integer converts the string to an int.
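For example, a quick check of both branches:
rev_number(-123)
# [1] -321
rev_number(456)
# [1] 654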
If you do not wish to convert the number to characters, the version below needs no additional packages but only works with positive integers:
reverse_number <- function(n){
rev <- 0
while (n > 0) {
r <- n %% 10
rev <- rev * 10 + r
n <- n %/% 10
}
return(rev)
}
reverse_number(134576)
For integers n > 9 this function can be used:
reverse_int <- function(n) {
t1 <- floor(log10(n))
t2 <- 0
for (i in t1:1) t2 <- t2 + floor(n/10^i) * 10^(t1-i)
return(n*10^t1 - 99*t2)
}
reverse_int(678754)
#[1] 457876
Note that the function is not vectorized; it only takes one parameter n as input.
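If a vectorized version is needed, one simple option (added here as a sketch, not the fastest approach) is to wrap it with Vectorize:
reverse_int_vec <- Vectorize(reverse_int)
reverse_int_vec(c(678754, 12345))
# [1] 457876  54321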
An R function to reverse numbers based on integer division by successive powers of 10. This came up in a school project related to palindromic numbers.
Reverse_number <- function(x){
n <- trunc(log10(x)) # how many powers of 10 are we dealing with
x.rem <- x # the remaining numbers to be reversed
x.out <- 0 # stores the output
for(i in n:0){
x.out <- x.out + (x.rem %/% 10^i)*10^(n-i) # multiply and add
x.rem <- x.rem - (x.rem %/% 10^i)*10^i # multiply and subtract
}
return(x.out)
}
Here are some other base R options using utf8ToInt or substring
> n <- 123478634
> as.integer(intToUtf8(rev(utf8ToInt(as.character(n)))))
[1] 436874321
> as.integer(paste0(substring(n, nchar(n):1, nchar(n):1), collapse = ""))
[1] 436874321

My C++ functions with Rcpp::List inputs are very slow

While C++ and specifically the Rcpp package have been tremendously helpful to me in speeding up my code, I noticed that my C++ functions which take a list or data frame input argument (arguments of the form Rcpp::DataFrame and Rcpp::List) are much slower than my other C++ functions. I wrote some sample code and wanted to ask for tricks that can make it faster:
First, let's simulate a List in R that contains two Lists inside of it. Consider myList as a list that includes two lists, meas1 and meas2, which are themselves lists, each including vectors of measurements for subjects. Here is the R code:
lappend <- function(lst, ...){
lst <- c(lst, list(...))
return(lst)
}
nSub <- 30
meas1 <- list()
meas2 <- list()
for (i in 1:nSub){
meas1 <- lappend(meas1, rnorm(10))
meas2 <- lappend(meas2, rnorm(10))
}
myList <- list(meas1 = meas1, meas2 = meas2)
Now, suppose I want a C++ function that, for each subject, finds the sum of meas1 and the sum of meas2 and then creates two new measurements based on these two sums. Finally, the function should return these new measurements as a list.
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
#include <Rcpp.h>
// [[Rcpp::export]]
Rcpp::List mySlowListFn(Rcpp::List myList, int nSub){
arma::vec myMult(nSub);
arma::vec myDiv(nSub);
for (int i = 0; i < nSub; i++){
arma::vec meas1_i = Rcpp::as<arma::vec>(Rcpp::as<Rcpp::List>(myList["meas1"])[i]);
arma::vec meas2_i = Rcpp::as<arma::vec>(Rcpp::as<Rcpp::List>(myList["meas2"])[i]);
myMult[i] = arma::sum(meas1_i)*arma::sum(meas2_i);
myDiv[i] = arma::sum(meas1_i)/arma::sum(meas2_i);
}
return Rcpp::List::create(Rcpp::Named("myMult") = myMult,
Rcpp::Named("myDiv") = myDiv);
}
How can I make the function above faster? I'm particularly looking for ideas that keep the input and output lists in the code (since in my own program dealing with lists is inevitable), but with some tricks to reduce some overhead time. One thing that I thought of was:
Rcpp::List mySlowListFn(const Rcpp::List& myList, int nSub)
Thanks very much for your help.
First, note that the copying semantics for lists have changed in recent versions of R (definitely in the latest R-devel; I'm not sure if it made it into R 3.1.0), whereby shallow copies of lists are made and the elements within are copied later only if they are modified. There is a good chance that if you are running an older version of R, its more expensive list-copying semantics are getting in the way.
That said, here's how I would re-write your function for some extra speed, with a benchmark. sourceCpp it to compare on your own machine.
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
#include <Rcpp.h>
// [[Rcpp::export]]
Rcpp::List mySlowListFn(Rcpp::List myList, int nSub){
arma::vec myMult(nSub);
arma::vec myDiv(nSub);
for (int i = 0; i < nSub; i++){
arma::vec meas1_i = Rcpp::as<arma::vec>(Rcpp::as<Rcpp::List>(myList["meas1"])[i]);
arma::vec meas2_i = Rcpp::as<arma::vec>(Rcpp::as<Rcpp::List>(myList["meas2"])[i]);
myMult[i] = arma::sum(meas1_i)*arma::sum(meas2_i);
myDiv[i] = arma::sum(meas1_i)/arma::sum(meas2_i);
}
return Rcpp::List::create(Rcpp::Named("myMult") = myMult,
Rcpp::Named("myDiv") = myDiv);
}
// [[Rcpp::export]]
Rcpp::List myFasterListFn(Rcpp::List myList, int nSub) {
Rcpp::NumericVector myMult = Rcpp::no_init(nSub);
Rcpp::NumericVector myDiv = Rcpp::no_init(nSub);
Rcpp::List meas1 = myList["meas1"];
Rcpp::List meas2 = myList["meas2"];
for (int i = 0; i < nSub; i++) {
arma::vec meas1_i(
REAL(VECTOR_ELT(meas1, i)), Rf_length(VECTOR_ELT(meas1, i)), false, true
);
arma::vec meas2_i(
REAL(VECTOR_ELT(meas2, i)), Rf_length(VECTOR_ELT(meas2, i)), false, true
);
myMult[i] = arma::sum(meas1_i) * arma::sum(meas2_i);
myDiv[i] = arma::sum(meas1_i) / arma::sum(meas2_i);
}
return Rcpp::List::create(
Rcpp::Named("myMult") = myMult,
Rcpp::Named("myDiv") = myDiv
);
}
/*** R
library(microbenchmark)
lappend <- function(lst, ...){
lst <- c(lst, list(...))
return(lst)
}
nSub <- 30
n <- 10
meas1 <- list()
meas2 <- list()
for (i in 1:nSub){
meas1 <- lappend(meas1, rnorm(n))
meas2 <- lappend(meas2, rnorm(n))
}
myList <- list(meas1 = meas1, meas2 = meas2)
x1 <- mySlowListFn(myList, nSub)
x2 <- myFasterListFn(myList, nSub)
microbenchmark(
mySlowListFn(myList, nSub),
myFasterListFn(myList, nSub)
)
*/
gives me
> library(microbenchmark)
> lappend <- function(lst, ...){
+ lst <- c(lst, list(...))
+ return(lst)
+ }
> nSub <- 30
> n <- 10
> meas1 <- list()
> meas2 <- list()
> for (i in 1:nSub){
+ meas1 <- lappend(meas1, rnorm(n))
+ meas2 <- lappend(meas2, rnorm(n))
+ }
> myList <- list(meas1 = meas1, meas2 = meas2)
> x1 <- mySlowListFn(myList, nSub)
> x2 <- myFasterListFn(myList, nSub)
> microbenchmark(
+ mySlowListFn(myList, nSub),
+ myFasterListFn(myList, nSub)
+ )
Unit: microseconds
expr min lq median uq max neval
mySlowListFn(myList, nSub) 14.772 15.4570 16.0715 16.7520 42.628 100
myFasterListFn(myList, nSub) 4.502 5.0675 5.2470 5.8515 18.561 100
Future versions of Rcpp and Rcpp11 will have the ListOf<T> class which will make it much easier to interact with lists where we know the inner type beforehand, after the proper semantics have been ironed out.
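For a rough idea of what that could look like, here is a sketch written against the ListOf<T> interface; treat the exact class and conversion semantics as an assumption based on that outlook rather than a settled API:
#include <Rcpp.h>
// [[Rcpp::export]]
Rcpp::List myListOfFn(Rcpp::ListOf<Rcpp::NumericVector> meas1,
                      Rcpp::ListOf<Rcpp::NumericVector> meas2) {
  int n = meas1.size();
  Rcpp::NumericVector myMult(n), myDiv(n);
  for (int i = 0; i < n; i++) {
    // each element is already typed as a NumericVector, so no Rcpp::as<> casts are needed
    double s1 = Rcpp::sum(meas1[i]);
    double s2 = Rcpp::sum(meas2[i]);
    myMult[i] = s1 * s2;
    myDiv[i] = s1 / s2;
  }
  return Rcpp::List::create(Rcpp::Named("myMult") = myMult,
                            Rcpp::Named("myDiv") = myDiv);
}
Called from R as myListOfFn(myList$meas1, myList$meas2), this would avoid the repeated Rcpp::as<Rcpp::List>(...) conversions inside the loop.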

Minimum id with non-repetitive elements

I am stuck on a difficult problem in R and am not able to resolve it. The problem goes like this.
x and y are two vectors, as given below:
x<- c(1,2,3,4,5)
y<- c(12,4,2,5,7,18,9,10)
I want to create a new vector p, where length(p) = length(x), in the following manner:
For each id in x, find the id in y whose value has the minimum absolute distance to the corresponding value of x. For instance, for id=1 in x, value_x(id=1)=1, the closest value in y is 2 (min_value_y=2), and id_y(value==2)=3. Thus, the answer for id 1 in x is 3. We therefore create a new vector p, which has the following values: p = (3,3,3,2,4);
Now we have to update p, in the following manner:
As 3 has already been used as the id for id_x=1, it can't be the id for id_x=2. Hence, we discard id_y=3 (value 2) when calculating the next minimum distance for id_x=2. The next best minimum distance for id_x=2 is id_y=2 with value 4. Hence, the updated p is (3,2,3,2,4).
Likewise, 3 can't be the id for id_x=3 either, and id_y=2 is now taken by id_x=2, so the next best minimum distance for id_x=3 is 2 (id_y=4, value 5). Hence, the updated p is (3,2,4,2,4).
As the next values in p, 2 and 4, are again duplicates, we have to repeat what we did in the last two steps. In summary, while calculating the minimum distance between x and y, for each id of x we have to take an id of y that hasn't appeared before. Thus, all elements of p have to be unique.
Any answers would be appreciated.
I tried something like this, though not a complete solution:
minID <- function(x,y) {return(which(abs(x-y)==min(abs(x-y))))};
p1 <- sapply(x,minID,y=y);
# Calculates the list of all minimum elements - nowhere close to the actual solution :(
x and y each contain over 1 million elements, hence a for loop would be extremely slow. I am looking for a faster solution.
This can be implemented efficiently with a binary search tree on the elements of y, deleting elements as they're matched and added to p. I've implemented this using std::set from the C++ STL, with Rcpp to get the code into R:
library(Rcpp)
getVals = cppFunction(
'NumericVector getVals(NumericVector x, NumericVector y) {
NumericVector p(x.size());
std::vector<std::pair<double, int> > init;
for (int j=0; j < y.size(); ++j) {
init.push_back(std::pair<double, int>(y[j], j));
}
std::set<std::pair<double, int> > s(init.begin(), init.end());
for (int i=0; i < x.size(); ++i) {
std::set<std::pair<double, int> >::iterator p1, p2, selected;
p1 = s.lower_bound(std::pair<double, int>(x[i], 0));
p2 = p1;
--p2;
if (p1 == s.end()) {
selected = p2;
} else if (p2 == s.begin()) {
selected = p1;
} else if (fabs(x[i] - p1->first) < fabs(x[i] - p2->first)) {
selected = p1;
} else {
selected = p2;
}
p[i] = selected->second+1; // 1-indexed
s.erase(selected);
}
return p;
}')
Here's a runtime comparison against the pure-R solution that was posted -- the binary search tree solution is much faster and enables solutions with vectors of length 1 million in just a few seconds:
# Pure-R posted solution
getVals2 = function(x, y) {
n <- length(x)
p <- rep(NA, n)
for(i in 1:n) {
id <- which.min(abs(y - x[i]))
y[id] <- Inf
p[i] <- id
}
return(p)
}
# Test with medium-sized vectors
set.seed(144)
x = rnorm(10000)
y = rnorm(20000)
system.time(res1 <- getVals(x, y))
# user system elapsed
# 0.008 0.000 0.008
system.time(res2 <- getVals2(x, y))
# user system elapsed
# 1.284 2.919 4.211
all.equal(res1, res2)
# [1] TRUE
# Test with large vectors
set.seed(144)
x = rnorm(1000000)
y = rnorm(2000000)
system.time(res3 <- getVals(x, y))
# user system elapsed
# 4.402 0.097 4.467
The reason for the speedup is because this approach is asymptotically faster -- if x is of size n and y is of size m, then the binary search tree approach runs in O((n+m)log(m)) time -- O(m log(m)) to construct the BST and O(n log(m)) to compute p -- while the which.min approach runs in O(nm) time.
n <- length(x)
p <- rep(NA, n)
for(i in 1:n) {
id <- which.min(abs(y - x[i]))
y[id] <- Inf
p[i] <- id
}
I have tried to develop code in R and have gotten around a 20x improvement over the for loop. The piece of code goes as follows (tlist and clist play the roles of x and y):
Generalized.getMinId <- function(a,b)
{
sapply(a, FUN = function(x) which.min(abs(x-b)))
}
Generalized.getAbsDiff <- function(a,b)
{
lapply(a, FUN = function(x) abs(x-b))
}
min_id = Generalized.getMinId(tlist,clist);
dup = which(duplicated(min_id));
while(length(dup) > 0)
{
absdiff = Generalized.getAbsDiff(tlist[dup],clist);
infind = lapply(dup, function(x,y)
{l <- head(y,x-1); l[l>0]}, y = min_id);
absdiff = Map(`[<-`, absdiff, infind, Inf);
dupind = sapply(absdiff, which.min);
min_id[dup] = dupind;
dup = which(duplicated(min_id));
}
In case someone can make an improvement over this piece of code, it would be awesome.
