Rcpp sample performance when using probs - r

I'm working on an Rcpp function that involves sampling a large vector with unequal probability in a loop. I noticed that the code runs much slower when I provide the probs argument to the Rcpp sample function. Why is that?
Here are simplified functions for comparing performance with and without specifying probs to sample(). I know the loops are not needed in this simple example; the question is why one loop is so much slower than the other.
library(Rcpp)
# Rcpp function for sampling without probabilities
cppFunction('IntegerVector samp_no_probs(IntegerVector x) {
  IntegerVector res(x.size());
  for (int i = 0; i < x.size(); i++) {
    res[i] = sample(x, 1, true)[0];
  }
  return res;
}')
# Rcpp function for sampling with probabilities
cppFunction('IntegerVector samp_with_probs(IntegerVector x, NumericVector probs) {
  IntegerVector res(x.size());
  for (int i = 0; i < x.size(); i++) {
    res[i] = sample(x, 1, true, probs)[0];
  }
  return res;
}')
# benchmarking
library(rbenchmark)
n <- 10000
x <- sample(1:n)
probs <- sample(runif(1:100), n, replace = T)
res <- benchmark(with_probs = samp_with_probs(x = x, probs = probs),
                 no_probs = samp_no_probs(x = x),
                 replications = 10)
And here is the result:
        test replications elapsed relative
2   no_probs           10       0       NA
1 with_probs           10      17       NA

Related

Fast random sampling from matrix of cumulative probability mass functions in R

I have a matrix (mat_cdf) representing the cumulative probability that an individual in census tract i moves to census tract j on a given day. Given a vector of agents who decide not to "stay home", I have a function, GetCTMove below, that randomly samples from this matrix to determine which census tract they will spend time in.
# Random generation
cts <- 500
i <- rgamma(cts, 50, 1)
prop <- 1:cts
# Matrix where rows correspond to probability mass of column integer
mat <- do.call(rbind, lapply(i, function(i){dpois(prop, i)}))
# Convert to cumulative probability mass
mat_cdf <- matrix(NA, cts, cts)
for(i in 1:cts){
  # Create cdf for row i
  mat_cdf[i,] <- sapply(1:cts, function(j) sum(mat[i,1:j]))
}
GetCTMove <- function(agent_cts, ct_mat_cdf){
  # Expand such that every agent has its own row corresponding to CDF of movement from their home ct i to j
  mat_expand <- ct_mat_cdf[agent_cts,]
  # Probabilistically sample column index for every row by generating random number and then determining corresponding closest column
  s <- runif(length(agent_cts))
  fin_col <- max.col(s < mat_expand, "first")
  return(fin_col)
}
# Sample of 500,000 agents' residence ct
agents <- sample(1:cts, size = 500000, replace = T)
# Run function
system.time(GetCTMove(agents, mat_cdf))
user system elapsed
3.09 1.19 4.30
Working with 1 million agents, each sample takes ~10 seconds to run; multiplied by many time steps, that leads to hours for each simulation, and this function is by far the rate-limiting factor of the model. I'm wondering if anyone has advice on a faster implementation of this kind of random sampling. I've used the dqrng package to speed up random number generation, but that's relatively minuscule in comparison to the matrix expansion (mat_expand) and max.col calls, which take longest to run.
The first thing that you can optimise is the following code:
max.col(s < mat_expand, "first")
Since s < mat_expand returns a logical matrix, applying the max.col function is the same as getting the first TRUE in each row. In this case, using which will be much more efficient. Also, as shown below, you store all your CDFs in a matrix.
mat <- do.call(rbind, lapply(i, function(i){dpois(prop, i)}))
mat_cdf <- matrix(NA, cts, cts)
for(i in 1:cts){
  mat_cdf[i,] <- sapply(1:cts, function(j) sum(mat[i,1:j]))
}
This structure may not be optimal. A list structure is better for applying functions like which. It is also faster to build, since you do not have to go through do.call(rbind, ...).
# using a list structure to speed up the creation of cdfs
ls_cdf <- lapply(i, function(x) cumsum(dpois(prop, x)))
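As a quick sanity check of the max.col/which equivalence mentioned above, here is a single-row comparison (a small illustrative snippet, not part of the original answer; it assumes mat_cdf and ls_cdf as built above):
s1 <- runif(1)
# index of the first column whose CDF value exceeds s1, via max.col on a one-row matrix
max.col(matrix(s1 < mat_cdf[1, ], nrow = 1), "first")
# the same index, found directly with which()
which(s1 < ls_cdf[[1]])[1]
Both calls should return the same column index for any draw s1.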
Below is your implementation:
# Implementation 1
GetCTMove <- function(agent_cts, ct_mat_cdf){
  mat_expand <- ct_mat_cdf[agent_cts,]
  s <- runif(length(agent_cts))
  fin_col <- max.col(s < mat_expand, "first")
  return(fin_col)
}
On my desktop, it takes about 2.68s to run.
> system.time(GetCTMove(agents, mat_cdf))
user system elapsed
2.25 0.41 2.68
With a list structure and a which function, the run time can be reduced by about 1s.
# Implementation 2
GetCTMove2 <- function(agent_cts, ls_cdf){
  n <- length(agent_cts)
  s <- runif(n)
  out <- integer(n)
  i <- 1L
  while (i <= n) {
    out[[i]] <- which(s[[i]] < ls_cdf[[agent_cts[[i]]]])[[1L]]
    i <- i + 1L
  }
  out
}
> system.time(GetCTMove2(agents, ls_cdf))
user system elapsed
1.59 0.02 1.64
To my knowledge, there is no other way to speed the code up much further in pure R. However, you can indeed improve the performance by re-writing the key function GetCTMove in C++. With the Rcpp package, you can do something like the following:
# Implementation 3
Rcpp::cppFunction('NumericVector fast_GetCTMove(NumericVector agents, NumericVector s, List cdfs) {
  int n = agents.size();
  NumericVector out(n);
  for (int i = 0; i < n; ++i) {
    NumericVector cdf = as<NumericVector>(cdfs[agents[i] - 1]);
    int m = cdf.size();
    for (int j = 0; j < m; ++j) {
      if (s[i] < cdf[j]) {
        out[i] = j + 1;
        break;
      }
    }
  }
  return out;
}')
GetCTMove3 <- function(agent_cts, ls_cdf){
  s <- runif(length(agent_cts))
  fast_GetCTMove(agent_cts, s, ls_cdf)
}
This implementation is lightning fast, which should fulfil all your needs.
> system.time(GetCTMove3(agents, ls_cdf))
user system elapsed
0.07 0.00 0.06
The full script is attached as follows:
# Random generation
cts <- 500
i <- rgamma(cts, 50, 1)
prop <- 1:cts
agents <- sample(1:cts, size = 500000, replace = T)
# using a list structure to speed up the creation of cdfs
ls_cdf <- lapply(i, function(x) cumsum(dpois(prop, x)))
# below is your code
mat <- do.call(rbind, lapply(i, function(i){dpois(prop, i)}))
mat_cdf <- matrix(NA, cts, cts)
for(i in 1:cts){
  mat_cdf[i,] <- sapply(1:cts, function(j) sum(mat[i,1:j]))
}
# Implementation 1
GetCTMove <- function(agent_cts, ct_mat_cdf){
  mat_expand <- ct_mat_cdf[agent_cts,]
  s <- runif(length(agent_cts))
  fin_col <- max.col(s < mat_expand, "first")
  return(fin_col)
}
# Implementation 2
GetCTMove2 <- function(agent_cts, ls_cdf){
  n <- length(agent_cts)
  s <- runif(n)
  out <- integer(n)
  i <- 1L
  while (i <= n) {
    out[[i]] <- which(s[[i]] < ls_cdf[[agent_cts[[i]]]])[[1L]]
    i <- i + 1L
  }
  out
}
# Implementation 3
Rcpp::cppFunction('NumericVector fast_GetCTMove(NumericVector agents, NumericVector s, List cdfs) {
  int n = agents.size();
  NumericVector out(n);
  for (int i = 0; i < n; ++i) {
    NumericVector cdf = as<NumericVector>(cdfs[agents[i] - 1]);
    int m = cdf.size();
    for (int j = 0; j < m; ++j) {
      if (s[i] < cdf[j]) {
        out[i] = j + 1;
        break;
      }
    }
  }
  return out;
}')
GetCTMove3 <- function(agent_cts, ls_cdf){
  s <- runif(length(agent_cts))
  fast_GetCTMove(agent_cts, s, ls_cdf)
}
system.time(GetCTMove(agents, mat_cdf))
system.time(GetCTMove2(agents, ls_cdf))
system.time(GetCTMove3(agents, ls_cdf))

How to improve processing time for euclidean distance calculation

I'm trying to calculate the weighted euclidean distance (squared) between two data frames that have the same number of columns (variables) and a different number of rows (observations).
The calculation follows the formula:
DIST[m,i] <- sum(((DATA1[m,] - DATA2[i,]) ^ 2) * lambda[1,])
I specifically need to multiply each term of the summation by a specific weight (lambda).
The code provided below runs correctly, but if I use it in hundreds of iterations it takes a lot of processing time. Yesterday it took me 18 hours to create a graphic using multiple iterations of a function that contains this calculation. Using library(profvis); profvis({ my code }) I saw that this specific part of the code is taking up about 80% of the processing time.
I read a lot about how to reduce processing time using parallel and vectorized operations, but I don't know how to implement them in this particular case, because of the weight lambda.
Can someone help me reduce the processing time of this code?
More information about the code and the structure of the data can be found as comments in the code provided below.
# Data frames used to calculate the euclidean distances between each observation
# from DATA1 and each observation from DATA2.
# The euclidean distance is between a [600x50] and a [8X50] dataframes, resulting
# in a [600X8] dataframe.
DATA1 <- matrix(rexp(30000, rate=.1), ncol=50) #[600x50]
DATA2 <- matrix(rexp(400, rate=.1), ncol=50) #[8X50]
# Weights used for each of the 50 variables to calculate the weighted
# euclidean distance.
# Can be a vector of different weights or a scalar of the same weight
# for all variables.
lambda <- runif(n=50, min=0, max=10) ## length(lambda) > 1
# lambda=1 ## length(lambda) == 1
if (length(lambda) > 1) {
  as.numeric(unlist(lambda))
  lambda <- as.matrix(lambda)
  lambda <- t(lambda)
}
nrows1 <- nrow(DATA1)
nrows2 <- nrow(DATA2)
# Euclidean Distance calculation
DIST <- matrix(NA, nrow=nrows1, ncol=nrows2)
for (m in 1:nrows1) {
  for (i in 1:nrows2) {
    if (length(lambda) == 1) {
      DIST[m, i] <- sum((DATA1[m, ] - DATA2[i, ])^2)
    }
    if (length(lambda) > 1){
      DIST[m, i] <- sum(((DATA1[m, ] - DATA2[i, ])^2) * lambda[1, ])
    }
    next
  }
  next
}
After all the suggestions, combining the answers from @MDWITT (for length(lambda) > 1) and @F. Privé (for length(lambda) == 1), the final solution took only one minute to run, whilst the original one took me an hour and a half, inside a bigger piece of code that contains this calculation. The final code for this problem, for those interested, is:
#Data frames used to calculate the euclidean distances between each observation from DATA1 and each observation from DATA2.
#The euclidean distance is between a [600x50] and a [8X50] dataframes, resulting in a [600X8] dataframe.
DATA1 <- matrix(rexp(30000, rate=.1), ncol=50) #[600x50]
DATA2 <- matrix(rexp(400, rate=.1), ncol=50) #[8X50]
#Weights used for each of the 50 variables to calculate the weighted euclidean distance.
#Can be a vector of different weights or a scalar of the same weight for all variables.
#lambda <- runif(n = 50, min = 0, max = 10) ##length(lambda) > 1
lambda = 1 ##length(lambda) == 1
nrows1 <- nrow(DATA1)
nrows2 <- nrow(DATA2)
#Euclidean Distance calculation
DIST <- matrix(NA, nrow = nrows1, ncol = nrows2)
if (length(lambda) > 1){
  as.numeric(unlist(lambda))
  lambda <- as.matrix(lambda)
  lambda <- t(lambda)
  library(Rcpp)
  cppFunction('NumericMatrix weighted_distance (NumericMatrix x, NumericMatrix y, NumericVector lambda){
    int n_x = x.nrow();
    int n_y = y.nrow();
    NumericMatrix DIST(n_x, n_y);
    //begin the loop
    for (int i = 0 ; i < n_x; i++){
      for (int j = 0 ; j < n_y ; j++) {
        double d = sum(pow(x.row(i) - y.row(j), 2) * lambda);
        DIST(i,j) = d;
      }
    }
    return (DIST);
  }')
  DIST <- weighted_distance(DATA1, DATA2, lambda = lambda)
}
if (length(lambda) == 1) {
  DIST <- outer(rowSums(DATA1^2), rowSums(DATA2^2), '+') - tcrossprod(DATA1, 2 * DATA2)
}
Rewrite the problem to use linear algebra and vectorization, which is much faster than loops.
If you don't have lambda, this is just
outer(rowSums(DATA1^2), rowSums(DATA2^2), '+') - tcrossprod(DATA1, 2 * DATA2)
With lambda, it becomes
outer(drop(DATA1^2 %*% lambda), drop(DATA2^2 %*% lambda), '+') -
tcrossprod(DATA1, sweep(DATA2, 2, 2 * lambda, '*'))
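As a quick check of the weighted formula (a small illustrative snippet, not part of the original answer; it assumes DATA1, DATA2 and lambda as generated in the question, with lambda kept as a plain vector of length 50):
DIST_vec <- outer(drop(DATA1^2 %*% lambda), drop(DATA2^2 %*% lambda), '+') -
  tcrossprod(DATA1, sweep(DATA2, 2, 2 * lambda, '*'))
DIST_loop <- matrix(NA, nrow(DATA1), nrow(DATA2))
for (m in 1:nrow(DATA1)) {
  for (i in 1:nrow(DATA2)) {
    DIST_loop[m, i] <- sum(((DATA1[m, ] - DATA2[i, ])^2) * lambda)
  }
}
all.equal(DIST_vec, DIST_loop)  # should be TRUE up to floating-point error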
Here is an alternate way using Rcpp, just to have this concept documented. In a file called euclidean.cpp I have:
#include <Rcpp.h>
#include <cmath>
using namespace Rcpp;
// [[Rcpp::export]]
NumericMatrix weighted_distance (NumericMatrix x, NumericMatrix y, NumericVector lambda){
  int n_x = x.nrow();
  int n_y = y.nrow();
  NumericMatrix out(n_x, n_y);
  //begin the loop
  for (int i = 0 ; i < n_x; i++){
    for (int j = 0 ; j < n_y ; j++) {
      double d = sum(pow(x.row(i) - y.row(j), 2) * lambda);
      out(i,j) = d;
    }
  }
  return (out);
}
In R, then I have
library(Rcpp)
sourceCpp("libs/euclidean.cpp")
# Generate Data
DATA1 <- matrix(rexp(30000, rate=.1), ncol=50) #[600x50]
DATA2 <- matrix(rexp(400, rate=.1), ncol=50) #[8X50]
lambda <- runif(n=50, min=0, max=10)
# Run the program
out <- weighted_distance(DATA1, DATA2, lambda = lambda)
When I test the speed using:
microbenchmark(
  Rcpp_way = weighted_distance(DATA1, DATA2, lambda = lambda),
  other = {
    DIST <- matrix(NA, nrow = nrows1, ncol = nrows2)
    for (m in 1:nrows1) {
      for (i in 1:nrows2) {
        if (length(lambda) == 1) {
          DIST[m, i] <- sum((DATA1[m, ] - DATA2[i, ])^2)
        }
        if (length(lambda) > 1){
          DIST[m, i] <- sum(((DATA1[m, ] - DATA2[i, ])^2) * lambda)
        }
        next
      }
      next
    }
  }, times = 100)
You can see that it is a good clip faster:
Unit: microseconds
expr min lq mean median uq max neval
Rcpp_way 446.769 492.308 656.9849 562.667 846.9745 1169.231 100
other 24688.821 30681.641 44153.5264 37511.385 50878.3585 200843.898 100

How to produce every permutation of 20 minus ones (-1) in a 1-by-41 vector of ones, and do a simple calculation on each row?

I want to produce all permutations of 20 minus ones (-1) and 21 ones (1). This matrix has 269128937220 rows and 41 columns, and I want to do the following calculation on each row of this matrix:
(SLS')/4
where:
S is each row of this matrix (a 1-by-41 array).
S' is the transpose of S (a 41-by-1 array).
L is a 41-by-41 matrix.
The final result of each calculation is a single number.
Is there any way to produce this matrix and do the calculation without running out of memory, and in a reasonable time?
Thanks in advance.
First off, you are probably better off rethinking your approach. With that said, let's get started attacking your problem.
This is a very difficult problem mainly due to the limitations of resources. Below, I have a solution that will complete in a reasonable amount of time on a home computer given that you have access to a decent amount of storage (at least 7 TB). The algorithm below does not require that much memory and can be tuned to reduce memory usage.
Before we begin, we note that merely generating that many permutations seems impossible at first. However with the help of highly optimized C++ code and parallel computing, the task is brought back into the realm of possibility. This was demonstrated in my answer to the OP's previous question. We utilized RcppAlgos (I am the author) and the parallel package to generate about 36 million permutations per second in chunks of one million using 8 cores.
Now, we are charged with carrying out specific computations on each permutation as fast as possible. The computation is as follows:
(SLS') / 4, where S is a permutation, L is a 41 x 41 matrix
Here are a couple of base R approaches (N.B. m1[x, ] %*% m2 %*% m1[x, ] is the same as m1[x, ] %*% m2 %*% as.matrix(m1[x, ], ncol = 1)):
baseTest1 <- function(m1, m2) {
  vapply(1:nrow(m1), function(x) {
    m1[x, ] %*% m2 %*% m1[x, ]
  }, FUN.VALUE = 1.1111, USE.NAMES = FALSE) / 4
}
baseTest2 <- function(m1, m2) {
  temp <- m1 %*% m2
  vapply(1:nrow(m1), function(x) {
    crossprod(temp[x, ], m1[x, ])
  }, FUN.VALUE = 1.1111, USE.NAMES = FALSE) / 4
}
Let's think about this a little bit. We have a bunch of permutations of the numbers one and negative one. When we multiply these permutations by a matrix of real numbers, say M, we end up simply adding and subtracting values of M. I bet we can speed this up quite a bit using Rcpp and avoid wasteful (and useless) identity multiplications (i.e. multiplications by 1).
#include <Rcpp.h>
//[[Rcpp::export]]
Rcpp::NumericVector makeVecCpp(Rcpp::NumericMatrix A,
                               Rcpp::NumericMatrix B,
                               unsigned long int mySize) {
  Rcpp::NumericVector result = Rcpp::no_init_vector(mySize);
  double temp = 0;
  for (std::size_t i = 0; i < mySize; ++i) {
    for (std::size_t j = 0; j < 41u; ++j) {
      for (std::size_t k = 0; k < 41u; ++k) {
        // A(i, j) + A(i, k) is non-zero only when both entries have the same sign,
        // i.e. their product is +1, so we add B(j, k); otherwise the product is -1 and we subtract
        if (A(i, j) + A(i, k)) {
          temp += B(j, k);
        } else {
          temp -= B(j, k);
        }
      }
    }
    result[i] = temp / 4;
    temp = 0;
  }
  return result;
}
Now let's see if they give the same results and also benchmark them:
options(scipen = 999)
library(RcppAlgos)
library(microbenchmark)
set.seed(42)
M <- matrix(rnorm(41*41), nrow = 41, ncol = 41)
negOne <- permuteGeneral(c(1L, -1L), freqs = c(21, 20), upper = 100000)
all.equal(baseTest1(negOne, M), baseTest2(negOne, M))
# [1] TRUE
all.equal(baseTest1(negOne, M), makeVecCpp(negOne, M, 100000))
# [1] TRUE
microbenchmark(base1 = baseTest1(negOne, M), base2 = baseTest2(negOne, M),
myRcpp = makeVecCpp(negOne, M, 100000), times = 25)
Unit: milliseconds
expr min lq mean median uq max neval
base1 555.0256 582.2273 597.6447 593.7708 599.1380 690.3882 25
base2 471.0251 494.2367 541.2632 531.1858 586.6774 632.7279 25
myRcpp 202.7637 207.2463 210.0255 209.0399 209.9648 240.6664 25
Our Rcpp implementation is the clear winner!! Moving on, we incorporate this into our final answer:
## WARNING Don't run this unless you have a few DAYS on your hand
library(parallel)
## break up into even intervals of one hundred thousand
firstPart <- mclapply(seq(1, 269128900000, 100000), function(x) {
  negOne <- permuteGeneral(c(1L, -1L), freqs = c(21, 20),
                           lower = x, upper = x + 99999)
  vals <- makeVecCpp(negOne, M, 100000)
  write.csv(vals, paste0("myFile", x, ".csv", collapse = ""))
  x
}, mc.cores = 8)
## get the last few results and complete analysis
lastPart <- permuteGeneral(c(1L, -1L), freqs = c(21, 20),
                           lower = 269128900001, upper = 269128937220)
vals <- makeVecCpp(lastPart, M, 37220)
write.csv(vals, paste0("myFile", 269128900001, ".csv", collapse = ""))
You will note that we avoid storing everything in memory by writing every one hundred thousand results to main storage hence the need for a huge hard drive. When I tested this, each file was about 2.5 Mb which would total to about 6.5 TB:
a <- 2.5 * (2^20) ### convert to bytes
a * (269128937220 / 1e5) / 2^40 ## get terabytes
[1] 6.416534
To give you an idea of how long this computation will take, here is a timing for the first one hundred million results:
system.time(firstPart <- mclapply(seq(1, 100000000, 100000), function(x) {
  negOne <- permuteGeneral(c(1L, -1L), freqs = c(21, 20),
                           lower = x, upper = x + 99999)
  vals <- makeVecCpp(negOne, M, 100000)
  write.csv(vals, paste0("myFile", x, ".csv", collapse = ""))
  x
}, mc.cores = 8))
user system elapsed
529.931 9.557 80.690
80 seconds ain't that bad! That means we will only have to wait around for about 2.5 days!!!!!:
(269128937220 / 100000000 / 60 / 60 / 24) * 80
[1] 2.491935
If you really want to reduce this time, you will have to utilize a high performance computing service.
All results were obtained on a MacBook Pro with a 2.8GHz quad-core processor (8 logical cores in total).
First note that the result you expect is a numerical vector with more than 269 billion elements. You will need 8 bytes per element, i.e. more than 2TB RAM just to store the result. If you don't have that much, it's hopeless to do what you ask for. Note also that you will need a long vector to store the result.
If you do have this amount of RAM, here is a solution based on combn with its FUN argument. This should be fairly optimal for the memory use. If you want to make it faster, try to implement compute_one directly with Rcpp.
k = 15 # should be 20
n = 2*k+1
L = matrix(runif(n*n), ncol=n)
compute_one = function(indices) {
  s = rep.int(1, n)
  s[indices] = -1
  drop(t(s) %*% L %*% s / 4)
}
res = combn(n, k, compute_one)
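If you do want to push compute_one into Rcpp, here is a minimal sketch of what that might look like (hypothetical code, not part of the original answer; compute_one_cpp is a made-up name and the snippet assumes the same n, k and L as above):
library(Rcpp)
cppFunction('
double compute_one_cpp(IntegerVector indices, NumericMatrix L) {
  int n = L.nrow();
  // start from a vector of ones and flip the chosen positions to -1
  NumericVector s(n, 1.0);
  for (int i = 0; i < indices.size(); ++i) s[indices[i] - 1] = -1.0;
  // evaluate S L S-transpose, divided by 4, directly
  double total = 0.0;
  for (int j = 0; j < n; ++j)
    for (int k = 0; k < n; ++k)
      total += s[j] * L(j, k) * s[k];
  return total / 4.0;
}')
res_cpp <- combn(n, k, function(idx) compute_one_cpp(idx, L))  # drop-in replacement for compute_one
# (with a small k, all.equal(res, res_cpp) should be TRUE)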

My C++ functions with Rcpp::List inputs are very slow

While C++ and specifically the Rcpp package have been tremendously helpful to me in speeding up my code, I noticed that my C++ functions which have a list or data frame input argument (arguments of the form Rcpp::DataFrame and Rcpp::List) are much slower than my other C++ functions. I wrote some sample code and I wanted to ask for tricks that can make it faster:
First, let's simulate a List in R that contains two Lists inside of it. Consider myList as a list that includes two lists, measure1 and measure2. measure1 and measure2 are lists themselves, each containing vectors of measurements for subjects. Here is the R code:
lappend <- function(lst, ...){
  lst <- c(lst, list(...))
  return(lst)
}
nSub <- 30
meas1 <- list()
meas2 <- list()
for (i in 1:nSub){
  meas1 <- lappend(meas1, rnorm(10))
  meas2 <- lappend(meas2, rnorm(10))
}
myList <- list(meas1 = meas1, meas2 = meas2)
Now, suppose I want a C++ function that, for each subject, finds the sum of measure1 and the sum of measure2 and then creates two new measurements based on these two sums. Finally, the function should return these new measurements as a list.
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
#include <Rcpp.h>
// [[Rcpp::export]]
Rcpp::List mySlowListFn(Rcpp::List myList, int nSub){
  arma::vec myMult(nSub);
  arma::vec myDiv(nSub);
  for (int i = 0; i < nSub; i++){
    arma::vec meas1_i = Rcpp::as<arma::vec>(Rcpp::as<Rcpp::List>(myList["meas1"])[i]);
    arma::vec meas2_i = Rcpp::as<arma::vec>(Rcpp::as<Rcpp::List>(myList["meas2"])[i]);
    myMult[i] = arma::sum(meas1_i)*arma::sum(meas2_i);
    myDiv[i] = arma::sum(meas1_i)/arma::sum(meas2_i);
  }
  return Rcpp::List::create(Rcpp::Named("myMult") = myMult,
                            Rcpp::Named("myDiv") = myDiv);
}
How can I make the function above faster? I'm particularly looking for ideas that keep the input and output lists in the code (since in my own program dealing with lists is inevitable), but with some tricks to reduce the overhead. One thing that I thought of was:
Rcpp::List mySlowListFn(const Rcpp::List& myList, int nSub)
Thanks very much for your help.
First, note that copying semantics for lists have changed in recent versions of R (definitely in latest R-devel, not sure if it made it into R 3.1.0), whereby shallow copies of lists are made, and elements within are later copied if they are modified. There is a big chance that if you are running an older version of R, then its more expensive list copying semantics are getting in the way.
That said, here's how I would re-write your function for some extra speed, with a benchmark. sourceCpp it to compare on your own machine.
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
#include <Rcpp.h>
// [[Rcpp::export]]
Rcpp::List mySlowListFn(Rcpp::List myList, int nSub){
  arma::vec myMult(nSub);
  arma::vec myDiv(nSub);
  for (int i = 0; i < nSub; i++){
    arma::vec meas1_i = Rcpp::as<arma::vec>(Rcpp::as<Rcpp::List>(myList["meas1"])[i]);
    arma::vec meas2_i = Rcpp::as<arma::vec>(Rcpp::as<Rcpp::List>(myList["meas2"])[i]);
    myMult[i] = arma::sum(meas1_i)*arma::sum(meas2_i);
    myDiv[i] = arma::sum(meas1_i)/arma::sum(meas2_i);
  }
  return Rcpp::List::create(Rcpp::Named("myMult") = myMult,
                            Rcpp::Named("myDiv") = myDiv);
}
// [[Rcpp::export]]
Rcpp::List myFasterListFn(Rcpp::List myList, int nSub) {
  Rcpp::NumericVector myMult = Rcpp::no_init(nSub);
  Rcpp::NumericVector myDiv = Rcpp::no_init(nSub);
  Rcpp::List meas1 = myList["meas1"];
  Rcpp::List meas2 = myList["meas2"];
  for (int i = 0; i < nSub; i++) {
    arma::vec meas1_i(
      REAL(VECTOR_ELT(meas1, i)), Rf_length(VECTOR_ELT(meas1, i)), false, true
    );
    arma::vec meas2_i(
      REAL(VECTOR_ELT(meas2, i)), Rf_length(VECTOR_ELT(meas2, i)), false, true
    );
    myMult[i] = arma::sum(meas1_i) * arma::sum(meas2_i);
    myDiv[i] = arma::sum(meas1_i) / arma::sum(meas2_i);
  }
  return Rcpp::List::create(
    Rcpp::Named("myMult") = myMult,
    Rcpp::Named("myDiv") = myDiv
  );
}
/*** R
library(microbenchmark)
lappend <- function(lst, ...){
  lst <- c(lst, list(...))
  return(lst)
}
nSub <- 30
n <- 10
meas1 <- list()
meas2 <- list()
for (i in 1:nSub){
  meas1 <- lappend(meas1, rnorm(n))
  meas2 <- lappend(meas2, rnorm(n))
}
myList <- list(meas1 = meas1, meas2 = meas2)
x1 <- mySlowListFn(myList, nSub)
x2 <- myFasterListFn(myList, nSub)
microbenchmark(
  mySlowListFn(myList, nSub),
  myFasterListFn(myList, nSub)
)
*/
gives me
> library(microbenchmark)
> lappend <- function(lst, ...){
+ lst <- c(lst, list(...))
+ return(lst)
+ }
> nSub <- 30
> n <- 10
> meas1 <- list()
> meas2 <- list()
> for (i in 1:nSub){
+ meas1 <- lappend(meas1, rnorm(n))
+ meas2 <- lappend(meas2, rnorm(n))
+ }
> myList <- list(meas1 = meas1, meas2 = meas2)
> x1 <- mySlowListFn(myList, nSub)
> x2 <- myFasterListFn(myList, nSub)
> microbenchmark(
+ mySlowListFn(myList, nSub),
+ myFasterListFn(myList, nSub)
+ )
Unit: microseconds
expr min lq median uq max neval
mySlowListFn(myList, nSub) 14.772 15.4570 16.0715 16.7520 42.628 100
myFasterListFn(myList, nSub) 4.502 5.0675 5.2470 5.8515 18.561 100
Future versions of Rcpp and Rcpp11 will have the ListOf<T> class which will make it much easier to interact with lists where we know the inner type beforehand, after the proper semantics have been ironed out.
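For what it's worth, here is a rough sketch of how ListOf<NumericVector> can be used once it is available in your Rcpp version (illustrative only, not from the original answer; listOfFn is a made-up name and the code assumes the same myList and nSub as above):
Rcpp::cppFunction('
Rcpp::List listOfFn(Rcpp::ListOf<Rcpp::NumericVector> meas1,
                    Rcpp::ListOf<Rcpp::NumericVector> meas2,
                    int nSub) {
  Rcpp::NumericVector myMult(nSub), myDiv(nSub);
  for (int i = 0; i < nSub; ++i) {
    // elements are known to be numeric vectors, so no per-element as<>() casts are needed
    Rcpp::NumericVector m1 = meas1[i];
    Rcpp::NumericVector m2 = meas2[i];
    double s1 = Rcpp::sum(m1), s2 = Rcpp::sum(m2);
    myMult[i] = s1 * s2;
    myDiv[i]  = s1 / s2;
  }
  return Rcpp::List::create(Rcpp::Named("myMult") = myMult,
                            Rcpp::Named("myDiv") = myDiv);
}')
x3 <- listOfFn(myList$meas1, myList$meas2, nSub)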

Greedy optimization in R

I am trying to replicate Caruana et al.'s method for Ensemble selection from libraries of models (pdf). At the core of the method is a greedy algorithm for adding models to the ensemble (models can be added more than once). I've written an implementation for this greedy optimization algorithm, but it is very slow:
library(compiler)
set.seed(42)
X <- matrix(runif(100000*10), ncol=10)
Y <- rnorm(100000)
greedOpt <- cmpfun(function(X, Y, iter=100){
  weights <- rep(0, ncol(X))
  while(sum(weights) < iter) {
    errors <- sapply(1:ncol(X), function(y){
      newweights <- weights
      newweights[y] <- newweights[y] + 1
      pred <- X %*% (newweights)/sum(newweights)
      error <- Y - pred
      sqrt(mean(error^2))
    })
    update <- which.min(errors)
    weights[update] <- weights[update] + 1
  }
  return(weights/sum(weights))
})
system.time(a <- greedOpt(X,Y))
I know R doesn't do loops well, but I can't think of any way to do this type of stepwise search without a loop.
Any suggestions for improving this function?
Here is an R implementation that is about 30% faster than yours. It is not as fast as your Rcpp version, but maybe it will give you ideas that, combined with Rcpp, will speed things up further. The two main improvements are:
the sapply loop has been replaced by a matrix formulation
the matrix multiplication has been replaced by a recursion
greedOpt <- cmpfun(function(X, Y, iter = 100L){
  N <- ncol(X)
  weights <- rep(0L, N)
  pred <- 0 * X
  sum.weights <- 0L
  while(sum.weights < iter) {
    sum.weights <- sum.weights + 1L
    pred <- (pred + X) * (1L / sum.weights)
    errors <- sqrt(colSums((pred - Y) ^ 2L))
    best <- which.min(errors)
    weights[best] <- weights[best] + 1L
    pred <- pred[, best] * sum.weights
  }
  return(weights / sum.weights)
})
Also, I maintain that you should try upgrading to the ATLAS library. You might see significant improvements.
I took a shot at writing an Rcpp version of this function:
library(Rcpp)
cppFunction('
NumericVector greedOptC(NumericMatrix X, NumericVector Y, int iter) {
  int nrow = X.nrow(), ncol = X.ncol();
  NumericVector weights(ncol);
  NumericVector newweights(ncol);
  NumericVector errors(nrow);
  double RMSE;
  double bestRMSE;
  int bestCol;
  for (int i = 0; i < iter; i++) {
    bestRMSE = -1;
    bestCol = 1;
    for (int j = 0; j < ncol; j++) {
      newweights = weights + 0;
      newweights[j] = newweights[j] + 1;
      newweights = newweights/sum(newweights);
      NumericVector pred(nrow);
      for (int k = 0; k < ncol; k++){
        pred = pred + newweights[k] * X(_, k);
      }
      errors = Y - pred;
      RMSE = sqrt(mean(errors*errors));
      if (RMSE < bestRMSE || bestRMSE == -1){
        bestRMSE = RMSE;
        bestCol = j;
      }
    }
    weights[bestCol] = weights[bestCol] + 1;
  }
  weights = weights/sum(weights);
  return weights;
}
')
It's more than twice as fast as the R version:
set.seed(42)
X <- matrix(runif(100000*10), ncol=10)
Y <- rnorm(100000)
> system.time(a <- greedOpt(X, Y, 1000))
user system elapsed
36.19 6.10 42.40
> system.time(b <- greedOptC(X, Y, 1000))
user system elapsed
16.50 1.44 18.04
> all.equal(a,b)
[1] TRUE
Not bad, but I was hoping for a bigger speedup when making the leap from R to Rcpp. This is one of the first Rcpp functions I've ever written, so perhaps further optimization is possible.
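One further squeeze may be possible by carrying the recursion idea from the R answer above into C++: keep a running, unnormalised prediction and only touch one column per candidate instead of rebuilding the full prediction each time. Below is a hedged sketch of that idea, not tested against the timings above; greedOptC2 is a made-up name and the logic mirrors greedOpt:
library(Rcpp)
cppFunction('
NumericVector greedOptC2(NumericMatrix X, NumericVector Y, int iter) {
  int nrow = X.nrow(), ncol = X.ncol();
  NumericVector weights(ncol);
  NumericVector base(nrow);  // running unnormalised sum of the chosen columns
  for (int i = 0; i < iter; i++) {
    double bestRMSE = -1;
    int bestCol = 0;
    for (int j = 0; j < ncol; j++) {
      double sse = 0;
      for (int k = 0; k < nrow; k++) {
        // candidate prediction = (current column sum + column j) / (picks so far + 1)
        double err = Y[k] - (base[k] + X(k, j)) / (i + 1);
        sse += err * err;
      }
      double RMSE = sqrt(sse / nrow);
      if (RMSE < bestRMSE || bestRMSE == -1) {
        bestRMSE = RMSE;
        bestCol = j;
      }
    }
    weights[bestCol] = weights[bestCol] + 1;
    base = base + X(_, bestCol);
  }
  weights = weights / sum(weights);
  return weights;
}
')
# system.time(d <- greedOptC2(X, Y, 1000)); all.equal(a, d)  # should match greedOpt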
