R: Average sequential values when their difference is below some threshold

I'd like to take an increasing sequence of numbers (e.g. a series of times),
set.seed(41); d <- seq(1:100) + runif(100, 0, 1)
and, whenever the difference between two sequential numbers is below a threshold, merge them into a single point by taking the mean of the two, then continue through the sequence until the next merge is necessary. I resorted to functions I usually avoid, while and ifelse, to write a quick-and-dirty function; it works, but it isn't fast. Can you solve this task 1) more efficiently and 2) without invoking a for or while loop? Is there some built-in function, perhaps with even more functionality, that is well suited to such a task?
combine_points <- function(x, th = 0.5) {
  i <- 1                       # start i at 1
  while (min(diff(x)) < th) {  # keep looping until no gap is below the threshold
    ifelse(x[i + 1] - x[i] < th,                         # logical condition
           x[i] <- x[i + 1] <- mean(c(x[i + 1], x[i])),  # assignment if TRUE
           x[i] <- x[i])                                 # assignment if FALSE
    x <- sort(unique(x))  # get rid of the duplicated entry created when
                          # the ifelse condition was TRUE
    # increment i, or reset i to 1 if it gets too large
    ifelse(i == length(x), i <- 1, i <- i + 1)
  }
  return(x)
}
newd <- combine_points(d)
th <- 0.5
which(diff(newd) < th)
# integer(0)
Update: benchmarks of the solutions so far.
I benchmarked with a larger sample vector, and the Rcpp solution suggested by @Roland is slower than my first while loop when the vector gets long. I made an improvement to the initial while loop and wrote an Rcpp version of it, too. The benchmark results are below. Note that @flodel's answer is not directly comparable, because it takes a fundamentally different approach to combining, but it is definitely very fast.
set.seed(41); d <- seq(1:4000) + runif(4000, 0, 1)
library(microbenchmark)
microbenchmark(
  combine_points.Frank(d, th = 0.5),
  combine_points.Frank2(d, th = 0.5),
  combine_points_Roland(d, th = 0.5),
  combine_points_Roland2(d, th = 0.5))
Unit: milliseconds
expr min lq median uq max neval
combine_points.Frank(d, th = 0.5) 2115.6391 2154.5038 2174.5889 2193.8444 7884.1638 100
combine_points.Frank2(d, th = 0.5) 1298.2923 1323.2214 1341.5357 1357.4260 15538.0872 100
combine_points_Roland(d, th = 0.5) 2497.9106 2506.5960 2512.3591 2519.0036 2573.2854 100
combine_points_Roland2(d, th = 0.5) 494.8406 497.3613 498.2347 499.8777 544.9743 100
This is a considerable improvement over my first attempt; the improved R function is below, followed by an Rcpp version of the same algorithm, which is the fastest so far:
combine_points.Frank2 <- function(x, th = 0.5) {
  i <- 1
  while (min(diff(x)) < th) {
    if (x[i + 1] - x[i] >= th) {
      i <- i + 1
    } else {
      x[i] <- x[i + 1] <- mean(c(x[i + 1], x[i]))
      x <- unique(x)
    }
  }
  return(x)
}
Rcpp version
library(Rcpp)
cppFunction('
NumericVector combine_points_Roland2(NumericVector x, double th) {
  int i = 0;
  while (min(diff(x)) < th) {
    if ((x[i + 1] - x[i]) >= th) {
      i = i + 1;
    } else {
      x[i] = (x[i + 1] + x[i]) / 2;
      x[i + 1] = x[i];
      x = sort_unique(x);
    }
  }
  return x;
}
')

Here is something faster. It avoids resizing/copying x in the loop.
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector combine_points_Roland3(NumericVector x, double th) {
  int i = 0, j;
  int n(x.size());
  while (i < n - 1) {
    if ((x[i + 1] - x[i]) >= th) {
      i = i + 1;
    } else {
      x[i] = (x[i + 1] + x[i]) / 2;
      n = n - 1;
      // shift the remaining elements left by one instead of resizing x
      for (j = i + 1; j < n; j++) {
        x[j] = x[j + 1];
      }
    }
  }
  // copy the first n (still valid) elements into the result
  NumericVector y(n);
  for (i = 0; i < n; i++) {
    y[i] = x[i];
  }
  return y;
}
An R implementation of the same algorithm:
combine_points_Roland3R <- function(x, th) {
  i <- 1
  n <- length(x)
  while (i < n) {
    if ((x[i + 1] - x[i]) >= th) {
      i <- i + 1
    } else {
      x[i] <- (x[i + 1] + x[i]) / 2
      n <- n - 1
      # shift the remaining elements left by one (nothing to shift if the
      # merged pair was the last one, i.e. i == n after the decrement)
      if (i < n) x[(i + 1):n] <- x[(i + 2):(n + 1)]
    }
  }
  x[1:n]
}
set.seed(41); d <- seq(1:4000) + runif(4000, 0, 1)
x2 <- combine_points_Roland2(d, 0.5)
x3 <- combine_points_Roland3(d, 0.5)
all.equal(x2, x3)
# [1] TRUE
x4 <- combine_points_Roland3R(d, 0.5)
all.equal(x2, x4)
# [1] TRUE
Benchmarks:
library(microbenchmark)
microbenchmark(combine_points_Roland2(d, 0.5),
               combine_points_Roland3(d, 0.5),
               combine_points_Roland3R(d, 0.5))
# Unit: microseconds
# expr min lq median uq max neval
# combine_points_Roland2(d, 0.5) 126458.64 131414.592 132355.4285 133422.2235 147306.728 100
# combine_points_Roland3(d, 0.5) 121.34 128.269 140.8955 143.3595 393.582 100
# combine_points_Roland3R(d, 0.5) 17564.24 18626.878 19155.6565 20910.2935 68707.888 100

See if this does what you want:
combine_points <- function(x, th = 0.5) {
  group <- cumsum(c(FALSE, diff(x) > th))
  unname(sapply(split(x, group), mean))
}
combine_points(c(-1, 0.1, 0.2, 0.3, 1, 1.5, 2.0, 2.5, 3.0, 10), 0.5)
# [1] -1.0 0.2 2.0 10.0
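As noted in the benchmark update above, this grouping approach is not directly comparable to the sequential merging functions: it averages whole runs of points whose consecutive gaps are below the threshold in a single pass, whereas sequential merging can move a merged point far enough from its neighbour that no further combining happens. A small illustration (using combine_points.Frank2 from above for the sequential version):
x <- c(0, 0.4, 0.8)
combine_points(x, 0.5)         # grouping: all three points average to 0.4
combine_points.Frank2(x, 0.5)  # sequential merging: 0.2 0.8 (new gap 0.6 >= 0.5, so merging stops)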

Here is a translation of your function into Rcpp. It uses sugar functions, which are very convenient, but often there are faster alternatives (RcppEigen or RcppArmadillo are good for that). And of course the algorithm could be improved.
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector combine_points1(NumericVector x, double th) {
  int i = 0;
  while (min(diff(x)) < th) {
    if ((x[i + 1] - x[i]) < th) {
      x[i] = (x[i + 1] + x[i]) / 2;
      x[i + 1] = x[i];
    }
    x = sort_unique(x);
    if (i <= x.size()) {
      i = i + 1;
    } else {
      i = 1;
    }
  }
  return x;
}
I recommend using RStudio for writing Rcpp functions and sourcing them.
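Outside RStudio, a minimal way to compile the function above is Rcpp::sourceCpp(); the file name below is just an assumption for illustration:
library(Rcpp)
# assuming the C++ code above is saved as combine_points1.cpp
sourceCpp("combine_points1.cpp")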
all.equal(combine_points1(d, 0.5),
          combine_points(d, 0.5))
#[1] TRUE
library(compiler)
combine_points_comp <- cmpfun(combine_points)
library(microbenchmark)
microbenchmark(combine_points1(d, 0.5),
               combine_points_comp(d, 0.5),
               combine_points(d, 0.5))
# Unit: microseconds
# expr min lq median uq max neval
# combine_points1(d, 0.5) 652.772 664.6815 683.1315 714.653 1030.171 100
# combine_points_comp(d, 0.5) 8344.839 8692.0880 9010.1470 10627.049 14117.553 100
# combine_points(d, 0.5) 8996.768 9371.0805 9687.0235 10560.226 12800.831 100
A speed-up by a factor of 14 without real effort.

Faster alternative to for loops in R?

I am trying to calculate a transition probability (without the Markov assumption), which requires calculating this nested integration,
Note that the integrals can be replaced by summations in my case. Here is a toy example of the code I am using to calculate this:
# simulate some data
set.seed(99)
data <- data.frame(time = seq(0, 7, 0.1),
                   S_D = seq(1, 0.95, length.out = 71),
                   lam12 = sample(c(0, 0.1, 0.12, 0.15, 0.17), size = 71, replace = TRUE),
                   lam23 = sample(c(0, 0.05, 0.1, 0.08, 0.12), size = 71, replace = TRUE),
                   lam24 = sample(c(0, 0.02, 0.05, 0.06, 0.08), size = 71, replace = TRUE))

prob_123 <- c()  # initializing an empty vector
end <- nrow(data)
for (j in 2:end) {
  # j indicates u in the expression
  # k indicates v in the expression
  prob_123k <- 0
  for (k in (j + 1):end) {
    if (k == (j + 1)) {
      prob_123k <- prob_123k + data$S_D[j - 1] * data$lam12[j] * data$lam12[k - j]
    }
    if (k > (j + 1)) {
      prob_123k <- prob_123k + data$S_D[j - 1] * data$lam12[j] *
        prod(1 - (data$lam12[1:(k - j - 1)] + data$lam24[1:(k - j - 1)])) * data$lam12[k - j]
    }
  }
  prob_123[j - 1] <- prob_123k
}
sum(prob_123)  # result = 5.631623
In the code, S_D corresponds to the expression exp{-(\Lambda_{12}(u) + \Lambda_{13}(u) + \Lambda_{14}(u))} and prod(1 - (...)) corresponds to exp{-(\Lambda_{23}(v - u) + \Lambda_{24}(v - u))}. My original dataset is much bigger than this, and the nested for loops take a long time to run. Can anyone suggest a faster alternative? Thank you so much.
You could use a combination of cumprod and sum to make the loops faster.
Encapsulating your for loop in a function f:
f <- function(data) {
  end <- nrow(data)
  prob_123 <- vector("numeric", end)  # initializing a numeric vector
  for (j in 2:end) {
    # j indicates u in the expression
    # k indicates v in the expression
    prob_123k <- 0
    for (k in (j + 1):end) {
      if (k == (j + 1)) {
        prob_123k <- prob_123k + data$S_D[j - 1] * data$lam12[j] * data$lam12[k - j]
      }
      if (k > (j + 1)) {
        prob_123k <- prob_123k + data$S_D[j - 1] * data$lam12[j] * data$lam12[k - j] *
          prod(1 - (data$lam12[1:(k - j - 1)] + data$lam24[1:(k - j - 1)]))
      }
    }
    prob_123[j - 1] <- prob_123k
  }
  sum(prob_123)
}

f(data)
# [1] 5.631623
We note that there's an unnecessary re-computation of the running product - it can be computed once and then indexed appropriately. Rewriting the function as follows:
ff <- function(data) {
  end <- nrow(data)
  prob_123 <- vector("numeric", end)
  p <- cumprod(1 - (data$lam12 + data$lam24))  # running product, computed once
  for (j in 2:end) {
    # j indicates u in the expression
    # k indicates v in the expression
    if (j + 1 <= end) {
      prob_123k <- data$S_D[j - 1] * data$lam12[j] * data$lam12[(j + 1):end - j] *
        c(p[1], p[1:(end - j - 1)])
    }
    prob_123[j - 1] <- sum(prob_123k) +                        # equivalent to the k > j + 1 part
      data$S_D[j - 1] * data$lam12[j] * data$lam12[j + 1 - j]  # equivalent to the k = j + 1 part
  }
  sum(prob_123)
}

identical(f(data), ff(data))
# [1] TRUE
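For clarity, the identity the rewrite relies on is that cumprod returns every partial product at once, so the prod(...) inside the inner loop can be replaced by a single lookup (m = 10 below is an arbitrary index):
p <- cumprod(1 - (data$lam12 + data$lam24))
m <- 10
all.equal(p[m], prod(1 - (data$lam12[1:m] + data$lam24[1:m])))
# [1] TRUE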
We can then use the microbenchmark package to see if there's an improvement:
library(microbenchmark)
microbenchmark(f(data), ff(data))
Unit: microseconds
expr min lq mean median uq max neval
f(data) 8853.501 9020.902 10118.14 9410.351 10416.201 14392.401 100
ff(data) 344.701 356.401 373.86 367.301 384.701 466.401 100
The function ff is about 30x faster on average.
I'm sure code jockeys can optimise this even more, perhaps by getting rid of the j for loop as well.

How to improve processing time for euclidean distance calculation

I'm trying to calculate the weighted Euclidean distance (squared) between two data frames that have the same number of columns (variables) and different numbers of rows (observations).
The calculation follows the formula:
DIST[m, i] <- sum(((DATA1[m, ] - DATA2[i, ])^2) * lambda[1, ])
I specifically need to multiply each term of the summation by a specific weight (lambda).
The code provided below runs correctly, but if I use it in hundreds of iterations it takes a lot of processing time. Yesterday it took me 18 hours to create a graphic using multiple iterations of a function that contains this calculation. Using library(profvis) and profvis({ my code }), I saw that this specific part of the code takes up about 80% of the processing time.
I have read a lot about reducing processing time with parallel and vectorized operations, but I don't know how to implement them in this particular case because of the weight lambda.
Can someone help me reduce the processing time of this code?
More information about the code and the structure of the data can be found as comments in the code provided below.
# Data used to calculate the euclidean distances between each observation
# from DATA1 and each observation from DATA2.
# The euclidean distance is between a [600x50] and an [8x50] matrix, resulting
# in a [600x8] matrix.
DATA1 <- matrix(rexp(30000, rate = .1), ncol = 50)  # [600x50]
DATA2 <- matrix(rexp(400, rate = .1), ncol = 50)    # [8x50]

# Weights used for each of the 50 variables to calculate the weighted
# euclidean distance.
# Can be a vector of different weights or a scalar of the same weight
# for all variables.
lambda <- runif(n = 50, min = 0, max = 10)  ## length(lambda) > 1
# lambda <- 1                               ## length(lambda) == 1

if (length(lambda) > 1) {
  as.numeric(unlist(lambda))
  lambda <- as.matrix(lambda)
  lambda <- t(lambda)
}

nrows1 <- nrow(DATA1)
nrows2 <- nrow(DATA2)

# Euclidean Distance calculation
DIST <- matrix(NA, nrow = nrows1, ncol = nrows2)
for (m in 1:nrows1) {
  for (i in 1:nrows2) {
    if (length(lambda) == 1) {
      DIST[m, i] <- sum((DATA1[m, ] - DATA2[i, ])^2)
    }
    if (length(lambda) > 1) {
      DIST[m, i] <- sum(((DATA1[m, ] - DATA2[i, ])^2) * lambda[1, ])
    }
    next
  }
  next
}
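For reference, the profiling step mentioned above looks roughly like this (a sketch only; profvis opens an interactive report showing where the time is spent):
library(profvis)
profvis({
  for (m in 1:nrows1) {
    for (i in 1:nrows2) {
      DIST[m, i] <- sum(((DATA1[m, ] - DATA2[i, ])^2) * lambda[1, ])
    }
  }
})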
After all the suggestions, combining the answers from @MDWITT (for length(lambda) > 1) and @F. Privé (for length(lambda) == 1), the final solution takes only one minute to run, whereas the original took an hour and a half inside the larger code that uses this calculation. The final code for this problem, for those interested, is:
# Data used to calculate the euclidean distances between each observation from DATA1 and each observation from DATA2.
# The euclidean distance is between a [600x50] and an [8x50] matrix, resulting in a [600x8] matrix.
DATA1 <- matrix(rexp(30000, rate = .1), ncol = 50)  # [600x50]
DATA2 <- matrix(rexp(400, rate = .1), ncol = 50)    # [8x50]

# Weights used for each of the 50 variables to calculate the weighted euclidean distance.
# Can be a vector of different weights or a scalar of the same weight for all variables.
# lambda <- runif(n = 50, min = 0, max = 10)  ## length(lambda) > 1
lambda <- 1                                   ## length(lambda) == 1

nrows1 <- nrow(DATA1)
nrows2 <- nrow(DATA2)

# Euclidean Distance calculation
DIST <- matrix(NA, nrow = nrows1, ncol = nrows2)

if (length(lambda) > 1) {
  as.numeric(unlist(lambda))
  lambda <- as.matrix(lambda)
  lambda <- t(lambda)

  library(Rcpp)
  cppFunction('NumericMatrix weighted_distance (NumericMatrix x, NumericMatrix y, NumericVector lambda){
    int n_x = x.nrow();
    int n_y = y.nrow();
    NumericMatrix DIST(n_x, n_y);

    // begin the loop
    for (int i = 0; i < n_x; i++){
      for (int j = 0; j < n_y; j++) {
        double d = sum(pow(x.row(i) - y.row(j), 2) * lambda);
        DIST(i, j) = d;
      }
    }
    return DIST;
  }')

  DIST <- weighted_distance(DATA1, DATA2, lambda = lambda)
}

if (length(lambda) == 1) {
  DIST <- outer(rowSums(DATA1^2), rowSums(DATA2^2), '+') - tcrossprod(DATA1, 2 * DATA2)
}
Rewrite the problem to use linear algebra and vectorization, which is much faster than loops.
If you don't have lambda, this is just
outer(rowSums(DATA1^2), rowSums(DATA2^2), '+') - tcrossprod(DATA1, 2 * DATA2)
With lambda, it becomes
outer(drop(DATA1^2 %*% lambda), drop(DATA2^2 %*% lambda), '+') -
tcrossprod(DATA1, sweep(DATA2, 2, 2 * lambda, '*'))
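As a quick sanity check of the weighted formula (the small matrices A, B and weights w below are purely illustrative stand-ins for DATA1, DATA2 and lambda), the vectorized expression can be compared against the original double loop:
set.seed(1)
A <- matrix(rexp(20 * 5, rate = .1), ncol = 5)
B <- matrix(rexp(4 * 5, rate = .1), ncol = 5)
w <- runif(5)

D_vec <- outer(drop(A^2 %*% w), drop(B^2 %*% w), '+') -
  tcrossprod(A, sweep(B, 2, 2 * w, '*'))

D_loop <- matrix(NA_real_, nrow(A), nrow(B))
for (m in seq_len(nrow(A)))
  for (i in seq_len(nrow(B)))
    D_loop[m, i] <- sum((A[m, ] - B[i, ])^2 * w)

all.equal(D_vec, D_loop)
# [1] TRUE (up to floating-point tolerance)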
Here is an alternate way using Rcpp, just to document the concept. In a file called euclidean.cpp I have:
#include <Rcpp.h>
#include <cmath>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix weighted_distance (NumericMatrix x, NumericMatrix y, NumericVector lambda){
  int n_x = x.nrow();
  int n_y = y.nrow();
  NumericMatrix out(n_x, n_y);

  // begin the loop
  for (int i = 0; i < n_x; i++){
    for (int j = 0; j < n_y; j++) {
      double d = sum(pow(x.row(i) - y.row(j), 2) * lambda);
      out(i, j) = d;
    }
  }
  return out;
}
In R, I then have:
library(Rcpp)
sourceCpp("libs/euclidean.cpp")

# Generate Data
DATA1 <- matrix(rexp(30000, rate = .1), ncol = 50)  # [600x50]
DATA2 <- matrix(rexp(400, rate = .1), ncol = 50)    # [8x50]
lambda <- runif(n = 50, min = 0, max = 10)

# Run the program
out <- weighted_distance(DATA1, DATA2, lambda = lambda)
When I test the speed using:
microbenchmark(
  Rcpp_way = weighted_distance(DATA1, DATA2, lambda = lambda),
  other = {
    nrows1 <- nrow(DATA1)
    nrows2 <- nrow(DATA2)
    DIST <- matrix(NA, nrow = nrows1, ncol = nrows2)
    for (m in 1:nrows1) {
      for (i in 1:nrows2) {
        if (length(lambda) == 1) {
          DIST[m, i] <- sum((DATA1[m, ] - DATA2[i, ])^2)
        }
        if (length(lambda) > 1) {
          DIST[m, i] <- sum(((DATA1[m, ] - DATA2[i, ])^2) * lambda)
        }
      }
    }
  }, times = 100)
You can see that it is a good clip faster:
Unit: microseconds
expr min lq mean median uq max neval
Rcpp_way 446.769 492.308 656.9849 562.667 846.9745 1169.231 100
other 24688.821 30681.641 44153.5264 37511.385 50878.3585 200843.898 100

Speeding up repeated generation of Exponential random variables in loop

I am implementing an algorithm and, as part of it, I need to generate exponential random variables. Unfortunately, I can't really avoid looping, as each generated random variable depends on the previous one, so I think vectorisation is out of the question. There are some calculations I do around the generation, but the bottleneck (at present) is the generation itself. At this point I am assuming N will be large (N >= 1,000,000).
Here is some example code:
N <- 1e7

# Preallocate
x <- rep(0, times = N)

# Set a starting value
x[1] <- runif(1)

for (i in 2:N) {
  # Do some calculations
  x[i] <- x[i - 1] + rexp(1, x[i - 1])  # Bottleneck
  # Do some more calculations
}
How can I speed this up? I've tried implementing in Rcpp, but it doesn't seem to do much in this case. Is there another clever way I can get around the rexp() call in each iteration?
We can use the fact that if X ~ Exp(λ) then kX ~ Exp(λ/k) (source: Wikipedia) to speed up the code. This way we can do all the random draws with rate = 1 up front and then just divide within the loop to scale them appropriately.
draws <- rexp(N, rate = 1)
x <- rep(0, times = N)
x[1] <- runif(1)

for (i in 2:N) {
  # Do some calculations
  x[i] <- x[i - 1] + draws[i] / x[i - 1]
  # Do some more calculations
}
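A quick, purely illustrative check of the scaling property (rate and sample size chosen arbitrarily). Note that the rewritten loop consumes the random number stream differently, so for a given seed it matches the original only in distribution, not element by element:
set.seed(1)
k <- 3
mean(rexp(1e5, rate = 1) / k)  # both should be close to the theoretical mean 1/k = 0.333...
mean(rexp(1e5, rate = k))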
A microbenchmark with N = 1e6 values shows this to be about 14x faster:
N <- 1e6
draws <- rexp(N, rate = 1)
x <- rep(0, times = N)
x[1] <- runif(1)

microbenchmark::microbenchmark(
  draw_up_front = {
    draws <- rexp(N, rate = 1)
    for (i in 2:N)
      x[i] <- x[i - 1] + draws[i] / x[i - 1]
  },
  draw_one_at_time = {
    for (i in 2:N)
      x[i] <- x[i - 1] + rexp(1, x[i - 1])
  },
  times = 10
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# draw_up_front 153.9547 156.6552 159.9622 160.1901 161.9803 167.2831 10 a
# draw_one_at_time 2207.1997 2212.0460 2280.1265 2236.5197 2332.9913 2478.5104 10 b
A brute-force Rcpp solution:
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector genExp(int N) {
  NumericVector res(N);
  double prev;
  res[0] = unif_rand();                 // starting value, as in the R code
  for (int i = 1; i < N; i++) {
    prev = res[i - 1];
    res[i] = prev + exp_rand() / prev;  // exp_rand() draws from Exp(1)
  }
  return res;
}
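Assuming the code above is saved in a file (the name genExp.cpp is purely illustrative), it can be compiled and called from R like this; set.seed() also governs unif_rand() and exp_rand(), so the output is reproducible:
library(Rcpp)
sourceCpp("genExp.cpp")  # compiles the file and exports genExp() into the session
set.seed(42)
x <- genExp(1e6)
head(x)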
Microbenchmark with N = 1e6:
Unit: milliseconds
expr min lq mean median uq max neval
draw_up_front 167.17031 168.57345 170.62292 170.18072 171.73782 175.46868 20
draw_one_at_time 1415.01898 1465.57139 1510.81220 1502.15753 1550.07829 1623.70831 20
rcpp 28.25466 29.33682 33.52528 29.89636 30.74908 94.38009 20
With N = 1e7:
Unit: milliseconds
expr min lq mean median uq max neval
draw_up_front 1698.730 1708.739 1737.8633 1716.1345 1752.3276 1923.3940 20
rcpp 297.142 319.794 338.6755 327.6626 364.6308 398.1554 20

Complicated double sum using "outer"

I have to calculate the following in R,
where k, p, c are constants. One way of doing this is:
xfun <- function(x, k, p, c){
  ghhh <- numeric()
  for (i in 1:length(x)) {
    ghhh[i] <- sum(k / (x[i] - x[1:(i - 1)] + c)^p)
  }
  res <- sum(log(ghhh))
  return(res)
}
But can I calculate this using outer, so that it becomes faster?
The data is like:
t <- numeric(2000)
t[1] <- 0
for (i in 2:2000) {
  t[i] <- t[i - 1] + rexp(1, 0.2)
}
k <- 0.5; p <- 1.2; c <- 0.02
Your equation is a bit confusing. I'm not sure what should happen in the inner sum if i == 1. Sum from 1 to zero?
Based on some guessing (if I guessed wrong, you need to adjust the following), I suspect your function should be corrected to this:
xfun <- function(x, k, p, c){
  ghhh <- numeric()  # it would be better practice to use numeric(length(x) - 1)
  for (i in 1:(length(x) - 1)) {
    ghhh[i] <- sum(k / (x[i + 1] - x[1:i] + c)^p)
  }
  res <- sum(log(ghhh))
  return(res)
}

t <- numeric(2000)
t[1] <- 0
set.seed(42)
for (i in 2:2000) {
  t[i] <- t[i - 1] + rexp(1, 0.2)
}
k <- 0.5; p <- 1.2; c <- 0.02

xfun(t, k, p, c)
#[1] -1526.102
Rewritten with outer:
xfun1 <- function(x, k, p, c){
  o <- outer(seq_along(x), seq_along(x), function(i, j) {
    res <- numeric(length(i))
    test <- j < i
    res[test] <- k / (x[i[test]] - x[j[test]] + c)^p
    res
  })
  sum(log(rowSums(o)[-1]))
}

xfun1(t, k, p, c)
#[1] -1526.102
Benchmarking:
library(microbenchmark)
microbenchmark(loop = xfun(t, k, p, c),
               outer = xfun1(t, k, p, c),
               times = 10)
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# loop 186.0454 186.2375 188.9567 187.4005 189.0597 196.6992 10 a
# outer 263.4137 274.6610 346.4505 344.6918 423.3651 425.2885 10 b
As you can see, the solution with outer is not faster for data of this size. The main reasons are that R needs to allocate memory for a vector of length 2000^2 and then operate on that large vector. Also, your simple loop is optimized by the JIT byte-code compiler.
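For a rough sense of scale (back-of-the-envelope only), the intermediate matrix created by outer holds 2000^2 double-precision values:
2000^2 * 8 / 2^20  # bytes of doubles converted to MiB: roughly 30.5 MiB for the outer() matrix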
If you want to be faster, you need to switch to a compiled language. Luckily, this is rather easy with Rcpp:
library(Rcpp)
library(inline)
cppFunction(
  'double xfun2(const NumericVector x, const double k, const double p, const double c) {
    int n = x.length() - 1;
    NumericVector ghhh(n);
    for (int i = 0; i < n; ++i) {
      for (int j = 0; j <= i; ++j) {
        ghhh(i) += k / pow(x(i + 1) - x(j) + c, p);
      }
    }
    ghhh = log(ghhh);
    double res;
    res = sum(ghhh);
    return res;
  }')
xfun2(t, k, p, c)
#[1] -1526.102

microbenchmark(loop = xfun(t, k, p, c),
               outer = xfun1(t, k, p, c),
               Rcpp = xfun2(t, k, p, c),
               times = 10)
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# loop 186.0395 188.7875 189.7487 189.9298 191.6967 192.7213 10 b
# outer 408.4452 416.7730 432.3356 419.7510 422.4000 559.4279 10 c
# Rcpp 136.1496 136.1606 136.1929 136.1762 136.2129 136.3089 10 a
As you can see, the speed gains are minimal for data of this size (JIT compilation is truly marvelous). I suggest staying with your R loop.
Assuming the logic you implemented is the correct one, you could try R's parallel functionality:
library(foreach)
library(doParallel)

xfun2 <- function(x, k, p, c){
  no_cores <- detectCores() - 1
  cl <- makeCluster(no_cores)
  registerDoParallel(cl)
  ghhh <- foreach(i = 1:length(x), .combine = c) %dopar%
    sum(k / (x[i] - x[1:(i - 1)] + c)^p)
  stopCluster(cl)  # release the workers when done
  res <- sum(log(ghhh))
  res
}
I ran it with x <- rnorm(100000, 1, 0.5) and the parallel version was almost twice as fast.
You can read more in the doParallel package documentation.

Greedy optimization in R

I am trying to replicate Caruana et al.'s method for Ensemble selection from libraries of models (pdf). At the core of the method is a greedy algorithm for adding models to the ensemble (models can be added more than once). I've written an implementation of this greedy optimization algorithm, but it is very slow:
library(compiler)

set.seed(42)
X <- matrix(runif(100000 * 10), ncol = 10)
Y <- rnorm(100000)

greedOpt <- cmpfun(function(X, Y, iter = 100){
  weights <- rep(0, ncol(X))
  while (sum(weights) < iter) {
    errors <- sapply(1:ncol(X), function(y){
      newweights <- weights
      newweights[y] <- newweights[y] + 1
      pred <- X %*% (newweights) / sum(newweights)
      error <- Y - pred
      sqrt(mean(error^2))
    })
    update <- which.min(errors)
    weights[update] <- weights[update] + 1
  }
  return(weights / sum(weights))
})

system.time(a <- greedOpt(X, Y))
I know R doesn't do loops well, but I can't think of any way to do this type of stepwise search without a loop.
Any suggestions for improving this function?
Here is an R implementation that is 30% faster than yours. It is not as fast as your Rcpp version, but maybe it will give you ideas that, combined with Rcpp, speed things up further. The two main improvements are:
- the sapply loop has been replaced by a matrix formulation
- the matrix multiplication has been replaced by a recursion
greedOpt <- cmpfun(function(X, Y, iter = 100L){
  N           <- ncol(X)
  weights     <- rep(0L, N)
  pred        <- 0 * X
  sum.weights <- 0L

  while (sum.weights < iter) {
    sum.weights <- sum.weights + 1L
    # column j of pred is the ensemble prediction if model j were added next
    pred   <- (pred + X) * (1L / sum.weights)
    errors <- sqrt(colSums((pred - Y)^2))
    best   <- which.min(errors)
    weights[best] <- weights[best] + 1L
    # keep only the running (unscaled) sum of the selected columns
    pred <- pred[, best] * sum.weights
  }
  return(weights / sum.weights)
})
Also, I maintain that you should try upgrading to the ATLAS BLAS library. You might see significant improvements.
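As a side note (a quick check, not part of the original answer), recent versions of R report which BLAS/LAPACK libraries they are linked against, which makes it easy to confirm whether an optimized BLAS such as ATLAS is actually being used:
sessionInfo()  # in recent R versions the output lists the BLAS and LAPACK libraries in use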
I took a shot at writing an Rcpp version of this function:
library(Rcpp)
cppFunction('
  NumericVector greedOptC(NumericMatrix X, NumericVector Y, int iter) {
    int nrow = X.nrow(), ncol = X.ncol();
    NumericVector weights(ncol);
    NumericVector newweights(ncol);
    NumericVector errors(nrow);
    double RMSE;
    double bestRMSE;
    int bestCol;

    for (int i = 0; i < iter; i++) {
      bestRMSE = -1;
      bestCol = 1;
      for (int j = 0; j < ncol; j++) {
        newweights = weights + 0;
        newweights[j] = newweights[j] + 1;
        newweights = newweights / sum(newweights);
        NumericVector pred(nrow);
        for (int k = 0; k < ncol; k++){
          pred = pred + newweights[k] * X( _, k);
        }
        errors = Y - pred;
        RMSE = sqrt(mean(errors * errors));
        if (RMSE < bestRMSE || bestRMSE == -1){
          bestRMSE = RMSE;
          bestCol = j;
        }
      }
      weights[bestCol] = weights[bestCol] + 1;
    }
    weights = weights / sum(weights);
    return weights;
  }
')
It's more than twice as fast as the R version:
set.seed(42)
X <- matrix(runif(100000 * 10), ncol = 10)
Y <- rnorm(100000)

system.time(a <- greedOpt(X, Y, 1000))
#   user  system elapsed
#  36.19    6.10   42.40

system.time(b <- greedOptC(X, Y, 1000))
#   user  system elapsed
#  16.50    1.44   18.04

all.equal(a, b)
# [1] TRUE
Not bad, but I was hoping for a bigger speedup when making the leap from R to Rcpp. This is one of the first Rcpp functions I've ever written, so perhaps further optimization is possible.
