Currently I have the following data.table :
item city dummyvar
A Austin 1
A Austin 1
A Austin 100
B Austin 2
B Austin 2
B Austin 200
A NY 1
A NY 1
A NY 100
B NY 2
B NY 2
B NY 200
and I have a user-defined function called ImbalancePoints, which is applied to dummyvar and returns the rows where it detects an abrupt change in dummyvar. The way I am doing this is as follows:
my.data.table[,
.(item, city , imb.points = list(unique(try(ImbalancePoints(dummyvar), silent = T))) ),
by = .(city, item)
]
And for the NY case let's say that I get a data.table object like the following:
item city imb.points
A NY 3,449
where imb.points is a list column, and for this example the numbers 3 and 449 denote the rows where there is an abrupt change for the case of city = NY and item = A. However, the problem I am facing is that I have approx. 3000 different items across 12 different cities, and the calculation is taking a very long time. I was wondering if you could give me an idea of how to vectorize/speed up this calculation, since the last time I attempted it, it ran for almost 2 hours and didn't finish.
I don't know if it's of any help, but I am also attaching the ImbalancePoints function:
library(pracma)
ImbalancePr <- function(eval.column) {
n <- length(eval.column)
imbalance <- rep(0, n)
b_t = rep(0,n)
elem_diff <- diff(eval.column)
for(i in 2:n)
{
imbalance[i] <- sign(elem_diff[i-1]) * (elem_diff[i-1] != 0)
+ imbalance[i-1]*(elem_diff[i-1] == 0)
}
return(imbalance)
}
ImbalancePoints <- function(eval.column, w0 = 100, bkw_T = 10, bkw_b = 10){
bv_t <- ImbalancePr(eval.column)
w0 <- min(min(which(cumsum(bv_t) != 0)), w0)
Tstar <- w0
E0t <- Tstar
repeat{
Tlast <- sum(Tstar)
nbt <- min(bkw_b, Tlast-1)
P <- pracma::movavg(bv_t[1:Tlast], n = nbt, type = "e")
P <- tail(P,1)
bv_t_expected <- E0t * abs(P)
bv_t_cumsum <- abs(cumsum(bv_t[-(1:Tlast)]))
if(max(bv_t_cumsum) < bv_t_expected){break}else{
Tnew <- min(which(bv_t_cumsum >= bv_t_expected))
}
Tlast <- Tlast + Tnew
if(Tlast > length(eval.column)[1]){break}else{
Tstar <- c(Tstar,Tnew)
if(length(Tstar) <= 2){
E0t <- mean(Tstar)
}else{
nt <- min(bkw_T,length(Tstar)-1)
E0t <- pracma::movavg(Tstar[1:length(Tstar)], n = nt, type = "e")
E0t <- tail(E0t,1)
}
}
}
return(sort(unique(Tstar)))
}
EDIT: Thanks to Paul's insight, my problem reduces to vectorizing the repeat loop inside the ImbalancePoints function. However, I am not very proficient at coding and I don't see a straightforward solution. If someone could give me a suggestion, or if you know of an auxiliary function/library, I would appreciate it.
This posting consists of several sections addressing different issues:
Vectorizing ImbalancePr()
Profiling ImbalancePoints()
Speeding-up movavg() with Rcpp by a factor of 4
Vectorizing ImbalancePr()
I believe ImbalancePr() can be replaced by
fImbalancePr <- function(x) c(0, sign(diff(x)))
At least, it returns the same result when benchmarked (with the results checked):
library(bench)
library(ggplot2)
bm <- press(
n = c(10, 100, 1000, 10000),
{
x <- rep(0, n)
set.seed(123)
x[sample(n, n/5)] <- 100
print(table(x))
mark(
ImbalancePr(x),
fImbalancePr(x)
)
}
)
Running with:
      n
1    10
x
   0 100
   8   2
2   100
x
   0 100
  80  20
3  1000
x
   0 100
 800 200
4 10000
x
    0  100
 8000 2000
autoplot(bm)
fImbalancePr() is always faster than OP's original version. The speed advantage increases with vector length.
Profiling ImbalancePoints()
However, this improvement does not have much impact on the overall performance of ImbalancePoints():
library(bench)
library(ggplot2)
bm <- press(
n = c(10L, 100L, 1000L),
{
x <- replace(rep(0, n), n, 100)
y <- c(rep(2, n), rep(-3, n), rep(5, n))
mark(
original = {
list(
ImbalancePoints(x),
ImbalancePoints(y)
)
},
modified = {
list(
fImbalancePoints(x),
fImbalancePoints(y)
)
}
)
}
)
fImbalancePoints() is a variant of ImbalancePoints() where ImbalancePr() has been replaced by fImbalancePr().
autoplot(bm)
There is a minor improvement but this does not help to cut down the reported execution time of 2 hours significantly.
We can use profvis to identify where the time is spent within ImbalancePoints():
library(profvis)
x <- c(rep(0, 480L), rep(c(0:9, 9:0), 2L), rep(0, 480L))
profvis({
for (i in 1:100) fImbalancePoints(x)
})
Timings are collected by sampling; therefore, a sufficient number of repetitions is required to get good coverage.
The results from one run are shown in this screenshot from RStudio:
movavg() consumes 25% of the time spent in ImbalancePoints().
According to the profiling, another 20% is spent in the double colon operator in pracma::movavg(). It might be worthwhile to test whether there is a speedup from loading the pracma package beforehand with library(pracma).
10% are spent in calls to tail(). tail(x, 1) can be replaced by x[length(x)], which is more than an order of magnitude faster.
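Both points are easy to check with a small benchmark. A minimal sketch (assuming pracma is installed; absolute numbers will differ between machines):
library(bench)
library(pracma)  # attach the package so movavg() can be called without `::`
x <- rnorm(1000)
# cost of the `::` lookup vs. a plain call to the attached function
mark(
  pracma::movavg(x, n = 10, type = "e"),
  movavg(x, n = 10, type = "e")
)
# tail() vs. direct indexing of the last element
mark(
  tail(x, 1),
  x[length(x)]
)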
If we look at the code of movavg() by typing pracma::movavg (without parentheses), we see that there is an iterative loop which cannot be vectorized:
...
else if (type == "e") {
a <- 2/(n + 1)
y[1] <- x[1]
for (k in 2:nx) y[k] <- a * x[k] + (1 - a) * y[k - 1]
}
...
In addition, only the last value of the time series created by the call to movavg() is used. So, there might be two options for performance improvements here:
Choose a different weighted means function which uses only data points within a limited window.
Re-implement movavg() in C++ using Rcpp.
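Since only the last EMA value is needed, the recurrence can also be folded in pure R with Reduce() instead of building the whole series (a minimal sketch, not benchmarked here):
ema_last <- function(x, n) {
  a <- 2 / (n + 1)
  # fold y <- a * x[k] + (1 - a) * y over x, starting from y = x[1],
  # keeping only the final value
  Reduce(function(y, xk) a * xk + (1 - a) * y, x)
}
# should match tail(pracma::movavg(x, n, type = "e"), 1)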
Speeding-up movavg() with Rcpp
Replacing the call to pracma::movavg() and the subsequent call to tail() by an Rcpp function, we can gain a speed-up of up to a factor of 4 for ImbalancePoints() overall.
EMA_last_cpp(x, n) replaces tail(pracma::movavg(x, n, type = "e"), 1)
library(Rcpp)
cppFunction("
double EMA_last_cpp(const NumericVector& x, const int n) {
int nx = x.size();
double a = 2.0 / (n + 1.0);
double b = 1.0 - a;
double y;
y = x[0];
for(int k = 1; k < nx; k++){
y = a * x[k] + b * y;
}
return y;
}
")
Now, we can modify ImbalancePoints() accordingly. In addition, the call to ImbalancePr() is replaced and the code is modified in two other places (see comments):
fImbalancePoints <-
function(eval.column,
w0 = 100,
bkw_T = 10,
bkw_b = 10) {
# bv_t <- ImbalancePr(eval.column)
bv_t <- c(0, sign(diff(eval.column)))
# w0 <- min(min(which(cumsum(bv_t) != 0)), w0)
w0 <- min(which(bv_t != 0)[1L], w0) # pick first change point
Tstar <- w0
E0t <- Tstar
repeat {
Tlast <- sum(Tstar)
# remove warning:
# In max(bv_t_cumsum) : no non-missing arguments to max; returning -Inf
if (Tlast >= length(bv_t)) break
nbt <- min(bkw_b, Tlast - 1)
# P <- movavg(bv_t[1:Tlast], n = nbt, type = "e")
# P <- tail(P, 1)
P <- EMA_last_cpp(bv_t[1:Tlast], n = nbt)
bv_t_expected <- E0t * abs(P)
bv_t_cumsum <- abs(cumsum(bv_t[-(1:Tlast)]))
if (max(bv_t_cumsum) < bv_t_expected) {
break
} else{
Tnew <- min(which(bv_t_cumsum >= bv_t_expected))
}
Tlast <- Tlast + Tnew
if (Tlast > length(eval.column)[1]) {
break
} else{
Tstar <- c(Tstar, Tnew)
if (length(Tstar) <= 2) {
E0t <- mean(Tstar)
} else{
nt <- min(bkw_T, length(Tstar) - 1)
# E0t <- movavg(Tstar[1:length(Tstar)], n = nt, type = "e")
# E0t <- tail(E0t, 1)
E0t <- EMA_last_cpp(Tstar[1:length(Tstar)], n = nt)
}
}
}
return(sort(unique(Tstar)))
}
The benchmark
library(bench)
library(ggplot2)
bm <- press(
n = c(10L, 100L, 1000L),
{
x <- replace(rep(0, n), n, 100)
y <- c(rep(2, n), rep(-3, n), rep(5, n))
mark(
original = {
list(
ImbalancePoints(x),
ImbalancePoints(y)
)
},
modified = {
list(
fImbalancePoints(x),
fImbalancePoints(y)
)
},
min_time = 1
)
}
)
bm
expression n min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
<bch:expr> <int> <bch:t> <bch:> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
1 original 10 315.1us 369us 2318. 2.66KB 4.16 2231 4 962.49ms
2 modified 10 120us 136us 6092. 195.11KB 5.31 5733 5 940.99ms
3 original 100 583.4us 671us 1283. 55.09KB 4.16 1234 4 961.78ms
4 modified 100 145.4us 167us 5146. 47.68KB 4.19 4916 4 955.29ms
5 original 1000 438.1ms 469ms 2.17 157.37MB 4.33 3 6 1.38s
6 modified 1000 97.1ms 103ms 9.53 152.09MB 17.1 10 18 1.05s
shows that the modified version is about 3 to 5 times faster than the original version. This may help the OP to reduce the compute time for their production dataset from 2+ hours by a significant factor.
I am trying to sum the odd numbers below a specific number N (excluding N itself), for example: N = 5, then 1 + 3 = 4.
a<-5
sum<-function(x){
k<-0
for (n in x) {
if(n %% 2 == 1)
k<-k+1
}
return(k)
}
sum(a)
# [1] 1
But the function is not working, because it counts the odd numbers instead of summing them.
We may use a vectorized approach:
a1 <- head(seq_len(a), -1)
sum(a1[a1%%2 == 1])
[1] 4
If we want a loop, perhaps
f1 <- function(x) {
s <- 0
k <- 1
while(k < x) {
if(k %% 2 == 1) {
s <- s + k
}
k <- k + 1
}
s
}
f1(5)
The issue in OP's code is
for(n in x)
where x is just a single value, so the body is executed only once - i.e. if our input is 5, the loop runs once and 'n' is 5. Instead, iterate over seq_len(x - 1). The correct loop would be something like
f2<-function(x){
k<- 0
for (n in seq_len(x-1)) {
if(n %% 2 == 1) {
k <- k + n
}
}
k
}
f2(5)
NOTE: sum is a base R function, so it is better to give the custom function a different name.
Mathematically, we can try the following code to calculate the sum (N could be odd or even)
(ceiling((N - 1) / 2))^2
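The reasoning: the odd numbers below N are 1, 3, 5, ..., there are ceiling((N - 1) / 2) of them, and the sum of the first k odd numbers is k^2. A quick brute-force check (a sketch):
N <- 1:8
brute <- sapply(N, function(n) {
  k <- seq_len(n - 1)
  sum(k[k %% 2 == 1])  # sum of odd numbers below n
})
all(brute == (ceiling((N - 1) / 2))^2)
# [1] TRUE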
It's simple and it does what it says:
sum(seq(1, length.out = floor(N/2), by = 2))
The multiplication solution is probably gonna be quicker, though.
NB - an earlier version of this answer was
sum(seq(1, N - 1, 2))
which as #tjebo points out, silently gives the wrong answer for N = 1.
We could use a recycled logical vector to subset every other value (the odd positions, which here are exactly the odd numbers):
a <- 5
a1 <- head(seq_len(a), -1)
sum(a1[c(TRUE, FALSE)])
output:
[1] 4
Fun benchmarking. Is it a surprise that Thomas' simple formula is by far the fastest solution...?
count_odds_thomas <- function(x){
(ceiling((x - 1) / 2))^2
}
count_odds_akrun <- function(x){
a1 <- head(seq_len(x), -1)
sum(a1[a1%%2 == 1])
}
count_odds_dash2 <- function(x){
sum(seq(1, x - 1, 2))
}
m <- microbenchmark::microbenchmark(
akrun = count_odds_akrun(10^6),
dash2 = count_odds_dash2(10^6),
thomas = count_odds_thomas(10^6)
)
m
#> Unit: nanoseconds
#> expr min lq mean median uq max neval
#> akrun 22117564 26299922.0 30052362.16 28653712 31891621 70721894 100
#> dash2 4016254 4384944.0 7159095.88 4767401 8202516 52423322 100
#> thomas 439 935.5 27599.34 6223 8482 2205286 100
ggplot2::autoplot(m)
#> Coordinate system already present. Adding new coordinate system, which will replace the existing one.
Moreover, Thomas' solution works on really big numbers (also no surprise)... on my machine, count_odds_akrun exhausts the memory at a “mere” 10^10, but Thomas' works fine till infinity…
count_odds_thomas(10^10)
#> [1] 2.5e+19
count_odds_akrun(10^10)
#> Error: vector memory exhausted (limit reached?)
UPDATE: I reduced my code to the pivotal elements to shorten it
The function impact_calc is very slow (26 seconds for a 100,000-record data frame). I think the main reason is the for loop (maybe apply or map would help?). Below I simulate the data, define the impact_calc function, and record the run time.
library(dplyr)
library(data.table)
library(reshape2)
###########################################################
# Start Simulate Data
###########################################################
BuySell <- function(m = 40, s = 4) {
S <- pmax(round(rnorm(10, m, s), 2), 0)
S.sorted <- sort(S)
data.frame(buy = rev(head(S.sorted, 5)), sell = tail(S.sorted, 5))
}
number_states <- 10000
lst <- replicate(number_states, BuySell(), simplify = FALSE)
# assemble prices data frame
prices <- as.data.frame(data.table::rbindlist(lst))
prices$state_id <- rep(1:number_states, each = 5)
prices$level <- rep(1:5, times = number_states)
prices$quantities <- round(runif(number_states * 5, 100000, 1000000), 0)
# reshape to long format
prices_long <- reshape2::melt(prices,
id.vars = c("state_id", "quantities", "level"),
value.name = "price"
) %>%
rename("side" = "variable") %>%
setDT()
###########################################################
# End Simulate Data
###########################################################
Here is the function impact_calc which is very slow.
##########################################################
# function to optimize
impact_calc <- function(data, required_quantity) {
# get best buy and sell
best_buy <- data[, ,.SDcols = c("price", "side", "level")][side == "buy" & level == 1][1, "price"][[1]]
best_sell <- data[, ,.SDcols = c("price", "side", "level")][side == "sell" & level == 1][1, "price"][[1]]
# calculate mid
mid <- 0.5 * (best_buy + best_sell)
# buys
remaining_qty <- required_quantity
impact <- 0
data_buy <- data[, ,][side == "buy"]
levels <- data_buy[, ,][side == "buy"][, level]
# i think this for loop is slow!
for (level in levels) {
price_difference <- mid - data_buy$price[level]
if (data_buy$quantities[level] >= remaining_qty) {
impact <- impact + remaining_qty * price_difference
remaining_qty <- 0
break
} else {
impact <- impact + data_buy$quantities[level] * price_difference
remaining_qty <- remaining_qty - data_buy$quantities[level]
}
}
rel_impact <- impact / required_quantity / mid
return_list <- list("relative_impact" = rel_impact)
}
The results with run time:
start_time <- Sys.time()
impact_buys <- prices_long[, impact_calc(.SD, 600000), by = .(state_id)]
end_time <- Sys.time()
end_time - start_time
# for 100000 data frame it takes
#Time difference of 26.54057 secs
Thanks for your help!
OP's suspicion is correct: by replacing the for loop with vector operations, we can speed up the calculation by a factor of over 100:
required_quantity <- 600000
setDT(prices)
library(bench)
mark(
orig = prices_long[, impact_calc(.SD, required_quantity), by = .(state_id)],
mod1 = prices_long[, impact_calc2(.SD, required_quantity), by = .(state_id)],
vec_w = prices[, {
mid <- 0.5 * (buy[1L] + sell[1L])
tmp <- cumsum(quantities) - required_quantity
list(relative_impact =
sum(pmin(quantities, pmax(0, quantities - tmp)) * (mid - buy)) /
required_quantity / mid)
}, by = .(state_id)],
min_time = 1.0
)
# A tibble: 3 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <lis> <list>
1 orig 28.1s 28.1s 0.0356 2.21GB 1.39 1 39 28.1s <data.ta~ <Rprofme~ <bch~ <tibb~
2 mod1 13.1s 13.1s 0.0762 658.42MB 1.45 1 19 13.12s <data.ta~ <Rprofme~ <bch~ <tibb~
3 vec_w 175.1ms 196.9ms 5.19 440.19KB 2.59 6 3 1.16s <data.ta~ <Rprofme~ <bch~ <tibb~
In addition to the speed-up, the vectorized version vec_w allocates far less memory (about 5000 times less).
Note that the vectorized version vec_w is using the original prices dataset in wide format. So, there is no need to reshape the data from wide to long format.
The second benchmark case mod1 is a version of impact_calc() where the code outside of the for loop has been modified to make better use of the data.table syntax. These minor modifications alone account for a speed-up by a factor of 2.
The results are identical, which is verified by mark().
Explanation of vec_w
If I understand correctly, the OP considers quantities in the given level order until required_quantity is reached. The last level is considered only partially, to the extent required to meet required_quantity exactly.
In a vectorised version this can be achieved by nested ifelse() as shown in this example:
library(data.table)
r <- 5
dt <- data.table(q = 1:4)
dt[, csq := cumsum(q)]
dt[, tmp := csq - r]
dt[, aq := ifelse(tmp < 0, q, ifelse(q - tmp > 0, q - tmp, 0))][]
q csq tmp aq
1: 1 1 -4 1
2: 2 3 -2 2
3: 3 6 1 2
4: 4 10 5 0
The temporary variable tmp holds the difference between the cumulated sum of quantities q and the required quantity r.
The first ifelse() tests if the cumulated sum of quantities q is below the required quantity r. If so, q is used without deduction. If not, only that part of q is used which is required to fill the cumulated sum of actual quantities aq up to the required quantity r.
The second ifelse() ensures that the quantity q minus the deduction is positive (which is the case for the incomplete level) or zero (for the remaining levels below).
The actual quantities aq = c(1, 2, 2, 0) derived by the previous steps do sum up to the requested quantity r = 5.
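A quick check (sketch):
dt[, sum(aq)]
# [1] 5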
Now, the ifelse() constructs can be replaced by pmin() and pmax():
dt[, aq := pmin(q, pmax(q - tmp, 0))]
I have verified in a separate benchmark (not posted here) that the pmin()/pmax() approach is slightly faster than the nested ifelse().
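For reference, a minimal sketch of such a comparison (not the original benchmark; numbers will vary by machine):
library(bench)
q <- runif(1e6, 0, 100)
tmp <- cumsum(q) - 5e7
mark(
  nested_ifelse = ifelse(tmp < 0, q, ifelse(q - tmp > 0, q - tmp, 0)),
  pmin_pmax = pmin(q, pmax(q - tmp, 0))
)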
Explanation of mod1
In function impact_calc() some code lines can be modified to make use of the data.table syntax.
Thus,
best_buy <- data[, .SD,.SDcols = c("price", "side", "level")][side == "buy" & level == 1][1, "price"][[1]]
best_sell <- data[, .SD,.SDcols = c("price", "side", "level")][side == "sell" & level == 1][1, "price"][[1]]
become
best_buy <- data[side == "buy" & level == 1, first(price)]
best_sell <- data[side == "sell" & level == 1, first(price)]
and
data_buy <- data[, ,][side == "buy"]
levels <- data_buy[, ,][side == "buy"][, level]
become
data_buy <- data[side == "buy"]
levels <- data[side == "buy", level]
I was quite surprised to learn that these modifications outside of the for loop already gained a substantial speed increase.
I'm a beginning R programmer. I'm having trouble with a loop calculation that uses a previous value, like a recursion.
An example of my data:
dt <- data.table(a = c(0:4), b = c( 0, 1, 2, 1, 3))
The calculated value y is given by y[n] = (y[n-1] + b[n]) * a[n]. The initial value is 0 (y[1] = 0).
I used a for loop; the code and result are as below.
dt$y <- 0
for (i in 2:nrow(dt)) {
dt$y[i] <- (dt$y[i - 1] + dt$b[i]) * dt$a[i]
}
a b y
1: 0 0 0
2: 1 1 1
3: 2 2 6
4: 3 1 21
5: 4 3 96
This result is what I want. However, my data has over 1,000,000 rows and several columns, so I'm trying to find other ways that avoid a for loop. I tried to use Reduce(), but it only works with a single vector (e.g. y[n] = y[n-1] + b[n]). As shown above, my function uses two vectors, a and b, so I can't find a solution.
Is there a more efficient way to be faster without using a for loop, such as using a recursive function or any good package functions?
This kind of computation cannot make use of R's advantage of vectorization because of the iterative dependencies. But the slow-down appears to really be coming from indexing performance on a data.frame or data.table.
Interestingly, I was able to speed up the loop considerably by accessing a, b, and y directly as numeric vectors (1000+ fold advantage for 2*10^5 rows) or as matrix "columns" (100+ fold advantage for 2*10^5 rows) versus as columns in a data.table or data.frame.
This old discussion may still shed some light on this rather surprising result: https://stat.ethz.ch/pipermail/r-help/2011-July/282666.html
Please note that I also made a different toy data.frame, so I could test a larger example without returning Inf as y grew with i:
Option data.frame (numeric vectors embedded in a data.frame or data.table per your example):
vec_length <- 200000
dt <- data.frame(a=seq(from=0, to=1, length.out = vec_length), b=seq(from=0, to=-1, length.out = vec_length), y=0)
system.time(for (i in 2:nrow(dt)) {
dt$y[i] <- (dt$y[i - 1] + dt$b[i]) * dt$a[i]
})
#user system elapsed
#79.39 146.30 225.78
#NOTE: Sorry, I didn't have the patience to let the data.table version finish for vec_length=2*10^5.
tail(dt$y)
#[1] -554.1953 -555.1842 -556.1758 -557.1702 -558.1674 -559.1674
Option vector (numeric vectors extracted in advance of loop):
vec_length <- 200000
dt <- data.frame(a=seq(from=0, to=1, length.out = vec_length), b=seq(from=0, to=-1, length.out = vec_length), y=0)
y <- as.numeric(dt$y)
a <- as.numeric(dt$a)
b <- as.numeric(dt$b)
system.time(for (i in 2:length(y)) {
y[i] <- (y[i - 1] + b[i]) * a[i]
})
#user system elapsed
#0.03 0.00 0.03
tail(y)
#[1] -554.1953 -555.1842 -556.1758 -557.1702 -558.1674 -559.1674
Option matrix (data.frame converted to matrix before loop):
vec_length <- 200000
dt <- as.matrix(data.frame(a=seq(from=0, to=1, length.out = vec_length), b=seq(from=0, to=-1, length.out = vec_length), y=0))
system.time(for (i in 2:nrow(dt)) {
dt[i, 3] <- (dt[i - 1, 3] + dt[i, 2]) * dt[i, 1]  # write the result into the y column (column 3)
})
#user system elapsed
#0.67 0.01 0.69
tail(dt[,3])
#[1] -554.1953 -555.1842 -556.1758 -557.1702 -558.1674 -559.1674
#NOTE: a matrix is actually a vector but with an additional attribute (its "dim") that says how the "matrix" should be organized into rows and columns
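A minimal illustration of that note:
v <- 1:6
dim(v) <- c(2, 3)  # adding a 'dim' attribute turns the vector into a matrix
is.matrix(v)
#[1] TRUE
attributes(v)
#$dim
#[1] 2 3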
Option data.frame with matrix style indexing:
vec_length <- 200000
dt <- data.frame(a=seq(from=0, to=1, length.out = vec_length), b=seq(from=0, to=-1, length.out = vec_length), y=0)
system.time(for (i in 2:nrow(dt)) {
dt[i, 3] <- (dt[(i - 1), 3] + dt[i, 2]) * dt[i, 1]
})
#user system elapsed
#110.69 0.03 112.01
tail(dt[,3])
#[1] -554.1953 -555.1842 -556.1758 -557.1702 -558.1674 -559.1674
An option is to use Rcpp, since this recursive equation is easy to code in C++:
library(Rcpp)
cppFunction("
NumericVector func(NumericVector b, NumericVector a) {
int len = b.size();
NumericVector y(len);
for (int i = 1; i < len; i++) {
y[i] = (y[i-1] + b[i]) * a[i];
}
return(y);
}
")
func(c( 0, 1, 2, 1, 3), c(0:4))
#[1] 0 1 6 21 96
timing code:
vec_length <- 1e7
dt <- data.frame(a=1:vec_length, b=1:vec_length, y=0)
y <- as.numeric(dt$y)
a <- as.numeric(dt$a)
b <- as.numeric(dt$b)
system.time(for (i in 2:length(y)) {
y[i] <- (y[i - 1] + b[i]) * a[i]
})
# user system elapsed
# 19.22 0.06 19.44
system.time(func(b, a))
# user system elapsed
# 0.09 0.02 0.09
Here is a base R solution.
According to the information from #ThetaFC, an indication for speedup is to use a matrix or vector (rather than a data.frame or data.table). Thus, it is better to have the following preprocessing before calculating df$y, i.e.,
a <- as.numeric(df$a)
b <- as.numeric(df$b)
Then, you have two approaches to get df$y:
writing your customized recursion function
f <- function(k) {
if (k == 1) return(0)
c(f(k-1),(tail(f(k-1),1) + b[k])*a[k])
}
df$y <- f(nrow(df))
Or a non-recursive function (I guess this will be much faster than the recursive approach):
g <- Vectorize(function(k) sum(rev(cumprod(rev(a[2:k])))*b[2:k]))
df$y <- g(seq(nrow(df)))
such that
> df
a b y
1 0 0 0
2 1 1 1
3 2 2 6
4 3 1 21
5 4 3 96
I don't think this will be any faster, but here's one way to do it without an explicit loop
dt[, y := purrr::accumulate2(a, b, function(last, a, b) (last + b)*a
, .init = 0)[-1]]
dt
# a b y
# 1: 0 0 0
# 2: 1 1 1
# 3: 2 2 6
# 4: 3 1 21
# 5: 4 3 96
Given a multidimensional array, e.g. a zoo object z, with columns a, b, c, x. Given further a function W(w = c(1,1,1), x) which, for example, weights every column individually, but which also DEPENDS on the specific row value in column x. How can row operations, e.g. calculating the rowWeightedMeans, be done efficiently here?
It is known that R::zoo is very fast and efficient for row operations, if the function is very simple, e.g.:
W <- function(w) { return(w); }
z[,"wmean"] <- rowWeightedMeans(z[,1:3], w=W(c(0.1,0.5,0.3)))
But what if W() depends on a value in that row? E.g.:
W <- function(w, x) { return(w*x); }
z[,"wmean"] <- rowWeightedMeans(z[,1:3], w=W(c(0.1,0.5,0.3), z[,4]))
R complains here because it does not know how to handle the multi-dimensional arguments in the nested function.
The solution could be a for(i in 1:nrow(z)) loop, computing the values individually for every row i. However, for large data sets this takes an enormous amount of extra computational effort and time.
EDIT
Ok guys, thanks for your time and critiques. I tried and tested all your answers but must admit that the actual problem was not solved or understood. For example, I hadn't asked to rewrite my weight function or calculations, because I had already presented a minimal version of much more complex calculations. The issue here lies much deeper. So I sat back, tried to boil the problem down to the root of the evil, and found a minimal working example for you without any zoo objects, weighted means, and so on. Here you go:
z <- data.frame(matrix (1:20, nrow = 4))
colnames (z) <- c ("a", "b", "c", "x", "y")
z
# a b c x y
#1 1 5 9 13 17
#2 2 6 10 14 18
#3 3 7 11 15 19
#4 4 8 12 16 20
W <- function(abc, w, p) {
ifelse (w[1] == p, return(length(p)), return(0))
# Please do not complain! I know this is stupid, but it is an MWE
# and my calculations contained in W() are much more complex!
}
z[,"y"] <- W(z[,1:3], c(14,7,8), z[,"x"])
# same result: z[,"y"] <- apply(z[,1:3], 1, W, c(14,7,8), z[,"x"])
z
# a b c x y
#1 1 5 9 13 4
#2 2 6 10 14 4
#3 3 7 11 15 4
#4 4 8 12 16 4
# expected outcome:
# a b c x y
#1 1 5 9 13 0
#2 2 6 10 14 4
#3 3 7 11 15 0
#4 4 8 12 16 0
The problem I am facing is that R passes all rows of z[,"x"] to the function; however, I expect it to take only the row that corresponds to the row of z[,"y"] currently being processed as R loops through it. In this example, I expect 14 == 14 only in row number 2!
So: how do I tell R to pass the data to functions row by row?
SOLUTION
Besides the awarded and accepted answer, I like to summarize the solution here to improve clarity and provide a better overview about the discussion.
This question was not about rewriting the specific function W (e.g. the weighting). It was only about the inability of R to pass multiple row-by-row arguments to a general function. With either z$y <- f(z$a, z$x) or z$y <- apply(z$a, 1, f, z$x), only the first argument is passed row by row, while the second argument is passed as a complete column with all rows. It seems that this is an intrinsic behaviour of R that we need to work around.
To solve this, the whole row needs to be passed as a single argument to a wrapper function, which in turn then applies the specific calculations on that row. Solution for the problem with the weights:
f <- function(x) weighted.mean(x[1:3], W(c(0.1,0.5,0.3), x[4]))
z[,"wmean"] <- apply(z[,1:4], 1, f)
Solution for the general problem with the data frame:
f <- function(x) W(x[1:3], c(14,7,8), x[4])
z$y <- apply(z, 1, f)
Brian also presents even faster methods using compiled C++ code (via Rcpp) in his accepted answer. Thanks to #BrianAlbertMonroe, #jaimedash and #inscaven for dealing with the poorly clarified question and for pointing towards this solution.
I haven't really worked with zoo or rowWeightedMeans, but if you simply want to apply weights to the row elements before taking their mean, with the weights depending on one of the elements of the row:
z <- matrix(rnorm(100),ncol=4)
W <- function(row, weights){
weights <- weights * row[4]
row2 <- row[1:3] * weights
sum(row2) / sum(weights)
}
w.means <- apply(z, 1, W, weights = c(0.1, 0.5, 0.3))
If the above gives the correct answer but you're worried about speed, write the W function in Rcpp or use the built-in cmpfun() from the compiler package:
N <- 10000
z <- matrix(rnorm(N),ncol=4)
# Interpreted R function
W1 <- function(row, weights){
weights <- weights * row[4]
row2 <- row[1:3] * weights
mean(row2)
}
# Compiled R function
W2 <- compiler::cmpfun(W1)
# C++ function imported into R via Rcpp
Rcpp::cppFunction('double Wcpp(NumericVector row, NumericVector weights){
int x = row.size() ;
NumericVector wrow(x - 1);
NumericVector nweights(x - 1);
nweights = weights * row[x - 1];
for( int i = 0; i < (x-1) ; i++){
wrow[i] = row[i] * nweights[i];
}
double res = sum(wrow) / sum(nweights);
return(res);
}')
w.means0 <- apply(z,1,W,weights=c(0.1,0.5,0.3))
w.means1 <- apply(z,1,W2,weights=c(0.1,0.5,0.3))
w.means2 <- apply(z,1,Wcpp,weights=c(0.1,0.5,0.3))
identical( w.means0, w.means1, w.means2 )
#[1] TRUE
Or
# Write the whole thing in C++
Rcpp::cppFunction('NumericVector WM(NumericMatrix z , NumericVector weights){
int x = z.ncol() ;
int y = z.nrow() ;
NumericVector res(y);
NumericVector wrow(x - 1);
NumericVector nweights(x - 1);
double nwsum;
double mult;
for( int row = 0 ; row < y ; row++){
mult = z(row,x-1);
nweights = weights * mult;
nwsum = sum(nweights);
for( int i = 0; i < (x-1) ; i++){
wrow[i] = z(row,i) * nweights[i] ;
}
res[row] = sum(wrow) / nwsum;
}
return(res);
}')
microbenchmark::microbenchmark(
w.means0 <- apply(z,1,W1,weights=c(0.1,0.5,0.3)),
w.means1 <- apply(z,1,W2,weights=c(0.1,0.5,0.3)),
w.means2 <- apply(z,1,Wcpp,weights=c(0.1,0.5,0.3)),
w.means3 <- WM(z = z, weights = c(0.1, 0.5, 0.3))
)
Unit: microseconds
expr min lq mean median uq max neval
w.means0 <- apply(z, 1, W1, weights = c(0.1, 0.5, 0.3)) 12114.834 12536.9330 12995.1722 12838.2805 13163.4835 15796.403 100
w.means1 <- apply(z, 1, W2, weights = c(0.1, 0.5, 0.3)) 9941.571 10286.8085 10769.7330 10410.9465 10788.6800 19526.840 100
w.means2 <- apply(z, 1, Wcpp, weights = c(0.1, 0.5, 0.3)) 10919.112 11631.5530 12849.7294 13262.9705 13707.7465 17438.524 100
w.means3 <- WM(z = z, weights = c(0.1, 0.5, 0.3)) 94.172 107.9855 146.2606 125.0075 140.2695 2089.933 100
EDIT:
Incorporating the weighted.mean() function slows down the computation dramatically, and according to the help file it does not handle missing values specially, so you will still need to write code to manage them.
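If missing values do need to be handled, one possibility (a sketch, not part of the benchmark below) is to drop NA elements and their weights inside the custom function:
# hypothetical variant of W that drops NA values and their corresponding weights
W_na <- function(row, weights){
  weights <- weights * row[4]
  vals <- row[1:3]
  keep <- !is.na(vals) & !is.na(weights)
  sum(vals[keep] * weights[keep]) / sum(weights[keep])
}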
> z <- matrix(rnorm(100),ncol=4)
> W <- function(row, weights){
+ weights <- weights * row[4]
+ row2 <- row[1:3] * weights
+ sum(row2) / sum(weights)
+
+ }
> W1 <- compiler::cmpfun(W)
> W2 <- function(row, weights){
+ weights <- weights * row[4]
+ weighted.mean(row[1:3],weights)
+ }
> W3 <- compiler::cmpfun(W2)
> w.means1 <- apply(z, 1, W, weights = c(0.1, 0.5, 0.3))
> w.means2 <- apply(z, 1, W2, weights = c(0.1, 0.5, 0.3))
> identical(w.means1,w.means2)
[1] TRUE
> microbenchmark(
+ w.means1 <- apply(z, 1, W, weights = c(0.1, 0.5, 0.3)),
+ w.means1 <- apply(z, 1, W1, weights = c(0.1, 0.5, 0.3)),
+ w.means2 < .... [TRUNCATED]
Unit: microseconds
expr min lq mean median uq max neval
w.means1 <- apply(z, 1, W, weights = c(0.1, 0.5, 0.3)) 145.315 167.4550 172.8163 172.9120 180.6920 194.673 100
w.means1 <- apply(z, 1, W1, weights = c(0.1, 0.5, 0.3)) 124.087 134.3365 143.6803 137.8925 148.7145 225.459 100
w.means2 <- apply(z, 1, W2, weights = c(0.1, 0.5, 0.3)) 307.311 346.6320 356.4845 354.7325 371.7620 412.110 100
w.means2 <- apply(z, 1, W3, weights = c(0.1, 0.5, 0.3)) 280.073 308.7110 323.0156 324.1230 333.7305 407.963 100
Here's a solution with zoo::rollapply. It produces the same answer as matrixStats::rowWeightedMeans for the simpler case.
if(! require(matrixStats)) {
install.packages('matrixStats')
library(matrixStats)
}
if(! require(zoo)) {
install.packages('zoo')
library(zoo)
}
z <- zoo (matrix (1:20, nrow = 5))
colnames (z) <- c ("a", "b", "c", "x")
z$x <- 0 # so we can see an effect below...
z
## a b c x
## 1 1 6 11 0
## 2 2 7 12 0
## 3 3 8 13 0
## 4 4 9 14 0
## 5 5 10 15 0
weights <- c(0.1,0.5,0.3)
W <- function (w) { return(w); }
z$wmean <- rowWeightedMeans(z[,1:3], w=W(weights))
## z[,new]<- doesn't work to create new columns in zoo
## objects
## use $
rowWeightMean_zoo <- function (r, W, weights) {
s <- sum(W(weights))
return(sum(r[1:3] * W(weights) / s))
}
z$wmean_zoo <- rollapply(z, width=1, by.column=FALSE,
function (r) rowWeightMean_zoo(r, W, weights))
z
For the requirement in the question, that the return value be dependent on some ancillary data in the row, rowWeightedMeans doesn't work. But, the function passed to rollapply can be modified to use other elements of the row.
W2 <- function (w, x) { return(w * x); }
# z$wmean2 <- rowWeightedMeans(z[,1:3], w=W2(c(0.1,0.5,0.3), z[,4]))
## doesn't work
## Error in rowWeightedMeans(z[, 1:3], w = W#(c(0.1, 0.5, 0.3), z[, 4])) :
## The length of argument 'w' is does not match the number of column in 'x': 5 != 3
## In addition: Warning message:
## In `*.default`(w, x) :
## longer object length is not a multiple of shorter object length
## Calls: rowWeightedMeans -> W -> Ops.zoo -> NextMethod
rowWeightMean_zoo_dependent <- function (r, W, weights) {
s <- sum(W(weights, r[4]))
return(sum(r[1:3] * W2(weights, r[4]) / s))
}
z$wmean2_zoo <- rollapply(z, width=1, by.column=FALSE,
function (r) rowWeightMean_zoo_dependent(r, W2, weights))
z
## a b c x wmean wmean_zoo wmean2_zoo
## 1 1 6 11 0 7.111111 7.111111 NaN
## 2 2 7 12 0 8.111111 8.111111 NaN
## 3 3 8 13 0 9.111111 9.111111 NaN
## 4 4 9 14 0 10.111111 10.111111 NaN
## 5 5 10 15 0 11.111111 11.111111 NaN
I think this can be solved by clever reshaping. I would use dplyr for that - but the workflow should work similarly with plyr or data.table - all these packages are heavily optimized.
For this example I assume the weight function is w(x) = w0 ^ x.
Here I create some sample data z, and generic weights w (note I add a row number r to z):
library(dplyr)
library(tidyr)
N <- 10
z <- data.frame(r=1:N, a=rnorm(N), b=rnorm(N), c=rnorm(N), x=rpois(N, 5))
w <- data.frame(key=c('a','b','c'), weight=c(0.1,0.5,0.3))
Now the calculation would be:
res <- z %>% gather(key,value,-r,-x) %>% # convert to long format, but keep row numbers and x
left_join(w, 'key') %>% # add generic weights
mutate(eff_weight = weight^x) %>% # calculate effective weights
group_by(r) %>% # group by the original rows for the weighted mean
summarise(ws = sum(value*eff_weight), ww=sum(eff_weight)) %>% # calculate two helper values
mutate(weighted_mean = ws/ww) %>% # effectively calculate the weighted mean
select(r, weighted_mean) # remove unneccesary output
left_join(z, res) # add to the original data
I added some notes - but if you have trouble understanding, you can evaluate res stepwise (remove the tail of the pipeline, including the %>%) and have a look at the intermediate results.
Update
I took up the challenge and found a way to do the same in base R:
N <- 10
z <- data.frame(a=rnorm(N), b=rnorm(N), c=rnorm(N), x=rpois(N, 5))
w <- data.frame(key=c('a','b','c'), weight=c(0.1,0.5,0.3))
long.z <- reshape(z, idvar = "row", times=c('a','b','c'),
timevar='key',
varying = list(c('a','b','c')), direction = "long")
compose.z <- merge(long.z,w, by='key')
compose.z2 <- within(compose.z, eff.weight <- weight^x)
sum.stat <- by(compose.z2, compose.z2$row, function(x) {sum(x$a * x$eff.weight )/sum(x$eff.weight)})
nice.data <- c(sum.stat)
It requires a somewhat more verbose function, but the same pattern can be applied.
I wrote the following code, and I need to repeat it 100 times. I know I need to use another for loop, but I don't know how to do it. Here is the code:
mean <- c(5,5,10,10,5,5,5)
x <- NULL
u <- NULL
delta1 <- NULL
w1 <- NULL
for (i in 1:7 ) {
x[i] <- rexp(1, rate = mean[i])
u[i] <- (1/1.2)*runif(1, min=0, max=1)
y1 <- min(x,u)
if (y1 == min(x)) {
delta1 <- 1
}
else {
delta1 <- 0
}
if (delta1 == 0)
{
w1 <- NULL
}
else {
if(y1== x[[1]])
{
w1 <- "x1"
}
}
}
output <- cbind(delta1,w1)
output
I want the final output to be a 100-row by 3-column matrix containing the run number, delta1, and w1.
Any thoughts will be truly appreciated.
Here's what I gather you're trying to achieve from your code:
Given two vectors drawn from different distributions (Exponential and Uniform)
Find out which distribution the smallest number comes from
Repeat this 100 times.
There are a couple of problems with your code if you want to achieve this, so here's a cleaned-up example:
rates <- c(5, 5, 10, 10, 5, 5, 5) # 'mean' is an inbuilt function
# Initialise the output data frame:
output <- data.frame(number=rep(0, 100), delta1=rep(1, 100), w1=rep("x1", 100))
for (i in 1:100) {
# Generating u doesn't require a for loop. Additionally, the (1/1.2)
# factor can be folded into 'max'.
u <- runif(7, min=0, max=5/6)
# Generating x doesn't need a loop either. It's better to use apply functions
# when you can!
x <- sapply(rates, function(x) { rexp(1, rate=x) })
y1 <- min(x, u)
# Now we can store the output
output[i, "number"] <- y1
# Two things here:
# 1) use all.equal instead of == to compare floating point numbers
# 2) We initialised the data frame to assume they always came from x.
# So we only need to overwrite it where it comes from u.
if (isTRUE(all.equal(y1, min(u)))) {
output[i, "delta1"] <- 0
output[i, "w1"] <- NA # Can't use NULL in a character vector.
}
}
output
Here's an alternative, more efficient approach with replicate:
Mean <- c(5, 5, 10, 10, 5, 5, 5)
n <- 100 # number of runs
res <- t(replicate(n, {
x <- rexp(n = length(Mean), rate = Mean)
u <- runif(n = length(Mean), min = 0, max = 1/1.2)
mx <- min(x)
delta1 <- mx <= min(u)
w1 <- delta1 & mx == x[1]
c(delta1, w1)
}))
output <- data.frame(run = seq.int(n), delta1 = as.integer(res[ , 1]),
w1 = c(NA, "x1")[res[ , 2] + 1])
The result:
head(output)
# run delta1 w1
# 1 1 1 <NA>
# 2 2 1 <NA>
# 3 3 1 <NA>
# 4 4 1 x1
# 5 5 1 <NA>
# 6 6 0 <NA>