Optimise R loop with a data.table passed into a recursive function

I'm trying to run a simulation that runs n_tests for n_products over n_years to estimate an increase in demand and the subsequent increase in stored pallets (a linear relation between demand and stored products is assumed). To make things a bit spicier, the demand comes from 2 separate regions (A and B), but the products are stored in one warehouse.
What I've done currently works, but it is slow: 10 years, 200 tests and 25,000 products take about 10 seconds to run.
The setup:
library(data.table)
n_products <- 25000
n_years <- 10
n_tests <- 200
pct_error <- 2
A_fcst <- runif(n_years, min = 1, max = 8)
B_fcst <- runif(n_years, min = 3, max = 6)
Populate initial DT and matrices:
yearly_demand_A <- matrix(0, n_years, n_tests)
yearly_demand_B <- matrix(0, n_years, n_tests)
for (i in 1:n_years){
  yearly_demand_A[i,] <- rnorm(n_tests, A_fcst[i], pct_error*sqrt(i))
  yearly_demand_B[i,] <- rnorm(n_tests, B_fcst[i], pct_error*sqrt(i))
}
yearly_pallets <- matrix(0, n_years, n_tests)
demand_x_pallets <- data.table(
  prod_code  = 1:n_products,
  stock_qty  = as.integer(runif(n_products, 1, 100)),
  pallet_qty = as.integer(runif(n_products, 10, 30)),
  demand_A   = runif(n_products, 1, 40),
  demand_B   = runif(n_products, 1, 40)
)
demand_x_pallets[,pallets := ceiling(stock_qty/pallet_qty)]
demand_x_pallets[,demand := demand_A + demand_B]
for (i in 1:n_tests){
  yearly_pallets[1:n_years, i] <- number_of_pallets(yearly_demand_A[1:n_years, i], yearly_demand_B[1:n_years, i], demand_x_pallets)
}
And the function itself:
number_of_pallets <- function(fcst_A, fcst_B, d_x_p, year = 0){
  pallets <- vector("double", n_years)
  new_profile <- copy(d_x_p) # if I don't create a copy, the same DT is passed and the number of pallets compounds
  if (year == 0){ # if the function is called without a year argument, call it recursively for each year
    for(i in 1:(n_years)){
      new_profile <- number_of_pallets(fcst_A[[i]], fcst_B[[i]], new_profile, i)
      pallets[i] <- new_profile[, sum(pallets)]
    }
  } else { # calculate demand and pallet count for each product for this year
    d_x_p[, demand_A := demand_A * (100 + fcst_A) / 100]
    d_x_p[, demand_B := demand_B * (100 + fcst_B) / 100]
    d_x_p[, new_Dmnd := demand_A + demand_B]
    d_x_p[, Dmnd_change := ifelse(demand == 0, 1, new_Dmnd / demand)]
    d_x_p[, stock_qty := stock_qty * Dmnd_change]
    d_x_p[, pallets := ceiling(stock_qty / pallet_qty)]
    d_x_p[, demand := new_Dmnd]
    return(d_x_p)
  }
  return(pallets)
}
Initially I thought that copying the DT might be what makes it slow, but removing that line from the function, apart from making it not work properly, doesn't make any difference. This is the best I've come up with so far after a few miserable failures, and I'm completely stuck now.
Any pointers on how to tackle it differently would be greatly appreciated.
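One direction I've been toying with (a rough, untested sketch, assuming demand is never zero so the ifelse branch never triggers): because stock_qty is only ever scaled by the ratio of new to old total demand, the stock in year i telescopes to the initial stock times the cumulative demand growth, so the recursion could in principle be replaced by cumulative products:
gA <- apply(1 + yearly_demand_A / 100, 2, cumprod) # n_years x n_tests growth factors, region A
gB <- apply(1 + yearly_demand_B / 100, 2, cumprod) # n_years x n_tests growth factors, region B
dA0 <- demand_x_pallets$demand_A
dB0 <- demand_x_pallets$demand_B
s0 <- demand_x_pallets$stock_qty
pq <- demand_x_pallets$pallet_qty
d0 <- dA0 + dB0
for (j in 1:n_tests) {
  # n_products x n_years matrix of projected stock for test j
  scaled <- s0 * (outer(dA0, gA[, j]) + outer(dB0, gB[, j])) / d0
  yearly_pallets[, j] <- colSums(ceiling(scaled / pq))
}
I have no idea yet whether this is actually faster on the full problem, but it avoids the per-year data.table updates entirely.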

Related

Saving recursive function results to a global data frame in R

I'm trying to recreate the functionality of the memoise package in base R by saving the outputs of a recursive function in a data frame. I have this function "P" and then I made this "metaP" wrapper that will run P(n) if metaP(n) hasn't been run before and then save the results of P(n), or it produces the previously saved output. My issue is it only works at the first level. If I run metaP(5) it will save the output of metaP(5), but in order to get P(5) it also had to calculate P(4) and the results of P(4) aren't getting saved. I'm assuming it's getting lost in the recursive environments, but when I tried using the assign function and setting it to the global environment it still didn't work.
In the example below, I run metaP 5 through 10, and df has 5 through 10 saved, but it doesn't have 1 through 5 saved, some of which must have been calculated to come up with the answers of 5 through 10.
df <- data.frame(n = 0, pn = 1)
metaP <- function(n) {
  if (!n %in% df$n) df <<- rbind(df, data.frame(n = n, pn = P(n)))
  df[df$n == n, "pn"]
}
P <- function(n) {
  if (n < 0) return(0)
  k <- rep(1:((sqrt(24 * n + 1) + 1) / 6), each = 2) * c(1, -1)
  return(sum((-1) ^ (k + 1) * sapply(n - k * (3 * k - 1) / 2, metaP)) %% 1e6)
}
sapply(5:10, metaP)
df
The issue here is kind of subtle. The expression
df <<- rbind(df, data.frame(n = n, pn = P(n)))
is ambiguous, because the ?rbind documentation doesn't define the order in which the two arguments to rbind() are evaluated. It appears that R is evaluating df, then doing the recursive call, then appending that result to the saved value of df. Any changes to the global variable that happened during the recursive call are lost.
To fix this, rewrite the conditional part as
if (!n %in% df$n) {
  newval <- data.frame(n = n, pn = P(n))
  df <<- rbind(df, newval)
}
(I'd also suggest adding parens to the test, and writing it as if (!(n %in% df$n)), because it's not immediately obvious that these are the same. I was confused about this in an earlier answer to this question. But checking ?Syntax shows that %in% has higher precedence than !.)
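A quick check of that precedence point:
!5 %in% c(1, 2)   # TRUE, so it is parsed as !(5 %in% c(1, 2))
!(5 %in% c(1, 2)) # TRUE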

Assigning value to dataframe in R - for loop speed

I have the following code:
n <- 1e6
no_clm <- rpois(n,30)
hold <- data.frame("x" = double(n))
c = 1
for (i in no_clm){
  ctl <- sum(rgamma(i, 30000) - 2000)
  hold[c, 1] <- ctl
  #hold <- rbind(hold, df)
  c = c + 1
}
Unfortunately this code is quite slow. I've narrowed the bottleneck down to hold[c,1] <- ctl; if I remove that line, the code runs almost instantly.
How can I make this efficient? I need to store the results in some sort of dataframe or list quickly. In reality the actual code is more complex than this, but the slow point is the assignment.
Note that the above is just an example, in reality I have multiple calculations on the rgamma samples and each of these calculations are then stored in a large dataframe.
Try this
hold = data.frame(sapply(no_clm, function(x){
  return(sum(rgamma(x, 30000) - 2000))
}))
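If you prefer to keep an explicit loop (since the real code does several calculations per draw), a minimal sketch along the same lines is to preallocate a plain numeric vector and only build the data frame at the end; indexing hold[c, 1] inside the loop is what makes the original version slow:
res <- numeric(n) # preallocated plain vector, cheap to assign into
for (c in seq_along(no_clm)) {
  res[c] <- sum(rgamma(no_clm[c], 30000) - 2000)
}
hold <- data.frame(x = res)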
It looks like you can just use one call to rgamma, as you are iterating over the number of observations parameter.
So if you do one call and then split the vector into the lengths required (no_clm), you can then just iterate over that list and sum:
n <- 1e6
no_clm <- rpois(n, 30)
hold <- data.frame("x" = double(n))
# total observations to use for rgamma
total_clm <- sum(no_clm)
# get values
gammas <- rgamma(total_clm, 30000) - 2000
# split into list of lengths dictated by no_clm
hold$x <- sapply(split(gammas, cumsum(sequence(no_clm) == 1)), sum)
This took 5.919892 seconds
Move the calculation into an sapply() loop instead of a for loop, and then notice that 2000 * no_clm can be moved outside the loop (to minimise the number of function calls).
n <- 1e6
no_clm <- rpois(n, 30)
hold <- data.frame(x = sapply(no_clm, function(i) sum(rgamma(i, 30000))) - 2000 * no_clm)
You may observe a speed pickup using data.table:
library(data.table)
dt = data.table(no_clm)
dt[, hold := sapply(no_clm, function(x) sum(rgamma(x, 30000) - 2000))]

Optimising a calculation on every cumulative subset of a vector in R

I have a collection of DNA sequencing reads of various lengths, sorted from longest to shortest. I would like to know the largest number of reads I can include in a set such that the N50 of that set is above some threshold t.
For any given set of reads, the total amount of data is just the cumulative sum of the lengths of the reads. The N50 is defined as the length of the read such that half of the data are contained in reads at least that long.
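For a vector of read lengths sorted in decreasing order, that definition boils down to a single line, e.g.:
l <- c(50, 40, 30, 20, 10) # toy read lengths, longest first
l[min(which(cumsum(l) > sum(l) / 2))] # N50 = 40: half the data (75) is first reached at the second read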
I have a solution below, but it is slow for very large read sets. I tried vectorising it, but this was slower (probably because my threshold is usually relatively large, such that my solution below stops calculating fairly early on).
Here's a worked example:
df = data.frame(l = 100:1) # read lengths
df$cs = cumsum(df$l) # getting the cumulative sum is easy and quick
t = 95 # let's imagine that this is my threshold N50
for(i in 1:nrow(df)){
  N50 = df$l[min(which(df$cs > df$cs[i]/2))]
  if(N50 < t){ break }
}
# the loop will have gone one too far, so I subtract one
number.of.reads = as.integer(i-1)
This works fine on small datasets, but my actual data are more like 5m reads ranging in length from ~200,000 down to 1 (longer reads are rarer), and I'm interested in an N50 of 100,000, so it gets pretty slow.
This example is closer to something that's realistic. It takes ~15s on my desktop.
l = ceiling(runif(100000, min = 0, max = 19999))
l = sort(l, decreasing = T)
df = data.frame(l = l)
df$cs = cumsum(df$l)
t = 18000
for(i in 1:nrow(df)){
  n = df$l[min(which(df$cs > df$cs[i]/2))]
  if(n < t){ break }
}
result = as.integer(i-1)
So, I'm interested in any ideas, tips, or tricks to noticeably optimise this. It seems like this should be possible, but I'm out of ideas.
As n is decreasing with i, you should use a binary search algorithm.
binSearch <- function(min, max) {
  print(mid <- floor(mean(c(min, max))))
  if (mid == min) {
    if (df$l[min(which(df$cs > df$cs[min]/2))] < t) {
      return(min - 1)
    } else {
      return(max - 1)
    }
  }
  n = df$l[min(which(df$cs > df$cs[mid]/2))]
  if (n >= t) {
    return(binSearch(mid, max))
  } else {
    return(binSearch(min, mid))
  }
}
Then, just call
binSearch(1, nrow(df))
Since your data are ordered by DNA/read length, maybe you could avoid testing every single row. Instead, you can iterate and test a limited number of reasonably spaced rows at each iteration (using while(), for example), and so get progressively closer to your solution. This should make things much faster. Just make sure that once you get close to the solution, you stop iterating.
This is your solution
set.seed(111)
l = ceiling(runif(100000, min = 0, max = 19999))
l = sort(l, decreasing = T)
df = data.frame(l = l)
df$cs = cumsum(df$l)
t = 18000
for(i in 1:nrow(df)){
  n = df$l[min(which(df$cs > df$cs[i]/2))]
  if(n < t){ break }
}
result = as.integer(i-1)
result
# 21216, in ~29 seconds
Instead of testing every row, let's set a range
i1 <- 1
i2 <- nrow(df)
i.range <- as.integer(seq(i1, i2, length.out = 10))
Now, test only these 10 rows. Get the closest one and "focus in" by re-defining the range. Stop when you cannot increase granularity.
while(sum(duplicated(i.range)) == 0){
  for(i in 1:length(i.range)){
    N50 = df$l[min(which(df$cs > df$cs[i.range[i]]/2))]
    if(N50 < t){ break }
  }
  # update i1 and i2
  i1 <- i.range[(i-1)]
  i2 <- i.range[i]
  i.range <- as.integer(seq(i1, i2, length.out = 10))
}
i.range <- seq(i1, i2, by=1)
for(i in i.range){
  N50 = df$l[min(which(df$cs > df$cs[i]/2))]
  if(N50 < t){ break }
}
result <- as.integer(i-1)
result
#21216, in ~ 0.06 seconds
Same result in a fraction of the time.

Simplify loop computation

Consider the following vector x:
> 1:9
[1] 1 2 3 4 5 6 7 8 9
and consider the following inputs:
start = 10
pmt = 2
This is the result (let's call the resulting vector res) I am looking to achieve (what's displayed are the actual formulas). Note that the result is a vector, not a dataframe; I just displayed it here in 2 dimensions.
In other words, to obtain res, you multiply start by the cumulative product for each cell of df up to the corresponding cell.
When the vector index is 4 or 7, the start value gets updated.
This is what I have attempted:
for(i in 1:9) {
  res[i] = start * cumprod(df[k:i])[i]
  if(i %% 3 == 0) {
    start = res[i] - pmt
    k = k + 3
  } else {
    start = res[i]
  }
}
To put the problem into context, imagine you have a starting value of 10 dollars, and you want to invest it over 9 months. However, you want to make a withdrawal at the end of each 3 months (i.e. at the beginning of month 4, month 7, ...). The vector x represents random values of returns.
Therefore, at the beginning of month 4, your start value is start*1*2*3 minus the withdrawal pmt.
The purpose here is computing the wealth value at the end of month 9.
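For example, with start = 10, pmt = 2 and x = 1:9, the end-of-month-9 wealth on that description (withdrawals taken after months 3 and 6) works out as:
((10 * 1*2*3 - 2) * 4*5*6 - 2) * 7*8*9
# = (58 * 120 - 2) * 504 = 6958 * 504 = 3506832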
The problem is that in reality, i = 200 (200 months), and I need to redo this computation for 10,000 different vectors x. So looping 10,000 times over the above code takes forever to execute!
Would you have any suggestion as to how to compute this more efficiently? I hope the explanation is not too confusing!
Thank you!
If you work out your formula for res as an iterative formula, then it is easier to write a function that you can give to Reduce. Here it is as a simple loop
x <- 1:9
start <- 10
pmt <- 2
res <- numeric(length(x))
res[1] <- x[1] * start
for (i in seq_along(x)[-1]) {
  res[i] <- (res[i-1] - (pmt * (!(i%%4) || !(i%%7)))) * x[i]
}
If you want to write it as a Reduce function, it would look like this
Reduce(function(r, i) {
         (r - (pmt * (!(i%%4) || !(i%%7)))) * x[i]
       },
       seq_along(x),
       init = start,
       accumulate = TRUE)[-1]
There is some weirdness with the start values and dropping the first element of the result because of the way that initial values are handled (and because the iteration is over indexes, not values, since the comparisons must be done on the index). The loop here is probably more understandable.
I know you mentioned it being 1d, but I think this works well and you can convert it to 1d very easily -
start = 10
pmt = 2
library(data.table)
dt <- data.table(
  month = 1:13
)
dt[, principalgrown := start * cumprod(month)]
# explained below #######
dt[, interestlost := 0]
for(i in seq(from = 4, to = (dim(dt)[1]), by = 3))
{
  dt[month >= i, interestlost := interestlost + (prod(i:month)), by = month]
}
#######################
dt[, finalamount := principalgrown - (pmt * interestlost)]
The part within the #s is the trick. Where you calculate the month-7 value as ((1*2*3*start - pmt)*4*5*6 - pmt) * 7, I calculate it as 1*2*3*4*5*6*7*start - 4*5*6*7*pmt - 7*pmt. Here 1*2*3*4*5*6*7*start is principalgrown, and -4*5*6*7*pmt - 7*pmt is -(pmt*interestlost).
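A quick numeric check of that identity with start = 10 and pmt = 2:
((1*2*3*10 - 2) * 4*5*6 - 2) * 7      # 48706
1*2*3*4*5*6*7*10 - 4*5*6*7*2 - 7*2    # 48706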

Faster solution to looped grouped RLE calculation

I have a working solution to my problem, but I will not be able to use it because it is so slow (my calculations predict that the whole simulation will take 2-3 years!). Thus I am looking for a better (faster) solution. This is (in essence) the code I am working with:
N = 4
x <- NULL
for (i in 1:N) { # first loop
  v <- sample(0:1, 1000000, 1/2)  # generate data
  v <- as.data.frame(v)           # convert to dataframe
  v$t <- rep(1:2, each = 250)     # group
  v$p <- rep(1:2000, each = 500)  # p.number
  # second loop
  for (j in 1:2000) {
    # count rle for group 1 for each pnumber
    x <- rbind(x, table(rle(v$v[v$t == 1 & v$p == j])))
    # count rle for group 2 for each pnumber
    x <- rbind(x, table(rle(v$v[v$t == 2 & v$p == j])))
  } # end second loop
} # end first loop
# total rle counts for both group 1 & 2
y <- aggregate(x, list(as.numeric(rownames(x))), sum)
In words: The code generates a coin-flip simulation (v). A group factor is generated (1 & 2). A p.number factor is generated (1:2000). The run lengths are recorded for each p.number (1:2000) for both group 1 & group 2 (each p.number has runs in both groups). After N loops (the first loop), the total run lengths are presented as a table (aggregate) (that is, the run lengths for each group, for each p.number, over N loops as a total).
I need the first loop because the data that I am working with comes in individual files (so I'm loading the file, calculating various statistics etc and then loading the next file and doing the same). I am much less attached to the second loop, but can't figure out how to replace it with something faster.
What can be done to the second loop to make it (hopefully, a lot) faster?
You are committing the cardinal sin of growing an object within a for() loop in R. Don't (I repeat don't) do this. Allocate sufficient storage for x at the beginning and then fill in x as you go.
x <- matrix(nrow = N * (2000 * 2), ncol = ??)
Then in the inner loop
x[ii, ] <- table(rle(....))
where ii is a loop counter that you initialise to 1 before the first loop and increment within the second loop:
x <- matrix(nrow = N * (2000 * 2), ncol = ??)
ii <- 1
for(i in 1:N) {
  .... # stuff here
  for(j in 1:2000) {
    .... # stuff here
    x[ii, ] <- table(rle(....))
    ## increment ii
    ii <- ii + 1
    x[ii, ] <- table(rle(....))
    ## increment ii
    ii <- ii + 1
  } ## end inner loop
} ## end outer loop
Also note that you are reusing the index i in both for() loops, which will not work. i is just a normal R object, so both for() loops will be overwriting it as they progress. Use j for the second loop as I did above.
Try that simple optimisation first and see if that will allow the real simulation to complete in an acceptable amount of time. If not, come back with a new Q showing the latest code and we can think about other optimisations. The optimisation above is simple to do, optimising table() and rle() might take a lot more work. Noting that, you might look at the tabulate() function which does the heavy lifting in table(), which might be one avenue for optimising that particular step.
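As a rough illustration of the tabulate() suggestion (just the counting step, not a drop-in replacement for table(rle(...))):
r <- rle(sample(0:1, 1000, replace = TRUE))
table(r$lengths)    # named counts of each observed run length
tabulate(r$lengths) # counts for run lengths 1, 2, ..., max as a plain integer vector (zeros included)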
If you just want to run rle and table for each combination of the values of v$t and v$p separately, there is no need for the second loop. It is much faster this way:
values <- v$v + v$t * 10 + v$p * 100
runlength <- rle(values)
runlength$values <- runlength$values %% 2
x <- table(runlength)
y <- aggregate(unclass(x), list(as.numeric(rownames(x))), sum)
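The encoding works because v$t * 10 + v$p * 100 is always even and distinct for every (t, p) pair, so runs also break at group boundaries and %% 2 recovers the original 0/1 value. A toy check for t = 1, p = 3:
c(0, 1) + 1 * 10 + 3 * 100         # 310 311
(c(0, 1) + 1 * 10 + 3 * 100) %% 2  # 0 1, the original coin values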
The whole code will look like this. If N is as low as 4, the growing object x will not be a severe problem. But generally I agree with @GavinSimpson that it is not a good programming technique.
N = 4
x <- NULL
for (i in 1:N) { # first loop
  v <- sample(0:1, 1000000, 1/2)  # generate data
  v <- as.data.frame(v)           # convert to dataframe
  v$t <- rep(1:2, each = 250)     # group
  v$p <- rep(1:2000, each = 500)  # p.number
  values <- v$v + N * 10 + v$t * 100 + v$p * 1000
  runlength <- rle(values)
  runlength$values <- runlength$values %% 2
  x <- rbind(x, table(runlength))
} # end first loop
y <- aggregate(x, list(as.numeric(rownames(x))), sum) # total rle counts for both group 1 & 2
