Faster if statement evaluations in R

So I am trying to figure out if there is a better way to combine multiple conditions in an if clause in R in order to speed up the process. Below is some code I wrote that runs very fast on large datasets in the simple case and not so fast in the not-so-simple case. Any suggestions are greatly appreciated! Also, the tic-toc functions are at the very bottom of the question in case you would like to run the code yourself and see how fast it runs.
Also, to give some intuition about what the code is doing: the first chunk simply determines whether there are any pairs of x's and y's that have larger values than all of the other x's and y's.
The second chunk of code does the same thing, but adds a further condition: if any of the x values are equal to each other, check which of them has the lowest y value; likewise, if any of the y values are equal to each other, check which of them has the lowest x value.
So, running the code in the simple case I have the following:
tic()
x = runif(10000)
y = runif(10000)
front = 1:length(x)
for(i in 1:length(x)){
for(n in 1:length(x)){
if((x[i]>x[n] & y[i]>y[n])){
front[i] = NA
break
}
}
}
toc()
So as you can see, I am only evaluating the single condition that x[i]>x[n] & y[i]>y[n]
> toc()
elapsed
1.28
and the code above runs in 1.28 seconds.
Now, running the code when I have three conditions to check I have the following:
tic()
x = runif(10000)
y = runif(10000)
front = 1:length(x)
for(i in 1:length(x)){
for(n in 1:length(x)){
if((x[i]>x[n] & y[i]>y[n]) | (x[i]==x[n] & y[i]!=min(y[which(x==x[i])])) | (y[i]==y[n] & x[i]!=min(x[which(y==y[i])]))){
front[i] = NA
break
}
}
}
toc()
So as you can see, I now have to check three conditions inside my if statement, namely,
(x[i]>x[n] & y[i]>y[n]) | (x[i]==x[n] & y[i]!=min(y[which(x==x[i])])) | (y[i]==y[n] & x[i]!=min(x[which(y==y[i])]))
However, this leads to a huge computational burden in R and makes the code much slower.
> toc()
elapsed
74.47
We see that the newly adapted code has slowed down considerably, to 74.47 seconds. Now I am looking for either alternative function calls that would speed up my code, or simply a "better" way of rewriting it so that the code is not so slow.
Here is the code for the tic-toc function if needed:
tic <- function(gcFirst = TRUE, type=c("elapsed", "user.self", "sys.self"))
{
type <- match.arg(type)
assign(".type", type, envir=baseenv())
if(gcFirst) gc(FALSE)
tic <- proc.time()[type]
assign(".tic", tic, envir=baseenv())
invisible(tic)
}
toc <- function()
{
type <- get(".type", envir=baseenv())
toc <- proc.time()[type]
tic <- get(".tic", envir=baseenv())
print(toc - tic)
invisible(toc)
}
EDIT for sashkello
So my code now looks like this:
library(mvtnorm)
#Here are the variables I will be working with
> x
[1] 0.53137100 0.75357474 0.87904120 0.29727488 0.00000000 0.00000000
[7] 0.00000000 0.00000000 0.00000000 0.04059217
> y
[1] 4.873500 3.896917 1.258215 5.776484 12.475491 5.273784 13.803158
[8] 4.472204 2.629839 6.689242
> front
[1] NA NA 3 NA NA NA NA NA 9 NA
> all.preds
[1] 0.596905183 0.027696850 1.005666896 0.007688514 3.900000000
x = x[!is.na(front)]
y = y[!is.na(front)]
mu = c(all.preds[1],all.preds[3])
sigma = matrix(c(all.preds[2],0,0,all.preds[4]),nrow=2)
z = rmvnorm(10000,mu,sigma)
z[,1] = sapply(z[,1],function(x){max(x,0)})
points(z,col="black",pch=19,cex=.01)
temp = 1:nrow(z)
for(i in 1:length(temp)){
cond1 = z[i,2]!=min(z[which(z[,1]==z[i,1]),2])
cond2 = z[i,1]!=min(z[which(z[,2]==z[i,2]),1])
for(n in 1:length(x)){
if((z[i,1]>x[n] & z[i,2]>y[n]) | (z[i,1]==x[n] & cond1) | (z[i,2]==y[n] & cond2)){
temp[i] = NA
break
}
}
}
prop = sum(!is.na(temp))/length(temp)
and those cond1 and cond2 statements still take horribly long. Any suggestions?

You can put y[i]!=min(y[which(x==x[i])]) and x[i]!=min(x[which(y==y[i])]) before the second loop, because they both only involve i.
for(i in 1:length(x)){
cond1 = y[i]!=min(y[which(x==x[i])])
cond2 = x[i]!=min(x[which(y==y[i])])
for(n in 1:length(x)){
if((x[i]>x[n] & y[i]>y[n]) | (x[i]==x[n] & cond1) | (y[i]==y[n] & cond2)){
front[i] = NA
break
}
}
}
This should speed things up significantly, because min and which are relatively expensive calls and you were running them on every iteration of the inner loop.
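Going one step further, the inner loop over n can be dropped entirely: all it does is ask whether at least one other point satisfies the condition, which is a single vectorized any() over the whole data. A minimal sketch of that idea (assuming the same x, y and front as in the question; it trades the early break for one vectorized scan per i):
for(i in 1:length(x)){
cond1 = y[i] != min(y[which(x == x[i])])
cond2 = x[i] != min(x[which(y == y[i])])
# one vectorized comparison against all points instead of an R-level loop over n
if(any((x[i] > x & y[i] > y) | (x[i] == x & cond1) | (y[i] == y & cond2))) front[i] = NA
}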

Since you asked for it, here is an efficient way to calculate cond1 outside of a for loop (which you probably don't need at all):
#some data
set.seed(42)
z <- matrix(sample(1:5, 200, TRUE), ncol=2)
#your loop
cond1 <- logical(100)
for (i in 1:100) {
cond1[i] = z[i,2]!=min(z[which(z[,1]==z[i,1]),2])
}
#alternative
library(data.table)
DT <- data.table(z)
DT[, id:=.I]
DT[, cond1:=V2!=min(V2), by=V1]
#compare results
identical(DT[, cond1], cond1)
#[1] TRUE
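If you would rather stay in base R, the same group-wise minimum can be written with ave(); a small sketch, reusing the z and cond1 defined just above:
#base R alternative: group column 2 by column 1 and compare each value to its group minimum
cond1_base <- z[, 2] != ave(z[, 2], z[, 1], FUN = min)
identical(cond1_base, cond1)
#should also be TRUE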

Related

Calculate vector whose length is not known beforehand - should I "grow" it?

I need to calculate entries of a vector whose length I do not know beforehand. How to do so efficiently?
A trivial solution is to "grow" it: start with a small or empty vector and successively append new entries until the stopping criterion is reached. For example:
foo <- numeric(0)
while ( sum(foo) < 100 ) foo <- c(foo,runif(1))
length(foo)
# 195
However, "growing" vectors is frowned upon in R for performance reasons.
Of course, I could "grow it in chunks": pre-allocate a "good-sized" vector, fill it, double its length when it is full, and finally cut it down to size. But this feels error-prone and will make for inelegant code.
Is there a better or canonical way to do this? (In my actual application, the calculation and the stopping criterion are a bit more complicated, of course.)
In reply to some useful comments
Even if you don't know the length beforehand, do you know the maximum possible length it can theoretically have? In such cases I tend to initialize the vector with that length and after the loop cut the NAs or remove the unused entries based on the latest index value.
No, the maximum length is not known in advance.
Do you need to keep all values as the vector grows?
Yes, I do.
What about something like rand_num <- runif(300); rand_num[cumsum(rand_num) < 100] where you choose a vector large enough that you know, with high probability, that the condition will be met? You can of course check it and use an even bigger number if it's not met. I've tested up to runif(10000) and it's still faster than "growing".
My actual use case involves a dynamic calculation, which I can't simply vectorize (otherwise I would not be asking).
Specifically, to approximate the convolution of negative binomial random variables, I need to calculate the probability masses of the integer random variable $K$ in Theorem 2 in Furman, 2007 up to a high cumulative probability. These masses $pr_k$ involve some intricate recursive sums.
I could "grow it in chunks": pre-allocate a "good-sized" vector, fill it, double its length when it is full, and finally cut it down to size. But this feels error-prone and will make for inelegant code.
Sounds like you are referring to the accepted answer of Collecting an unknown number of results in a loop. Have you coded it up and tried it? The idea of length doubling is more than sufficient (see the end of this answer), as the length will grow geometrically. I will demonstrate my method in the following.
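For reference, here is a minimal sketch of that length-doubling idea on the toy problem from the question (my own illustration; the chunking method of this answer follows below). The vector is pre-allocated, its length is doubled whenever it fills up, and the unused tail is trimmed at the end:
grow_by_doubling <- function(stop_sum) {
z <- numeric(16) ## initial capacity
n <- 0L ## number of filled entries
sum_z <- 0
while (sum_z < stop_sum) {
if (n == length(z)) length(z) <- 2L * length(z) ## double the capacity when full
n <- n + 1L
z[n] <- runif(1)
sum_z <- sum_z + z[n]
}
z[seq_len(n)] ## trim the unused tail
}
Each doubling copies the vector once into a buffer twice as large, so the total copying cost stays O(N), versus the O(N * N) of appending one element at a time with c.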
For test purpose, wrap your code in a function. Note how I avoid doing sum(z) for every while test.
ref <- function (stop_sum, timing = TRUE) {
set.seed(0) ## fix a seed to compare performance
if (timing) t1 <- proc.time()[[3]]
z <- numeric(0)
sum_z <- 0
while ( sum_z < stop_sum ) {
z_i <- runif(1)
z <- c(z, z_i)
sum_z <- sum_z + z_i
}
if (timing) {
t2 <- proc.time()[[3]]
return(t2 - t1) ## return execution time
} else {
return(z) ## return result
}
}
Chunking is necessary to reduce the operational costs of concatenation.
template <- function (chunk_size, stop_sum, timing = TRUE) {
set.seed(0) ## fix a seed to compare performance
if (timing) t1 <- proc.time()[[3]]
z <- vector("list") ## store all segments in a list
sum_z <- 0 ## cumulative sum
while ( sum_z < stop_sum ) {
segmt <- numeric(chunk_size) ## initialize a segment
i <- 1
while (i <= chunk_size) {
z_i <- runif(1) ## call a function & get a value
sum_z <- sum_z + z_i ## update cumulative sum
segmt[i] <- z_i ## fill in the segment
if (sum_z >= stop_sum) break ## ready to break at any time
i <- i + 1
}
## grow the list
if (sum_z < stop_sum) z <- c(z, list(segmt))
else z <- c(z, list(segmt[1:i]))
}
if (timing) {
t2 <- proc.time()[[3]]
return(t2 - t1) ## return execution time
} else {
return(unlist(z)) ## return result
}
}
Let's check correctness first.
z <- ref(1e+4, FALSE)
z1 <- template(5, 1e+4, FALSE)
z2 <- template(1000, 1e+4, FALSE)
range(z - z1)
#[1] 0 0
range(z - z2)
#[1] 0 0
Let's then compare speed.
## reference implementation
t0 <- ref(1e+4, TRUE)
## unrolling implementation
trial_chunk_size <- seq(5, 1000, by = 5)
tm <- sapply(trial_chunk_size, template, stop_sum = 1e+4, timing = TRUE)
## visualize timing statistics
plot(trial_chunk_size, tm, type = "l", ylim = c(0, t0), col = 2, bty = "l")
abline(h = t0, lwd = 2)
Looks like chunk_size = 200 is sufficiently good, and the speedup factor is
t0 / tm[trial_chunk_size == 200]
#[1] 16.90598
Let's finally see how much time is spent growing the vector with c, via profiling.
Rprof("a.out")
z0 <- ref(1e+4, FALSE)
Rprof(NULL)
summaryRprof("a.out")$by.self
# self.time self.pct total.time total.pct
#"c" 1.68 90.32 1.68 90.32
#"runif" 0.12 6.45 0.12 6.45
#"ref" 0.06 3.23 1.86 100.00
Rprof("b.out")
z1 <- template(200, 1e+4, FALSE)
Rprof(NULL)
summaryRprof("b.out")$by.self
# self.time self.pct total.time total.pct
#"runif" 0.10 83.33 0.10 83.33
#"c" 0.02 16.67 0.02 16.67
Adaptive chunk_size with linear growth
ref has O(N * N) operational complexity where N is the length of the final vector. template in principle has O(M * M) complexity, where M = N / chunk_size. To attain linear complexity O(N), chunk_size needs to grow with N, but a linear growth is sufficient: chunk_size <- chunk_size + 1.
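A quick back-of-the-envelope argument for why linear growth is enough (my sketch of the reasoning): if the $k$-th chunk has size $c_k = c_0 + k$, then the number of chunks $M$ needed to hold $N$ values satisfies $\sum_{k=1}^{M}(c_0 + k) \approx N$, so $M \approx \sqrt{2N}$. Growing the list with c copies roughly $k$ pointers at step $k$, so the total copying cost is $\sum_{k=1}^{M} O(k) = O(M^2) = O(N)$, i.e. linear in the final length.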
template1 <- function (chunk_size, stop_sum, timing = TRUE) {
set.seed(0) ## fix a seed to compare performance
if (timing) t1 <- proc.time()[[3]]
z <- vector("list") ## store all segments in a list
sum_z <- 0 ## cumulative sum
while ( sum_z < stop_sum ) {
segmt <- numeric(chunk_size) ## initialize a segment
i <- 1
while (i <= chunk_size) {
z_i <- runif(1) ## call a function & get a value
sum_z <- sum_z + z_i ## update cumulative sum
segmt[i] <- z_i ## fill in the segment
if (sum_z >= stop_sum) break ## ready to break at any time
i <- i + 1
}
## grow the list
if (sum_z < stop_sum) z <- c(z, list(segmt))
else z <- c(z, list(segmt[1:i]))
## increase chunk_size
chunk_size <- chunk_size + 1
}
## remove this line if you want
cat(sprintf("final chunk size = %d\n", chunk_size))
if (timing) {
t2 <- proc.time()[[3]]
return(t2 - t1) ## return execution time
} else {
return(unlist(z)) ## return result
}
}
A quick test verifies that we have attained linear complexity.
template1(200, 1e+4)
#final chunk size = 283
#[1] 0.103
template1(200, 1e+5)
#final chunk size = 664
#[1] 1.076
template1(200, 1e+6)
#final chunk size = 2012
#[1] 10.848
template1(200, 1e+7)
#final chunk size = 6330
#[1] 108.183

r vector of non-sums of self items

I am trying to build a function that creates a vector where any item is NOT the sum of any combination of other items in the list (without duplication).
This function does the job but is quite slow... any bright thoughts on how to improve it?
sum_fun <- function(k)
{
out_list <- c(2,3,4)
new_num <- 4
while(length(out_list) < k)
{
new_num <- new_num + 1
#Check if new_num can be written as a sum of the terms in out_list
new_valid <- T
for (i in 2:(length(out_list) - 1)){
if (new_num %in% (apply(combn(out_list,i), FUN = sum, MAR = 2)))
{
new_valid <- F
break
}
}
if (new_valid)
{
out_list <- c(out_list, new_num)
}
}
return(out_list)
}
This was a good question. I made some changes to your original function and got mine to run a bit quicker than your function. On a side note, how many are you trying to find?
The main idea is that we shouldn't recalculate things more often than we absolutely have to. I think the for loop was probably slowing things down a bit, and besides, how many of the column sums were repeated? If we can "de-dup" the list, maybe we can search through it more quickly [reduce, reuse, recycle :) ].
sum_fun2 <- function(k)
{
out_list <- c(2,3,4) #dummy list
new_num <- 4 #dummy number
calc_big_sum <- T #calculate big sum on the first go
while(length(out_list) < k)
{
new_num <- new_num + 1 #dummy number to add
#calculate big sum, and then find unique values
if(calc_big_sum){
big_sum<- unique(unlist(lapply(lapply(2:(length(out_list) - 1),
FUN = function(x) combn(out_list, m = x)),
FUN = function(y) apply(y, 2, sum))))
}
if(new_num %in% big_sum){
calc_big_sum = F #don't make it calculate the sum again
}else{
out_list <- c(out_list, new_num) #add number to list
calc_big_sum = T #make it calculate a new sum
}
}
return(out_list)
}
> system.time(sum_fun2(10))
user system elapsed
0.03 0.00 0.03
> system.time(sum_fun(10))
user system elapsed
1.30 0.00 1.27
> system.time(sum_fun2(14))
user system elapsed
3.35 0.07 3.47
> system.time(sum_fun(14))
## I ended it
Timing stopped at: 39.86 0 40.02
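As a quick sanity check (not run in the original post), the two versions should return the same sequence for a given k, since sum_fun2 only caches and de-duplicates the combination sums:
identical(sum_fun(10), sum_fun2(10)) ## expected to be TRUE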

Loop inside a loop in R

I am trying to create an R code that puts another loop inside of the one I've already created. Here is my code:
t <- rep(1,1000)
omega <- seq(from=1,to=12,by=1)
for(i in 1:1000){
omega <- setdiff(omega,sample(1:12,1))
t[i] <- length(omega)
remove <- 0
f <- length(t [! t %in% remove]) + 1
}
When I run this code, I get the number of trials f it takes to reach the zero vector, but I want to do 10000 iterations of this experiment.
replicate is probably how you want to run the outer loop. There's also no need for the f assignment to be inside the loop. Here I've moved it outside and converted it to simply the count of the elements of t that are greater than 0, plus 1.
result <- replicate(10000, {
t <- rep(1, 1000)
omega <- 1:12
for(i in seq_along(t)) {
omega <- setdiff(omega,sample(1:12,1))
t[i] <- length(omega)
}
sum(t > 0) + 1
})
I suspect your code could be simplified in other ways as well, and also that you could just write down the distribution that you're looking for without simulation. I believe your variable of interest is just how long until you get at least one of each of the numbers 1:12, yes?
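If that is indeed the quantity of interest, each repetition can be simulated directly as "number of draws until all 12 values have appeared", with no fixed 1000-step inner loop; a minimal sketch under that reading of the code:
draws_until_complete <- function(k = 12) {
seen <- logical(k) ## which of the k values have appeared so far
n_draws <- 0L
while (!all(seen)) {
n_draws <- n_draws + 1L
seen[sample.int(k, 1)] <- TRUE
}
n_draws
}
result <- replicate(10000, draws_until_complete())
For the expectation alone, the classic coupon-collector result gives 12 * sum(1/(1:12)), about 37.2 draws on average.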
Are you just looking to run your existing loop 10,000 times, like below? (Note that t and omega need to be reset at the start of each repetition, and the f calculation can sit outside the inner loop.)
f <- rep(NA, 10000)
for(j in 1:10000) {
t <- rep(1,1000)
omega <- seq(from=1,to=12,by=1) # reset omega for each repetition
for(i in 1:1000){
omega <- setdiff(omega,sample(1:12,1))
t[i] <- length(omega)
}
remove <- 0
f[j] <- length(t[!t %in% remove]) + 1
}

Rewriting loops with apply functions

I have the following 3 functions which I would like to make faster. I assume apply functions are the best way to go, but I have never used apply functions, so I have no idea what to do. Any hints, ideas and code snippets will be much appreciated.
n, T, dt are global parameters and par is a vector of parameters.
Function 1: a function to create an (m+1, n) matrix containing Poisson-distributed jumps with exponentially distributed jump sizes. My trouble here is that I have 3 loops and I am not sure how to incorporate the if statement in the inner loop. Also, I have no idea whether it is at all possible to use apply functions on the outer layers of the loops only.
jump <- function(t=0,T=T,par){
jump <- matrix(0,T/dt+1,n) # initializing output matrix
U <- replicate(n,runif(100,t,T)) #matrix used to decide when the jumps will happen
Y <-replicate(n,rexp(100,1/par[6])) #matrix with jump sizes
for (l in 1:n){
NT <- rpois(1,par[5]*T) #number of jumps
k=0
for (j in seq(t,T,dt)){
k=k+1
if (NT>0){
temp=0
for (i in 1:NT){
u <- vector("numeric",NT)
if (U[i,l]>j){ u[i]=0
}else u[i]=1
temp=temp+Y[i,l]*u[i]
}
jump[k,l]=temp
}else jump[k,l]=0
}
}
return(jump)
}
Function 2: calculates a default intensity, based on Brownian motions and the jumps from function 1. Here my trouble is how to use apply functions when the values used for the calculation come from the row above in the output matrix, AND how to get the right values from the external matrices used in the calculations (BMz_C & J).
lambda <- function(t=0,T=T,par,fit=0){
lambda <- matrix(0,m+1,n) # matrix to hold intensity path output
lambda[1,] <- par[4] #initializing start value of the intensity path.
J <- jump(t,T,par) #matrix containing jumps
for(i in 2:(m+1)){
dlambda <- par[1]*(par[2]-max(lambda[i-1,],0))*dt+par[3]*sqrt(max(lambda[i-1,],0))*BMz_C[i,]+(J[i,]-J[i-1,])
lambda[i,] <- lambda[i-1,]+dlambda
}
return(lambda)
}
Function 3: calculates a survival probability based on the intensity from function 2. Here a() and B() are functions that return numerical values. My problem here is that both i and j are used, because i is not always an integer and thus cannot be used to index the external matrix. I earlier tried to use i/dt, but sometimes it would overwrite one row and skip the next rows in the matrix, most likely due to rounding errors.
S <- function(t=0,T=T,par,plot=0, fit=0){
S <- matrix(0,(T-t)/dt+1,n)
if (fit > 0) S.fit <- matrix(0,1,length(mat)) else S.fit <- 0
l=lambda(t,T,par,fit)
j=0
for (i in seq(t,T,dt)){
j=j+1
S[j,] <- a(i,T,par)*exp(B(i,T,par)*l[j,])
}
return(S)
}
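One way to sidestep the rounding issue described above is to loop over integer row indices and derive the time value from the index, instead of looping over the non-integer times themselves; a minimal sketch, assuming the same global dt and n and the same a(), B() and lambda() as in the question (the fit branch is omitted for brevity):
S <- function(t=0, T=T, par, plot=0, fit=0){
l <- lambda(t, T, par, fit)
times <- seq(t, T, dt) # the time grid, built once
S <- matrix(0, length(times), n)
for (j in seq_along(times)){ # j is always an exact integer index
S[j,] <- a(times[j], T, par) * exp(B(times[j], T, par) * l[j,])
}
return(S)
}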
Sorry for the long post, any help for any of the functions will be much appreciated.
EDIT:
First of all thanks to digEmAll for the great reply.
I have now worked on vectorising function 2. First I tried
lambda <- function(t=0,T=T,par,fit=0){
lambda <- matrix(0,m+1,n) # matrix to hold intensity path input
J <- jump(t,T,par,fit)
lambda[1,] <- par[4]
lambda[2:(m+1),] <- sapply(2:(m+1), function(i){
lambda[i-1,]+par[1]*(par[2]-max(lambda[i-1,],0))*dt+par[3]*sqrt(max(lambda[i-1,],0))*BMz_C[i,]+(J[i,]-J[i-1,])
})
return(lambda)
}
but it would only produce the first column. So I tried a two step apply function.
lambda <- function(t=0,T=T,par,fit=0){
lambda <- matrix(0,m+1,n) # matrix to hold intensity path input
J <- jump(t,T,par,fit)
lambda[1,] <- par[4]
lambda[2:(m+1),] <- sapply(1:n, function(l){
sapply(2:(m+1), function(i){
lambda[i-1,l]+par[1]*(par[2]-max(lambda[i-1,l],0))*dt+par[3]*sqrt(max(lambda[i-1,l],0))*BMz_C[i,l]+(J[i,l]-J[i-1,l])
})
})
return(lambda)
}
This seems to work, but only for the first row; all rows after that have an identical non-zero value, as if lambda[i-1] is not used in the calculation of lambda[i]. Does anyone have an idea how to manage that?
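The reason both sapply attempts misbehave is that each call of the anonymous function sees the lambda matrix as it was when sapply started (mostly zeros); assignments inside the function never feed back into the next iteration, so a recurrence in i cannot be expressed this way. What can be vectorized is the other dimension: keep the loop over rows and update all n columns at once, which is essentially the shape of the original function 2. A minimal restatement of that idea (note it uses pmax for an element-wise floor at 0, which is an assumption on my part, since the original max collapses the whole row to a single scalar):
lambda <- function(t=0,T=T,par,fit=0){
lambda <- matrix(0,m+1,n)
J <- jump(t,T,par)
lambda[1,] <- par[4]
for(i in 2:(m+1)){
prev <- pmax(lambda[i-1,], 0) # element-wise floor at 0 (assumption; the original used max)
lambda[i,] <- lambda[i-1,] + par[1]*(par[2]-prev)*dt + par[3]*sqrt(prev)*BMz_C[i,] + (J[i,]-J[i-1,])
}
return(lambda)
}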
I'm going to explain to you, step-by-step, how to vectorize the first function (one possible way of vectorization, maybe not the best one for your case).
For the other 2 functions, you can simply apply the same concepts and you should be able to do it.
Here, the key concept is: start to vectorize from the innermost loop.
1) First of all, rpois can generate more than one random value at a time, but you are calling it n times, asking for one random value each time. So, let's take it out of the loop, obtaining this:
jump <- function(t=0,T=T,par){
jump <- matrix(0,T/dt+1,n)
U <- replicate(n,runif(100,t,T))
Y <-replicate(n,rexp(100,1/par[6]))
NTs <- rpois(n,par[5]*T) # note the change
for (l in 1:n){
NT <- NTs[l] # note the change
k=0
for (j in seq(t,T,dt)){
k=k+1
if (NT>0){
temp=0
for (i in 1:NT){
u <- vector("numeric",NT)
if (U[i,l]>j){ u[i]=0
}else u[i]=1
temp=temp+Y[i,l]*u[i]
}
jump[k,l]=temp
}else jump[k,l]=0
}
}
return(jump)
}
2) Similarly, it is useless/inefficient to call seq(t,T,dt) n times in the loop, since it will always generate the same sequence. So, let's take it out of the loop and store it in a vector, obtaining this:
jump <- function(t=0,T=T,par){
jump <- matrix(0,T/dt+1,n)
U <- replicate(n,runif(100,t,T))
Y <-replicate(n,rexp(100,1/par[6]))
NTs <- rpois(n,par[5]*T)
js <- seq(t,T,dt) # note the change
for (l in 1:n){
NT <- NTs[l]
k=0
for (j in js){ # note the change
k=k+1
if (NT>0){
temp=0
for (i in 1:NT){
u <- vector("numeric",NT)
if (U[i,l]>j){ u[i]=0
}else u[i]=1
temp=temp+Y[i,l]*u[i]
}
jump[k,l]=temp
}else jump[k,l]=0
}
}
return(jump)
}
3) Now, let's have a look at the innermost loop:
for (i in 1:NT){
u <- vector("numeric",NT)
if (U[i,l]>j){ u[i]=0
}else u[i]=1
temp=temp+Y[i,l]*u[i]
}
this is equal to :
u <- as.integer(U[1:NT,l]<=j)
temp <- sum(Y[1:NT,l]*u)
or, in one-line:
temp <- sum(Y[1:NT,l] * as.integer(U[1:NT,l] <= j))
hence, now the function can be written as :
jump <- function(t=0,T=T,par){
jump <- matrix(0,T/dt+1,n)
U <- replicate(n,runif(100,t,T))
Y <-replicate(n,rexp(100,1/par[6]))
NTs <- rpois(n,par[5]*T)
js <- seq(t,T,dt)
for (l in 1:n){
NT <- NTs[l]
k=0
for (j in js){
k=k+1
if (NT>0){
jump[k,l] <- sum(Y[1:NT,l]*as.integer(U[1:NT,l]<=j)) # note the change
}else jump[k,l]=0
}
}
return(jump)
}
4) Again, let's have a look at the current innermost loop:
for (j in js){
k=k+1
if (NT>0){
jump[k,l] <- sum(Y[1:NT,l]*as.integer(U[1:NT,l]<=j)) # note the change
}else jump[k,l]=0
}
As you can notice, NT does not depend on the iteration of this loop, so the inner if can be moved outside, as follows:
if (NT>0){
for (j in js){
k=k+1
jump[k,l] <- sum(Y[1:NT,l]*as.integer(U[1:NT,l]<=j)) # note the change
}
}else{
for (j in js){
k=k+1
jump[k,l]=0
}
}
This seems worse than before (well, yes, it is), but now the 2 branches can be turned into one-liners (note the use of sapply¹):
if (NT>0){
jump[1:length(js),l] <- sapply(js,function(j){ sum(Y[1:NT,l]*as.integer(U[1:NT,l]<=j)) })
}else{
jump[1:length(js),l] <- 0
}
obtaining the following jump function:
jump <- function(t=0,T=T,par){
jump <- matrix(0,T/dt+1,n)
U <- replicate(n,runif(100,t,T))
Y <-replicate(n,rexp(100,1/par[6]))
NTs <- rpois(n,par[5]*T)
js <- seq(t,T,dt)
for (l in 1:n){
NT <- NTs[l]
if (NT>0){
jump[1:length(js),l] <- sapply(js,function(j){ sum(Y[1:NT,l]*as.integer(U[1:NT,l]<=j)) })
}else{
jump[1:length(js),l] <- 0
}
}
return(jump)
}
5) Finally, we can get rid of the last loop, again using the sapply¹ function, obtaining the final jump function:
jump <- function(t=0,T=T,par){
U <- replicate(n,runif(100,t,T))
Y <-replicate(n,rexp(100,1/par[6]))
js <- seq(t,T,dt)
NTs <- rpois(n,par[5]*T)
jump <- sapply(1:n,function(l){
NT <- NTs[l]
if (NT>0){
sapply(js,function(j){ sum(Y[1:NT,l]*as.integer(U[1:NT,l]<=j)) })
}else {
rep(0,length(js))
}
})
return(jump)
}
(¹)
The sapply function is pretty easy to use. For each element of the list or vector passed in the X parameter, it applies the function passed in the FUN parameter, e.g.:
vect <- 1:3
sapply(X=vect,FUN=function(el){el+10})
# [1] 11 12 13
Since by default the simplify parameter is TRUE, the result is coerced to the simplest possible object. So, for example, in the previous case the result becomes a vector, while in the following example the result becomes a matrix (since for each element we return a vector of the same size):
vect <- 1:3
sapply(X=vect,FUN=function(el){rep(el,5)})
# [,1] [,2] [,3]
# [1,] 1 2 3
# [2,] 1 2 3
# [3,] 1 2 3
# [4,] 1 2 3
# [5,] 1 2 3
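One more detail about simplify worth knowing: if the per-element results have different lengths, sapply cannot simplify and silently returns a list instead, e.g.:
vect <- 1:3
sapply(X=vect,FUN=function(el){rep(el,el)})
# result lengths differ (1, 2, 3), so this stays a list: list(1, c(2,2), c(3,3,3))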
Benchmark:
The following benchmark just gives you an idea of the speed gain; the actual performance may differ depending on your input parameters.
As you can imagine, jump_old corresponds to your original function 1, while jump_new is the final vectorized version.
# let's use some random parameters
n = 10
m = 3
T = 13
par = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6)
dt <- 3
set.seed(123)
system.time(for(i in 1:5000) old <- jump_old(T=T,par=par))
# user system elapsed
# 12.39 0.00 12.41
set.seed(123)
system.time(for(i in 1:5000) new <- jump_new(T=T,par=par))
# user system elapsed
# 4.49 0.00 4.53
# check if last results of the 2 functions are the same:
isTRUE(all.equal(old,new))
# [1] TRUE

Fast calculations of the Pareto front in R

So I am trying to calculate the Pareto front (http://en.wikipedia.org/wiki/Pareto_efficiency) in R and am able to do it; however, I am not able to do it efficiently. In particular, as the number of pairs of points increases, the computation slows down considerably.
So in general, what I want to do is check for all non-dominated (or dominated) pairs. The way I have been doing this is to find all pairs of points such that xi > X and yi > Y, where (xi, yi) is a single pair and X and Y range over all the other points' x and y values. This part works very fast and is easy to implement; however, there is the additional possibility that multiple x values may be the same while having different y values, in which case I want to identify the x value that has the lowest y value (and vice versa for points that have identical y values but different x values).
To illustrate this point, here is a picture from Wikipedia (the Pareto frontier is drawn as a red line):
so basically I want to be able to identify all points that lie on the red line.
Here is my code that does work but is very inefficient for large datasets:
#Example Data that actually runs quickly
x = runif(10000)
y = runif(10000)
pareto = 1:length(x)
for(i in 1:length(x)){
cond1 = y[i]!=min(y[which(x==x[i])])
cond2 = x[i]!=min(x[which(y==y[i])])
for(n in 1:length(x)){
if((x[i]>x[n] & y[i]>y[n]) | (x[i]==x[n] & cond1) | (y[i]==y[n] & cond2)){
pareto[i] = NA
break
}
}
}
#All points not on the red line should be marks as NA in the pareto variable
The slowdown definitely comes from evaluating (x[i]==x[n] & cond1) | (y[i]==y[n] & cond2), but I cannot find a way around it or a better Boolean expression to capture everything that I want. Any suggestions greatly appreciated!
Following @BrodieG:
system.time( {
d = data.frame(x,y)
D = d[order(d$x,d$y,decreasing=FALSE),]
front = D[which(!duplicated(cummin(D$y))),]
} )
user system elapsed
0.02 0.00 0.02
which is 0.86/0.02 = 43 times faster!
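To see why the one-liner works: after sorting by x (and by y within ties of x), a point is on the front exactly when its y is smaller than every y seen so far, i.e. when the running minimum cummin(D$y) drops to a new value, and !duplicated(cummin(D$y)) picks out exactly those rows. A tiny worked example of my own:
x <- c(3, 1, 2, 2, 4)
y <- c(1, 5, 2, 4, 3)
d <- data.frame(x, y)
D <- d[order(d$x, d$y, decreasing=FALSE),] # sorted: (1,5) (2,2) (2,4) (3,1) (4,3)
cummin(D$y) # 5 2 2 1 1 -> new minima in rows 1, 2 and 4
D[which(!duplicated(cummin(D$y))),] # the front: (1,5), (2,2), (3,1)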
EDIT: new version:
system.time( {
pareto.2 <- logical(length(x))
x.sort <- sort(x)
y.sort <- y[order(x)]
y.min <- max(y)
for(i in 1:length(x.sort)) {
if(pareto.2[i] <- y.sort[i] <= y.min) y.min <- y.sort[i]
}
} )
# user system elapsed
# 0.036 0.000 0.035
OLD VERSION:
This is about 6x faster on my system. You can probably do better with a better algorithm, as well as with Rcpp, but this was straightforward. The trick here is to sort by x, which then lets you limit the check to making sure that all prior values of x (which are smaller) have greater values of y, ensuring that the point is on the frontier.
system.time( {
pareto.2 <- logical(length(x))
x.sort <- sort(x)
y.sort <- y[order(x)]
for(i in 1:length(x.sort)) {
pareto.2[i] <- all(y.sort[1:i] >= y.sort[i])
}
} )
# user system elapsed
# 0.86 0.00 0.88
The original:
pareto = 1:length(x)
system.time(
for(i in 1:length(x)){
cond1 = y[i]!= min(y[which(x==x[i])])
cond2 = x[i]!= min(x[which(y==y[i])])
for(n in 1:length(x)){
if((x[i]>x[n] & y[i]>y[n]) | (x[i]==x[n] & cond1) | (y[i]==y[n] & cond2)){
pareto[i] = NA
break
}
}
}
)
# user system elapsed
# 5.32 0.00 5.33
And showing the two methods produce the same result (a bit tricky because I need to re-order pareto.2 to the original order of x):
all.equal(pareto.2[match(1:length(x), order(x))], !is.na(pareto))
# [1] TRUE
Wanted to share my solution as a function. It's been tested and works pretty well for N Pareto fronts. Set fronts = Inf to calculate all fronts.
pareto_front <- function(x, y, fronts = 1, sort = TRUE) {
stopifnot(length(x) == length(y))
d <- data.frame(x, y)
Dtemp <- D <- d[order(d$x, d$y, decreasing = FALSE), ]
df <- data.frame()
i <- 1
while (nrow(Dtemp) >= 1 & i <= max(fronts)) {
these <- Dtemp[which(!duplicated(cummin(Dtemp$y))), ]
these$pareto_front <- i
df <- rbind(df, these)
Dtemp <- Dtemp[!row.names(Dtemp) %in% row.names(these), ]
i <- i + 1
}
ret <- merge(x = d, y = df, by = c("x", "y"), all.x = TRUE, sort = sort)
return(ret)
}
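A small usage sketch (with made-up data; the column names and the fronts argument follow the function above):
set.seed(1)
x <- runif(10)
y <- runif(10)
res <- pareto_front(x, y, fronts = 2) # label the first two fronts; remaining points get NA
head(res) # columns x, y and pareto_front
res_all <- pareto_front(x, y, fronts = Inf) # label every point with its front number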
