Progress bar for non-loop functions in R

I am currently running some functions on large data sets for which each operation takes a long time to execute.
To see the progress of my calculations, it would be handy to print the iterations/percentage of completed calculations. With loops, this can be easily done.
However, is it possible to get something similar working for vectorized functions or pre-defined functions without actually changing their source code?
Example data:
generate_string is taken from here: Generating Random Strings
generate_string <- function(n = 5000) {
  a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
  paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}
x <- generate_string(10000)
y <- generate_string(10000)
Example function to be monitored (i.e. printing the percentage completed):
library(stringdist)
# amatch will find for each element in x the index of the most similar element in y
ind <- amatch(x,y, method = "jw", maxDist = 1)

The pbapply package is an option, but it is slower than the direct call:
system.time({ind <- amatch(x,y, method = "jw", maxDist = 1)})
user system elapsed
27.79 0.05 9.72
library(pbapply)
ind <- pbsapply(x, function(xi) amatch(xi,y, method = "jw", maxDist = 1))
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 30s
Also, the option you mention (splitting the data into chunks) is less elegant but faster, and it is easily parallelizable (see the sketch after the timing output below).
library(progress)
system.time({
  nloops <- 20
  pp <- floor(nloops * (0:(length(x) - 1)) / length(x)) + 1
  ind <- c()
  pb <- progress_bar$new(total = nloops)
  for (i in 1:nloops) {
    pb$tick()
    ind <- c(ind, amatch(x[pp == i], y, method = "jw", maxDist = 1))
  }
  pb$terminate()
})
[===================================================================================] 100%
user system elapsed
25.96 0.06 9.21
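For completeness, here is a sketch (not benchmarked) of parallelizing the chunked approach while keeping a progress bar, using pbapply's optional cl argument; the cluster setup below is an assumption and may need adjusting for your machine:
library(parallel)
library(pbapply)
library(stringdist)

cl <- makeCluster(detectCores() - 1)
clusterEvalQ(cl, library(stringdist))  # load stringdist on the workers
clusterExport(cl, "y")                 # workers need y for amatch()

nloops <- 20
chunks <- split(x, floor(nloops * (seq_along(x) - 1) / length(x)) + 1)
ind <- unlist(pblapply(chunks, function(xi) amatch(xi, y, method = "jw", maxDist = 1), cl = cl),
              use.names = FALSE)
stopCluster(cl)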

Related

More efficient way to create frequency column based on different groupings?

I have code below that calculates a frequency for each column element (relative to its own column) and adds all five frequencies together in a new column. The code works but is very slow, and the majority of the processing time is spent in this function. Any ideas on how to accomplish the same goal more efficiently?
Create_Freq <- function(Word_List) {
  library(dplyr)
  Word_List$AvgFreq <- (Word_List %>% add_count(FirstLet))[, "n"] +
    (Word_List %>% add_count(SecLet))[, "n"] +
    (Word_List %>% add_count(ThirdtLet))[, "n"] +
    (Word_List %>% add_count(FourLet))[, "n"] +
    (Word_List %>% add_count(FifthLet))[, "n"]
  return(Word_List)
}
Edit:
To provide an example word list:
Word_List <- data.frame(Word = c("final", "first", "lover", "thing"))
Word_List$FirstLet <- substr(Word_List$Word,1,1)
Word_List$SecLet <- substr(Word_List$Word,2,2)
Word_List$ThirdtLet <- substr(Word_List$Word,3,3)
Word_List$FourLet <- substr(Word_List$Word,4,4)
Word_List$FifthLet <- substr(Word_List$Word,5,5)
For context, I have another function that will then choose the word with the highest "Average" frequency. (It used to be an average, but dividing by 5 was pointless since it didn't affect the max.)
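For illustration, a minimal sketch of that selection step (not part of the original post), assuming AvgFreq has already been filled in:
best_word <- Word_List$Word[which.max(Word_List$AvgFreq)]  # first word with the highest AvgFreq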
Here is one possible approach, defining a small auxiliary function f to access a list of counts. When tested, it is roughly 15 times faster on my machine.
f <- function(x, tbl){
  res <- integer(5)
  for (i in seq_along(tbl)){
    res[i] <- tbl[[i]][x[i]]  # count of letter x[i] at position i
  }
  sum(res)
}
Word_List <- data.frame(Word = c("final", "first", "lover", "thing"))
w <- unlist(Word_List, use.names = FALSE)
m <- matrix(unlist(strsplit(w, ""), use.names = FALSE), ncol = 4)  # 5 x 4: one column per word
lookup <- apply(m, 1, table)                                       # letter counts for each position
Word_List$AvgFreq <- apply(m, 2, f, lookup)
   Word AvgFreq
1 final       7
2 first       7
3 lover       5
4 thing       5
Further optimizations are possible, especially with a fully vectorized approach (see the sketch below).
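For instance, here is an untested sketch of one such vectorized variant: count the letters at each position with table() and look them up with match(), avoiding the per-word loop entirely (it assumes all words have the same length, as in the example):
letter_mat <- do.call(rbind, strsplit(as.character(Word_List$Word), ""))
Word_List$AvgFreq <- Reduce(`+`, lapply(seq_len(ncol(letter_mat)), function(i) {
  tab <- table(letter_mat[, i])                        # letter counts at position i
  as.integer(tab[match(letter_mat[, i], names(tab))])  # count for each word's letter
}))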
In response to Donald: using your approach ended up being much slower, but I had to make a couple of changes to get it to work with a large word list, so let me know if I messed up your methodology:
f <- function(x, tbl){
  res <- integer(5)
  for (i in seq_along(tbl)){
    res[i] <- tbl[[i]][x[i]]
  }
  sum(res)
}
Word_List <- data.frame(read.delim("Word List.txt"))
Word_List <- Turn_Vector_List(Word_List)
Word_List2 <- data.frame(read.delim("Word List.txt"))
Word_List_Vector <- Turn_Vector_List(Word_List2)
# Start the clock!
ptm <- proc.time()
m <- data.matrix(Word_List[2:6])
m
lookup <- apply(m, 2, table, simplify = FALSE)
lookup
Word_List$AvgFreq <- apply(m, 1, f, lookup)
# Stop the clock
ptm2 <- proc.time() - ptm
Word_List2 <- data.frame(read.delim("Word List.txt"))
Word_List_Vector <- Turn_Vector_List(Word_List2)
Word_List2 <- Create_Freq(Word_List_Vector)
ptm3 <- proc.time() - ptm - ptm2
ptm2
# user system elapsed
# 0.89 0.78 1.69
ptm3
# user system elapsed
# 0.06 0.00 0.06

How do I iterate over several lists and matrices to call a function using parallel processing in R?

I have been trying to use an R function called ipsi, which takes the arguments (a, y, id, time, x.trt, x.out, delta.seq, nsplits). Originally, the components of the arguments were in one dataframe (except for delta.seq and nsplits, which are coded later), but my understanding is that I needed to put them in separate lists and, in the case of x.trt and x.out, matrices. This function is very easy to run on one of each argument, but since I multiply imputed the dataframe 30 times before splitting it up into the different elements to be taken as ipsi arguments, I now want to iterate over the set of elements 30 times as if there were 30 dataframes. Additionally, I want to parallelize to make the most of my computing power.
I have just expanded the npcausal example:
n <- 500
T <- 4
time <- rep(1:T, n)
time <- rep(list(time), 30)
id <- rep(1:n, rep(T, n))
id <- rep(list(id), 30)
x.trt <- matrix(rnorm(n * T * 5), nrow = n * T)
x.trt <- rep(list(x.trt), 30)
x.out <- matrix(rnorm(n * T * 5), nrow = n * T)
x.out <- rep(list(x.out), 30)
a <- rbinom(n * T, 1, .5)
a <- rep(list(a), 30)
y <- rnorm(n, mean = 1)
y <- rep(list(y), 30)
d.seq <- seq(0.1, 5, length.out = 10)
d.seq <- rep(list(d.seq), 30)
set.seed(500, kind = "L'Ecuyer-CMRG")
numcores <- future::availableCores()
cl <- parallel::makeCluster(numcores)
parallel::clusterEvalQ(cl, library(dplyr))
parallel::clusterEvalQ(cl, library(npcausal))
parallel::clusterExport(cl, "d.seq", envir = environment())
parallel::clusterEvalQ(cl, d.seq <- d.seq)
new_element <- parallel::parLapply(cl = cl, for(i in 1:30){
npcausal::ipsi(a = a[[i]],
y = y[[i]],
id = id[[i]],
time = time[[i]],
x.out = x.out[[i]],
x.trt = x.trt[[i]],
delta.seq = d.seq[[i]],
nsplits = 10)
})
This actually runs, but at the end of the process it gives me an error saying that the FUN was missing. I knew that already, but I have no FUN to call besides ipsi. Thanks for any help you can provide.
My suggestion is to first figure out how to do it with a regular base-R *apply function, without worrying about parallelization. I suspect you can use mapply() for this, so something like (not confirmed):
res <- mapply(
  a, y, id, time, x.out, x.trt, d.seq,
  FUN = function(a_i, y_i, id_i, time_i, x.out_i, x.trt_i, d.seq_i) {
    npcausal::ipsi(a = a_i, y = y_i, id = id_i, time = time_i,
                   x.out = x.out_i, x.trt = x.trt_i, delta.seq = d.seq_i,
                   nsplits = 10)
  }
)
When you figured that part out, you can start thinking about parallelization.
(Disclaimer: I'm the author) If you get an mapply() solution to work, then the simplest would be to replace that as-is with future_mapply() of the future.apply package. That will parallelize on your local machine if you set plan(multisession).
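For reference, a minimal sketch (not verified) of what that replacement could look like once the mapply() call works, with the same argument handling as above:
library(future.apply)
plan(multisession)  # parallelize on the local machine

res <- future_mapply(
  FUN = function(a_i, y_i, id_i, time_i, x.out_i, x.trt_i, d.seq_i) {
    npcausal::ipsi(a = a_i, y = y_i, id = id_i, time = time_i,
                   x.out = x.out_i, x.trt = x.trt_i, delta.seq = d.seq_i,
                   nsplits = 10)
  },
  a, y, id, time, x.out, x.trt, d.seq,
  SIMPLIFY = FALSE,   # keep one ipsi result per imputed data set
  future.seed = TRUE  # statistically sound parallel RNG
)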

Manipulation with large dimensional matrix in R

I would like to calculate the following function for each row of a matrix M of dimension 3e+07 x 4.
func <- function(x){
  (dmultinom(c(x[c(1,2)], 50 - sum(x[c(1,2)])), size = NULL, rep(1/3,3), log = FALSE)) / (x[3]^2 + x[4]^3)
}
I am using the following code
as.numeric(unlist(apply(M, 1, function(v) func(v))))
Unfortunately, it takes a very long time, and I would like to speed it up.
Luckily, lgamma is a primitive function, so one option is to vectorize dmultinom yourself. Here is an approach combined with data.table for extra speed:
library(data.table)
set.seed(0L)
nr <- 3e7
size <- 50L
DT <- data.table(X1 = sample(1:20, nr, TRUE), X2 = sample(1:20, nr, TRUE), X3 = 3, X4 = 4)
system.time({
  DT[, paste0("lgX", 1L:3L) := c(lapply(1 + .SD, lgamma), .(lgamma(1 + size - X1 - X2))), .SDcols = X1:X2][,
    dmn := exp(lgamma(size + 1L) + log(1/3) * size - (lgX1 + lgX2 + lgX3)) / (X3^2 + X4^3)]  # x[4]^3 as in func()
  DT$dmn
})
# user system elapsed
# 7.44 0.17 7.64
Something like this, but I haven't verified the correctness. Your:
func <- function(x) {
  (dmultinom(c(x[c(1,2)], 50 - sum(x[c(1,2)])), size = NULL, rep(1/3,3), log = FALSE)) / (x[3]^2 + x[4]^3)
}
can be written as:
func <- function(x) {
  a <- x[c(1,2)]
  b <- 50 - (a[1] + a[2])
  d <- c(a, b)
  e <- dmultinom(d, size = NULL, rep(1/3,3), log = FALSE)
  f <- x[3]^2 + x[4]^3
  e / f
}
The d part you can vectorize via matrix calculations as:
A <- M[, 1:2]
B <- 50 - (A[,1] + A[,2])
D <- cbind(A, B)
Without diving into dmultinom(), the e part can be calculated via apply() as:
prob <- rep(1/3, times = 3L)
E <- apply(D, MARGIN = 1L, function(d) {
  dmultinom(d, size = NULL, prob = prob, log = FALSE)
})
The f part you can vectorize via matrix calculations as:
F <- M[,3]^2 + M[,4]^3
which gives that:
Y <- apply(M, 1, function(v) func(v))
can be written as:
Y <- E / F
Disclaimer: I haven't verified this, but you should get the idea of how to vectorize and avoid duplicated work.
PS. If you look at dmultinom(), I think you can vectorize that as well in a similar fashion. It's not unlikely that you can get rid of the remaining apply() call.
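To illustrate that last point, here is an untested sketch of vectorizing the dmultinom() step with lgamma() in base R, assuming (as in func()) that the total is 50 and the probabilities are rep(1/3, 3) for every row:
size <- 50
A1 <- M[, 1]
A2 <- M[, 2]
B <- size - A1 - A2
log_dens <- lgamma(size + 1) -
  (lgamma(A1 + 1) + lgamma(A2 + 1) + lgamma(B + 1)) +
  size * log(1/3)
E <- exp(log_dens)              # replaces the apply()/dmultinom() step
Y <- E / (M[, 3]^2 + M[, 4]^3)  # same as E / F above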

Optimise R Code - Sampling returns from S&P500 series

I am trying to see how "winning percentage" affects returns for a trading strategy.
I download the S&P 500 prices and calculate the daily returns. Then I randomly select x% of these returns and say I correctly predicted their direction, so the return is positive. For the remaining (1-x)%, I say I was wrong and the return is negative. I replicate this process, say, 1000 times and collect the annualised geometric return.
I vary x from 0.5 to 0.6 in 0.01 increments.
Here is my code:
library(quantmod)
library(parallel)  # mclapply() now lives here (the old multicore package is superseded)

getSymbols("^GSPC", from = "1950-1-1")
ret <- ROC(GSPC)[-1, 4]

drawsample <- function(ret, winpct){
  len = length(ret)
  ret = abs(ret)
  win = sample(1:len, round(winpct * len))
  a = c(ret[win], -ret[-win])
  return(prod(1 + a) ^ (252 / length(a)) - 1)
}

set.seed(123)
winpct <- seq(0.5, 0.6, 0.01)
ret <- coredata(ret)
system.time(res <- simplify2array(mclapply(winpct, function(x) replicate(1000, drawsample(ret, x)))))
Time taken:
user system elapsed
18.904 0.842 5.580
Are there any further optimisations I can do to speed things up?
I made the following two tweaks:
1/ use exp(sum(a)) rather than prod(1 + a). I think you want this anyway, as you have created a log-returns series with ROC(), and log returns compound by summing and exponentiating (prod(1 + a) would treat them as simple returns). According to rbenchmark this got me a speedup of about 7%.
2/ sample from c(-1, 1) for the length of the ret series, and then multiply by the ret series to obtain the signed series of returns. This got me another 30%.
Note that in my code, I've renamed your a as bin.
drawsample2 <- function(ret, winpct){
  len = length(ret)
  win = sample(c(-1, 1), len, replace = TRUE, prob = c((1 - winpct), winpct))
  ret <- abs(ret)
  bin <- ret * win
  return(exp(sum(bin))^(252 / length(ret)) - 1)
}
Benchmarking against your drawsample(), I get a ~37% speedup.
library(rbenchmark)
bb <- benchmark(simplify2array(mclapply(winpct, function(x) replicate(1000, drawsample(ret, x)))),
                simplify2array(mclapply(winpct, function(x) replicate(1000, drawsample2(ret, x)))),
                columns = c('test', 'elapsed', 'relative'),
                replications = 10,
                order = 'elapsed')
On my MBP, here are the benchmarks:
> bb
elapsed relative
2 17.254 1.000
1 27.734 1.607
Here's a tweak of ricardo's function that is faster for larger objects. I removed the calls to mclapply in order to isolate the performance of the functions themselves, without the overhead of multi-core processing.
drawsample_r <- function(ret, winpct){
  len = length(ret)
  win = sample(c(-1, 1), len, replace = TRUE, prob = c((1 - winpct), winpct))
  ret <- abs(ret)
  bin <- ret * win
  return(exp(sum(bin))^(252 / length(ret)) - 1)
}
drawsample_j <- function(ret, winpct){
  len <- length(ret)
  win <- c(-1L, 1L)[sample.int(2L, len, TRUE, c(1 - winpct, winpct))]
  exp(sum(abs(ret) * win))^(252L / len) - 1L
}
library(rbenchmark)
set.seed(123)
ret <- rnorm(1e6)/100 # 1 million observations
winpct <- seq(0.5, 0.6, 0.01)
benchmark(sapply(winpct, drawsample_r, ret=ret),
sapply(winpct, drawsample_j, ret=ret),
replications=10, order='elapsed')[,1:5]
# test replications elapsed relative user.self
# 2 sapply(winpct, drawsample_j, ret=ret) 10 6.963 1.000 6.956
# 1 sapply(winpct, drawsample_r, ret=ret) 10 10.852 1.559 10.689

Efficient apply or mapply for multiple matrix arguments by row

I have two matrices that I want to apply a function to, by rows:
matrixA
GSM83009 GSM83037 GSM83002 GSM83029 GSM83041
100001_at 5.873321 5.416164 3.512227 6.064150 3.713696
100005_at 5.807870 6.810829 6.105804 6.644000 6.142413
100006_at 2.757023 4.144046 1.622930 1.831877 3.694880
matrixB
GSM82939 GSM82940 GSM82974 GSM82975
100001_at 3.673556 2.372952 3.228049 3.555816
100005_at 6.916954 6.909533 6.928252 7.003377
100006_at 4.277985 4.856986 3.670161 4.075533
I've found several similar questions, but not a whole lot of answers: mapply for matrices, Multi matrix row-wise mapply?. The code I have now splits the matrices by row into lists, but having to split them makes it rather slow and not much faster than a for loop, considering I have almost 9000 rows in each matrix:
scores <- mapply(t.test.stat, split(matrixA, row(matrixA)), split(matrixB, row(matrixB)))
The function itself is very simple, just finding the t-value:
t.test.stat <- function(x, y)
{
  return( (mean(x) - mean(y)) / sqrt(var(x)/length(x) + var(y)/length(y)) )
}
Splitting the matrices isn't the biggest contributor to evaluation time.
set.seed(21)
matrixA <- matrix(rnorm(5 * 9000), nrow = 9000)
matrixB <- matrix(rnorm(4 * 9000), nrow = 9000)
system.time( scores <- mapply(t.test.stat,
split(matrixA, row(matrixA)), split(matrixB, row(matrixB))) )
# user system elapsed
# 1.57 0.00 1.58
smA <- split(matrixA, row(matrixA))
smB <- split(matrixB, row(matrixB))
system.time( scores <- mapply(t.test.stat, smA, smB) )
# user system elapsed
# 1.14 0.00 1.14
Look at the output from Rprof to see that most of the time is--not surprisingly--spent evaluating t.test.stat (mean, var, etc.). Basically, there's quite a bit of overhead from function calls.
Rprof()
scores <- mapply(t.test.stat, smA, smB)
Rprof(NULL)
summaryRprof()
You may be able to find faster generalized solutions, but none will approach the speed of the vectorized solution below.
Since your function is simple, you can take advantage of the vectorized rowMeans function to do this almost instantaneously (though it's a bit messy):
system.time({
  ncA <- NCOL(matrixA)
  ncB <- NCOL(matrixB)
  ans <- (rowMeans(matrixA) - rowMeans(matrixB)) /
    sqrt( rowMeans((matrixA - rowMeans(matrixA))^2) * (ncA/(ncA-1))/ncA +
          rowMeans((matrixB - rowMeans(matrixB))^2) * (ncB/(ncB-1))/ncB )
})
# user system elapsed
# 0 0 0
head(ans)
# [1] 0.8272511 -1.0965269 0.9862844 -0.6026452 -0.2477661 1.1896181
UPDATE
Here's a "cleaner" version using a rowVars function:
rowVars <- function(x, na.rm=FALSE, dims=1L) {
  rowMeans((x - rowMeans(x, na.rm, dims))^2, na.rm, dims) * (NCOL(x)/(NCOL(x)-1))
}
ans <- (rowMeans(matrixA) - rowMeans(matrixB)) /
  sqrt( rowVars(matrixA)/NCOL(matrixA) + rowVars(matrixB)/NCOL(matrixB) )
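As a quick sanity check (assuming the scores vector from the mapply() call above is still around), the vectorized result should agree with it:
all.equal(unname(scores), ans)  # should be TRUE up to floating-point tolerance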
This solution avoids splitting and lists, so it may be faster than your version:
## original data:
tmp1 <- matrix(sample(1:100, 20), nrow = 5)
tmp2 <- matrix(sample(1:100, 20), nrow = 5)
## combine them together
tmp3 <- cbind(tmp1, tmp2)
## calculate t.stats:
t.stats <- apply(tmp3, 1, function(x) t.test(x[1:ncol(tmp1)],
x[(1 + ncol(tmp1)):ncol(tmp3)])$statistic)
Edit: Just tested it on two matrices of 9000 rows and 5 columns each, and it completed in less than 6 seconds:
tmp1 <- matrix(rnorm(5 * 9000), nrow = 9000)
tmp2 <- matrix(rnorm(5 * 9000), nrow = 9000)
tmp3 <- cbind(tmp1, tmp2)
system.time(t.st <- apply(tmp3, 1, function(x) t.test(x[1:5], x[6:10])$statistic))
#    user  system elapsed
#   5.640   0.012   5.705
