parallel nested foreach loops

parallel nested foreach loops - r

I'm trying to code a nested parallel foreach loop for a Metropolis-Hastings algorithm, but the matrices aren't combining correctly. Sample code is below, the final matrix, mtx2, should be same dimensions as the original, mtx, but with some rows randomly altered. How should the matrix rows be combined?
I tried the foreach package directly, but same result - mtx2 combines the columns 5 times.
# library(doParallel)
library(foreach)
no_cores <- detectCores() - 2
cl <- makeCluster(no_cores)
registerDoParallel(cl)
mtx <- matrix(data=rnorm(n=1e3*5,mean=0,sd=1),nrow=1e3,ncol=5)
mtx2 <- matrix(data=NA,nrow=1e3,ncol=5)
#basic for loop - slow for large number of rows
for(k in 1:nrow(mtx)){
for(r in 1:5) {
if(runif(n=1,min=0,max=1)>0.9){
mtx2[k,] <- mtx[k,]*10
}else{
mtx2[k,] <- mtx[k,]
}
}
}
#series version for de-bugging
mtx2 <-foreach(k=1:nrow(mtx),.combine="rbind") %do% {
foreach(r=1:5,.combine="c") %do% {
if(runif(n=1,min=0,max=1)>0.9){
mtx[k,]*10
}else{
mtx[k,]
}
}
}
#parallel version
mtx2 <-foreach(k=1:nrow(mtx),.combine="rbind") %:% {
foreach(r=1:5,.combine="c") %dopar% {
if(runif(n=1,min=0,max=1)>0.9){
mtx[k,]*10
}else{
mtx[k,]
}
}
}
mtx2 <- round(mtx2,2)

To expand on comments, you can skip the loop by creating your logical comparison all at once. Here, we create runif(nrow(mtx) * ncol(mtx)) but only take every 5th result to match up the OP inner loop of for (r in 1:5) {...}
The key point is that while the OP question of finding a method of updating a matrix in a nested parallel loop is not possible for this approach, refactoring code can sometimes provide significant performance gains.
nr = 1e4
nc = 5
mtx <- matrix(data=rnorm(n=nr*nc,mean=0,sd=1),nrow=nr,ncol=nc)
set.seed(123L)
lgl = matrix(runif(n = nr * nc), ncol = nc, byrow = TRUE)[, nc] > 0.9
mtx3 = sweep(mtx, 1L, 1 + 9 * lgl, FUN = '*')
all.equal(mtx2, mtx3) ##mtx2 was created with set.seed(123L)
# [1] TRUE
For 1 million rows this is significantly faster:
system.time({
lgl = matrix(runif(n = nr * nc), ncol = nc, byrow = TRUE)[, nc] > 0.9
mtx3 = sweep(mtx, 1L, 1 + 9 * lgl, FUN = '*')
})
## user system elapsed
## 0.27 0.00 0.27
system.time({
for(k in 1:nrow(mtx)){
for(r in 1:5) {
if(runif(n=1,min=0,max=1)>0.9){
mtx2[k,] <- mtx[k,]*10
}else{
mtx2[k,] <- mtx[k,]
}
}
}
})
## user system elapsed
## 14.09 0.03 14.12

Related

R: performance issues when computing mutual information matrix with NAs

I realized that computing mutual information on a dataframe with NA using R's infotheo package does not yield errors but incorrect results. The problem is described in more detail here but while I now have a mathematically correct solution which only removes pairwise incomplete cases instead of across all columns the performance for large data sets it catastrophic. I guess it is the nested for loop which causes the long compute times, does anyone have an idea how to improve performance of the below code?
library(infotheo)
v1 <- c(1,2,3,4,5,NA,NA,NA,NA,NA)
v2 <- c(1,NA,3,NA,5,NA,7,NA,9,NA)
v3 <- c(NA,2,3,NA,NA,6,7,NA,7,NA)
v4 <- c(NA,NA,NA,NA,NA,6,7,8,9,10)
df <- cbind.data.frame(v1,v2,v3,v4)
ColPairMap<-function(df){
t <- data.frame(matrix(ncol = ncol(df), nrow = ncol(df)))
colnames(t) <- colnames(df)
rownames(t) <- colnames(df)
for (j in 1:ncol(df)) {
for (i in 1:ncol(df)) {
c(1:ncol(df))
if (nrow(df[complete.cases(df[,c(i,j)]),])>0) {
t[j,i] <- natstobits(mutinformation(df[complete.cases(df[,c(i,j)]),j], df[complete.cases(df[,c(i,j)]),i]))
} else {
t[j,i] <- 0
}
}
}
return(t)
}
ColPairMap(df)
Thanks in advance!

Twice the speed.
ColPairMap2 <- function(df){
t <- matrix(0, ncol = ncol(df), nrow = ncol(df),
dimnames = list(colnames(df), colnames(df)))
df <- as.matrix(df)
for (j in 1:ncol(df)) {
for (i in j:ncol(df)) {
compl_cases <- complete.cases(df[, c(i, j)])
if (sum(compl_cases) > 0) {
t[j,i] <- natstobits(mutinformation(df[compl_cases, j],
df[compl_cases, i]))
}
}
}
lt <- lower.tri(t)
t[lt] <- t[lt] + t(t)[lt]
t
}
all(ColPairMap(df) == ColPairMap2(df))
#[1] TRUE
Test the speed.
library(microbenchmark)
mb <- microbenchmark(
f1 = ColPairMap(df),
f2 = ColPairMap2(df)
)
print(mb, order = "median", unit = "relative")
#Unit: relative
# expr min lq mean median uq max neval cld
# f2 1.000000 1.00000 1.000000 1.000000 1.000000 1.000000 100 a
# f1 2.035973 2.01852 1.907398 2.008894 2.108486 0.569771 100 b

I found a tweak which is not helping for toy data sets as df above but for real world data sets, especially when executed on some proper H/W I've seen examples where it reduces a 2.5hrs compute time to 14min!
The code below is a complete copy&pastable exmple which incorporates Rui's solution using a nested for loop and building on this idea another solution using a nested 'foreach' loop parallelizing the task on 75% of the available cores.
You can control the size of the data set and consequently the compute time by adjusting n.
library(foreach)
library(parallel)
library(doParallel)
library(infotheo)
n <- 500 #creates an nXn matrix, the larger the more compute time is required
df <- (discretize(matrix(rnorm(4*n*n,n,n/10),ncol=n)))
## pairwise complete mutual information via nested for loop ##
start_for <- Sys.time()
ColPairMap<-function(df){
t <- data.frame(matrix(ncol = ncol(df), nrow = ncol(df)))
colnames(t) <- colnames(df)
rownames(t) <- colnames(df)
for (j in 1:ncol(df)) {
for (i in 1:ncol(df)) {
c(1:ncol(df))
if (nrow(df[complete.cases(df[,c(i,j)]),])>0) {
t[j,i] <- natstobits(mutinformation(df[complete.cases(df[,c(i,j)]),j], df[complete.cases(df[,c(i,j)]),i]))
} else {
t[j,i] <- 0
}
}
}
return(t)
}
ColPairMap(df)
end_for <- Sys.time()
end_for-start_for
## pairwise complete mutual information via nested foreach loop ##
start_foreach <- Sys.time()
ncl <- max(2,floor(detectCores()*0.75)) #number of cores
clst <- makeCluster(n=ncl,type="TERR") #create cluster
#e <- new.env() #new environment to export libraries to cores
#e$libs <- .libPaths()
#clusterExport(clst, "libs", envir=e) #export required packages to all cores
#clusterEvalQ(clst, .libPaths(libs)) #export required packages to all cores
clusterEvalQ(clst, { #export required packages to all cores
library(infotheo)
})
registerDoParallel(cl = clst) #register cluster
t <- foreach (j=1:ncol(df), .combine="c") %:% #parallellized nested loop for computing normalized pairwise complete MI between all columns
foreach (i=j:ncol(df), .combine="c", .packages="infotheo") %dopar% {
combine="c"
compl_cases <- complete.cases(df[,c(i,j)])
if (sum(compl_cases) > 0) {
natstobits(mutinformation(df[compl_cases,][,j], df[compl_cases,][,i]))
} else {
0
}
}
RCA_MI_Matrix <- matrix(0, ncol = ncol(df), nrow = ncol(df), dimnames = list(colnames(df), colnames(df))) #set-up empty matrix for MI values
RCA_MI_Matrix[lower.tri(RCA_MI_Matrix, diag=TRUE)] <- t #fill lower triangle with MI values from nested loop
RCA_MI_Matrix[upper.tri(RCA_MI_Matrix)] <- t(RCA_MI_Matrix)[upper.tri(RCA_MI_Matrix)] #mirror lower triangle of matrix into upper one
end_foreach <- Sys.time()
end_foreach-start_foreach
stopCluster(cl=clst) #stop cluster

How to minimize the unacceptably long run time of the created R code

There is a code with three for loops running with data containing enough missing values. The major problem is with the unacceptably long run time which seems to take at least more than a month although I try to keep my PC opened during most of the day.
The structure below is 100% correct from what I am trying to achieve when I test with a very few data points. But as the number of columns and rows become 2781 and 280, respectively, I perceive it takes forever although I am 100% sure that this is running correctly even when I see the updated environment window of my R-Studio each time I refresh it.
My data also has lots of missing values, probably 40% or something. I think this is making the computation time extremely longer as well.
The data dimension is 315 * 2781.
However, I am trying to achieve an output in a 280 * 2781 matrix form.
May I please get help minimizing the run time of this following code?
It would be very appreciated if I can!
options(java.parameters = "- Xmx8000m")
memory.limit(size=8e+6)
data=read.table("C:/Data/input.txt",T,sep="\t");
data=data.frame(data)[,-1]
corr<-NULL
corr2<-NULL
corr3<-NULL
for(i in 1:280)
{
corr2<-NULL
for(j in 1:2781)
{
data2<-data[,-j]
corr<-NULL
for(k in 1:2780)
{
ifelse((is.error(grangertest(data[i:(i+35),j] ~ data2[i:(i+35),k], order = 1, na.action = na.omit)$P[2])==TRUE) || (grangertest(data[i:(i+35),j] ~ data2[i:(i+35),k], order = 1, na.action = na.omit)$P[2])>0.05|| (is.na(grangertest(data[i:(i+35),j] ~ data2[i:(i+35),k], order = 1, na.action = na.omit)$P[2])==TRUE),corr<-cbind(corr,0),corr<-cbind(corr,1))
}
corr2<-rbind(corr2,corr)
}
corr3<-rbind(corr3,rowSums(corr2))
}
The snippet of my data is as below:
> dput(data[1:30, 1:10])
structure(c(0.567388170165941, 0.193093325709924, 0.965938209090382,
0.348295788047835, 0.496113050729036, 0.0645384560339153, 0.946750836912543,
0.642093246569857, 0.565092500532046, 0.0952424583956599, 0.444063827162609,
0.709971546428278, 0.756330407923087, 0.601746253203601, 0.341865634545684,
0.953319212188944, 0.0788604547269642, 0.990508111426607, 0.35519331949763,
0.697004508692771, 0.285368352662772, 0.274287624517456, 0.575733694015071,
0.12937490013428, 0.00476219342090189, 0.684308280004188, 0.189448777819052,
0.615732178557664, 0.404873769031838, 0.357331350911409, 0.565436001634225,
0.380773033713922, 0.348490287549794, 0.0473814208526164, 0.389312234241515,
0.562123290728778, 0.30642102798447, 0.911173274740577, 0.566258994862437,
0.837928073247895, 0.107747194357216, 0.253737836843356, 0.651503744535148,
0.187739939894527, 0.951192815322429, 0.740037888288498, 0.0817571650259197,
0.740519099170342, 0.601534485351294, 0.120900869136676, 0.415282893227413,
0.591146623482928, 0.698511375114322, 0.08557975362055, 0.139396222075447,
0.303953414550051, 0.0743798329494894, 0.0293272000271827, 0.335832208395004,
0.665010208031163, 0.0319741254206747, 0.678886031731963, 0.154593498911709,
0.275712370406836, 0.828485634410754, 0.921500099124387, 0.651940459152684,
0.00574865937232971, 0.82236105017364, 0.55089360428974, 0.209424041677266,
0.861786168068647, 0.672873278381303, 0.301034058211371, 0.180336013436317,
0.481560358777642, 0.901354183442891, 0.986482679378241, 0.90117057505995,
0.476308439625427, 0.638073122361675, 0.27481731469743, 0.689271076582372,
0.324349449947476, 0.56620552809909, 0.867861548438668, 0.78374840435572,
0.0668482843320817, 0.276675389613956, 0.990600393852219, 0.990227151894942,
0.417612489778548, 0.391012848122045, 0.348758921027184, 0.0799746725242585,
0.88941288786009, 0.511429069796577, 0.0338982092216611, 0.240115304477513,
0.0268365524243563, 0.67206134647131, 0.816803207853809, 0.344421110814437,
0.864659120794386, 0.84128700569272, 0.116056860191748, 0.303730394458398,
0.48192183743231, 0.341675494797528, 0.0622653553728014, 0.823110743425786,
0.483212807681412, 0.968748248415068, 0.953057422768325, 0.116025703493506,
0.327919023809955, 0.590675016632304, 0.832283023977652, 0.342327545629814,
0.576901035616174, 0.942689201096073, 0.59300709143281, 0.565881528891623,
0.600007816683501, 0.133237989619374, 0.873827134957537, 0.744597729761153,
0.755133397178724, 0.0245723063126206, 0.97799762734212, 0.636845340020955,
0.73828601022251, 0.644093665992841, 0.57204390084371, 0.496023115236312,
0.703613247489557, 0.149237307952717, 0.0871439634356648, 0.0632112647872418,
0.83703236351721, 0.433215840253979, 0.430483993608505, 0.924051651498303,
0.913056606892496, 0.914889572421089, 0.215407102368772, 0.76880722376518,
0.269207723205909, 0.865548757137731, 0.28798541566357, 0.391722843516618,
0.649806497385725, 0.459413924254477, 0.907465039752424, 0.48731207777746,
0.554472463205457, 0.779784266138449, 0.566323830280453, 0.208658932242543,
0.958056638715789, 0.61858483706601, 0.838681482244283, 0.286310768220574,
0.895410191034898, 0.448722236789763, 0.297688684659079, 0.33291415637359,
0.0115265529602766, 0.850776052568108, 0.764857453294098, 0.469730701530352,
0.222089925780892, 0.0496484278701246, 0.32886885642074, 0.356443469878286,
0.612877089297399, 0.727906176587567, 0.0292073413729668, 0.429160050582141,
0.232313714455813, 0.678631312213838, 0.642334033036605, 0.99107678886503,
0.542449960019439, 0.835914565017447, 0.52798323193565, 0.303808332188055,
0.919654499506578, 0.944237019168213, 0.52141259261407, 0.794379767496139,
0.72268659202382, 0.114752230467275, 0.175116094760597, 0.437696389388293,
0.852590200025588, 0.511136321350932, 0.30879021063447, 0.174206420546398,
0.14262041519396, 0.375411552377045, 0.0204910831525922, 0.852757754037157,
0.631567053496838, 0.475924106314778, 0.508682047016919, 0.307679089019075,
0.70284536993131, 0.851252349093556, 0.0868967010173947, 0.586291917832568,
0.0529140203725547, 0.440692059928551, 0.207642213441432, 0.777513341512531,
0.141496006632224, 0.548626560717821, 0.419565241318196, 0.0702310993801802,
0.499403427587822, 0.189343606121838, 0.370725362794474, 0.888076487928629,
0.83070912421681, 0.466137421084568, 0.177098380634561, 0.91202046489343,
0.142300580162555, 0.823691181838512, 0.41561916610226, 0.939948018174618,
0.806491429451853, 0.795849160756916, 0.566376683535054, 0.36814984655939,
0.307756055146456, 0.602875682059675, 0.506007500691339, 0.538658684119582,
0.420845189364627, 0.663071365095675, 0.958144341595471, 0.793743418296799,
0.983086514985189, 0.266262857476249, 0.817585011478513, 0.122843299992383,
0.989197303075343, 0.71584410732612, 0.500571243464947, 0.397394519997761,
0.659465527161956, 0.459530522814021, 0.602246116613969, 0.250076721422374,
0.17533828667365, 0.6599256307818, 0.184704560553655, 0.15679649473168,
0.513444944983348, 0.205572377191857, 0.430164282443002, 0.131548407254741,
0.914019819349051, 0.935795902274549, 0.857401241315529, 0.977940042736009,
0.41389597626403, 0.179183913161978, 0.431347143370658, 0.477178965462372,
0.121315707685426, 0.107695729471743, 0.634954946814105, 0.859707030234858,
0.855825762730092, 0.708672808250412, 0.674073817208409, 0.672288877889514,
0.622144045541063, 0.433355041313916, 0.952878215815872, 0.229569894727319,
0.289388840552419, 0.937473804224283, 0.116283216979355, 0.659604362910613,
0.240837284363806, 0.726138337515295, 0.68390148691833, 0.381577257299796,
0.899390475358814, 0.26472729514353, 0.0383855854161084, 0.855232689995319,
0.655799814499915, 0.335587574867532, 0.163842789363116, 0.0353666560258716,
0.048316186061129), .Dim = c(30L, 10L))

I converted just the inner loop to mapply and did a quick speed test:
library(lmtest)
data <- matrix(runif(315*2781), nrow = 315)
get01 <- function(x, y) {
try(gt <- grangertest(x ~ y, order = 1, na.action = na.omit)$P[2])
if (exists("gt")) {
if (gt > 0.05 || is.na(gt)) {
return(0)
} else {
return(1)
}
} else {
return(0)
}
}
i <- 1; j <- 1
system.time(corr <- mapply(function(k) {get01(data[i:(i+35),j], data[i:(i+35),k])}, (1:2781)[-j]))
#> user system elapsed
#> 21.505 0.014 21.520
It would need to perform that mapply 778680 times, so that puts it at about 200 days. You'll either need a different approach with the Granger test or several cores. Here's the command to replace the full loop:
corr3 <- t(mapply(function(i) colSums(mapply(function(j) mapply(function(k) {get01(data[i:(i+35),j], data[i:(i+35),k])}, (1:2781)[-j]), 1:2781)), 1:280))
Replace that first mapply with simplify2array(parLapply to parallelize:
library(parallel)
cl <- makeCluster(detectCores())
clusterExport(cl, list("data", "get01"))
parLapply(cl, cl, function(x) require(lmtest))
corr3 <- t(simplify2array(parLapply(cl, 1:280, function(i) colSums(mapply(function(j) mapply(function(k) {get01(data[i:(i+35),j], data[i:(i+35),k])}, (1:2781)[-j]), 1:2781)))))
stopCluster(cl)

Here is a version, not parallelized, that speeds up the code in the question by a factor greater than 4.
Some bottlenecks in the question's code are easy to detect:
The matrices corr? are extended inside the loops. The solution is to reserve memory beforehand;
The test grangertest is called 3 times per inner iteration when only one is needed;
To cbind with 0 or 1 is in fact creating a vector, not a matrix.
Here is a comparative test between the question's code and the function below.
library(lmtest)
# avoids loading an extra package
is.error <- function(x){
inherits(x, c("error", "try-error"))
}
Lag <- 5L
nr <- nrow(data)
nc <- ncol(data)
t0 <- system.time({
corr<-NULL
corr2<-NULL
corr3<-NULL
for(i in 1:(nr - Lag))
{
corr2<-NULL
data3 <- data[i:(i + Lag), ]
for(j in 1:nc)
{
data2<-data[,-j]
corr<-NULL
for(k in 1:(nc - 1L))
{
ifelse((is.error(grangertest(data[i:(i+Lag),j] ~ data2[i:(i+Lag),k], order = 1, na.action = na.omit)$P[2])==TRUE) ||
(grangertest(data[i:(i+Lag),j] ~ data2[i:(i+Lag),k], order = 1, na.action = na.omit)$P[2])>0.05 ||
(is.na(grangertest(data[i:(i+Lag),j] ~ data2[i:(i+Lag),k], order = 1, na.action = na.omit)$P[2])==TRUE),
corr<-cbind(corr,0),
corr<-cbind(corr,1)
)
}
corr2 <- rbind(corr2, corr)
}
corr3<-rbind(corr3, rowSums(corr2))
}
corr3
})
I will use a simplified version of lmtest::grangertest.
granger_test <- function (x, y, order = 1, na.action = na.omit, ...) {
xnam <- deparse(substitute(x))
ynam <- deparse(substitute(y))
n <- length(x)
all <- cbind(x = x[-1], y = y[-1], x_1 = x[-n], y_1 = y[-n])
y <- as.vector(all[, 2])
lagX <- as.matrix(all[, (1:order + 2)])
lagY <- as.matrix(all[, (1:order + 2 + order)])
fm <- lm(y ~ lagY + lagX)
rval <- lmtest::waldtest(fm, 2, ...)
attr(rval, "heading") <- c("Granger causality test\n", paste("Model 1: ",
ynam, " ~ ", "Lags(", ynam, ", 1:", order, ") + Lags(",
xnam, ", 1:", order, ")\nModel 2: ", ynam, " ~ ", "Lags(",
ynam, ", 1:", order, ")", sep = ""))
rval
}
And now the function to run the tests.
f_Rui <- function(data, Lag){
nr <- nrow(data)
nc <- ncol(data)
corr3 <- matrix(0, nrow = nr - Lag, ncol = nc)
data3 <- matrix(0, nrow = Lag + 1L, ncol = nc)
data2 <- matrix(0, nrow = Lag + 1L, ncol = nc - 1L)
for(i in 1:(nr - Lag)) {
corr2 <- matrix(0, nrow = nc, ncol = nc - 1L)
data3[] <- data[i:(i + Lag), ]
for(j in 1:nc) {
corr <- integer(nc - 1L)
data2[] <- data3[, -j]
for(k in 1:(nc - 1L)){
res <- tryCatch(
grangertest(x = data2[, k], y = data3[, j], order = 1, na.action = na.omit),
error = function(e) e
)
if(!inherits(res, "error") && !is.na(res[['Pr(>F)']][2]) && res[['Pr(>F)']][2] <= 0.05) {
corr[k] <- 1L
}
}
corr2[j, ] <- corr
}
corr3[i, ] <- rowSums(corr2)
}
corr3
}
The results are identical and the timings much better.
t1 <- system.time({
res <- f_Rui(data, 5L)
})
identical(corr3, res)
#[1] TRUE
times <- rbind(t0, t1)
t(t(times)/t1)
# user.self sys.self elapsed user.child sys.child
#t0 4.682908 1.736111 4.707783 NaN NaN
#t1 1.000000 1.000000 1.000000 NaN NaN

Fast random sampling from matrix of cumulative probability mass functions in R

I have a matrix (mat_cdf) representing the cumulative probability an individual in census tract i moves to census tract j on a given day. Given a vector of agents who decide not to "stay home", I have a function, GetCTMove function below, to randomly sample from this matrix to determine which census tract they will spend time in.
# Random generation
cts <- 500
i <- rgamma(cts, 50, 1)
prop <- 1:cts
# Matrix where rows correspond to probability mass of column integer
mat <- do.call(rbind, lapply(i, function(i){dpois(prop, i)}))
# Convert to cumulative probability mass
mat_cdf <- matrix(NA, cts, cts)
for(i in 1:cts){
# Create cdf for row i
mat_cdf[i,] <- sapply(1:cts, function(j) sum(mat[i,1:j]))
}
GetCTMove <- function(agent_cts, ct_mat_cdf){
# Expand such that every agent has its own row corresponding to CDF of movement from their home ct i to j
mat_expand <- ct_mat_cdf[agent_cts,]
# Probabilistically sample column index for every row by generating random number and then determining corresponding closest column
s <- runif(length(agent_cts))
fin_col <- max.col(s < mat_expand, "first")
return(fin_col)
}
# Sample of 500,000 agents' residence ct
agents <- sample(1:cts, size = 500000, replace = T)
# Run function
system.time(GetCTMove(agents, mat_cdf))
user system elapsed
3.09 1.19 4.30
Working with 1 million agents, each sample takes ~10 seconds to run, multiplied by many time steps leads to hours for each simulation, and this function is by far the rate limiting factor of the model. I'm wondering if anyone has advice on faster implementation of this kind of random sampling. I've used the dqrng package to speed up random number generation, but that's relatively miniscule in comparison to the matrix expansion (mat_expand) and max.col calls which take longest to run.

The first thing that you can optimise is the following code:
max.col(s < mat_expand, "first")
Since s < mat_expand returns a logical matrix, applying the max.col function is the same as getting the first TRUE in each row. In this case, using which will be much more efficient. Also, as shown below, you store all your CDFs in a matrix.
mat <- do.call(rbind, lapply(i, function(i){dpois(prop, i)}))
mat_cdf <- matrix(NA, cts, cts)
for(i in 1:cts){
mat_cdf[i,] <- sapply(1:cts, function(j) sum(mat[i,1:j]))
}
This structure may not be optimal. A list structure is better for applying functions like which. It is also faster to run as you do not have to go through a do.call(rbind, ...).
# using a list structure to speed up the creation of cdfs
ls_cdf <- lapply(i, function(x) cumsum(dpois(prop, x)))
Below is your implementation:
# Implementation 1
GetCTMove <- function(agent_cts, ct_mat_cdf){
mat_expand <- ct_mat_cdf[agent_cts,]
s <- runif(length(agent_cts))
fin_col <- max.col(s < mat_expand, "first")
return(fin_col)
}
On my desktop, it takes about 2.68s to run.
> system.time(GetCTMove(agents, mat_cdf))
user system elapsed
2.25 0.41 2.68
With a list structure and a which function, the run time can be reduced by about 1s.
# Implementation 2
GetCTMove2 <- function(agent_cts, ls_cdf){
n <- length(agent_cts)
s <- runif(n)
out <- integer(n)
i <- 1L
while (i <= n) {
out[[i]] <- which(s[[i]] < ls_cdf[[agent_cts[[i]]]])[[1L]]
i <- i + 1L
}
out
}
> system.time(GetCTMove2(agents, ls_cdf))
user system elapsed
1.59 0.02 1.64
To my knowledge, with R only there is no other way to further speed up the code. However, you can indeed improve the performance by re-writing the key function GetCTMove in C++. With the Rcpp package, you can do something as follows:
# Implementation 3
Rcpp::cppFunction('NumericVector fast_GetCTMove(NumericVector agents, NumericVector s, List cdfs) {
int n = agents.size();
NumericVector out(n);
for (int i = 0; i < n; ++i) {
NumericVector cdf = as<NumericVector>(cdfs[agents[i] - 1]);
int m = cdf.size();
for (int j = 0; j < m; ++j) {
if (s[i] < cdf[j]) {
out[i] = j + 1;
break;
}
}
}
return out;
}')
GetCTMove3 <- function(agent_cts, ls_cdf){
s <- runif(length(agent_cts))
fast_GetCTMove(agent_cts, s, ls_cdf)
}
This implementation is lightning fast, which should fulfil all your needs.
> system.time(GetCTMove3(agents, ls_cdf))
user system elapsed
0.07 0.00 0.06
The full script is attached as follows:
# Random generation
cts <- 500
i <- rgamma(cts, 50, 1)
prop <- 1:cts
agents <- sample(1:cts, size = 500000, replace = T)
# using a list structure to speed up the creation of cdfs
ls_cdf <- lapply(i, function(x) cumsum(dpois(prop, x)))
# below is your code
mat <- do.call(rbind, lapply(i, function(i){dpois(prop, i)}))
mat_cdf <- matrix(NA, cts, cts)
for(i in 1:cts){
mat_cdf[i,] <- sapply(1:cts, function(j) sum(mat[i,1:j]))
}
# Implementation 1
GetCTMove <- function(agent_cts, ct_mat_cdf){
mat_expand <- ct_mat_cdf[agent_cts,]
s <- runif(length(agent_cts))
fin_col <- max.col(s < mat_expand, "first")
return(fin_col)
}
# Implementation 2
GetCTMove2 <- function(agent_cts, ls_cdf){
n <- length(agent_cts)
s <- runif(n)
out <- integer(n)
i <- 1L
while (i <= n) {
out[[i]] <- which(s[[i]] < ls_cdf[[agent_cts[[i]]]])[[1L]]
i <- i + 1L
}
out
}
# Implementation 3
Rcpp::cppFunction('NumericVector fast_GetCTMove(NumericVector agents, NumericVector s, List cdfs) {
int n = agents.size();
NumericVector out(n);
for (int i = 0; i < n; ++i) {
NumericVector cdf = as<NumericVector>(cdfs[agents[i] - 1]);
int m = cdf.size();
for (int j = 0; j < m; ++j) {
if (s[i] < cdf[j]) {
out[i] = j + 1;
break;
}
}
}
return out;
}')
GetCTMove3 <- function(agent_cts, ls_cdf){
s <- runif(length(agent_cts))
fast_GetCTMove(agent_cts, s, ls_cdf)
}
system.time(GetCTMove(agents, mat_cdf))
system.time(GetCTMove2(agents, ls_cdf))
system.time(GetCTMove3(agents, ls_cdf))

How to boost up a for-loop

I am suffering from slow for-loop execution in R. Here I provide a part of my code which is producing delay.
## subsitutes for original data
DC <- matrix(rnorm(10), ncol=101, nrow=6400)
C <- matrix(rnorm(20), ncol=101, nrow=6400)
N <- 80
Vcut <- ncol(DC)
V <- seq(-2.9,2.5,length=Vcut)
fNC <- matrix(NA, nrow=(N*N), ncol=Vcut)
fNDC <- matrix(NA, nrow=(N*N), ncol=Vcut)
Arbfunc <- function(dV){
b <- matrix(NA, nrow=1, ncol=Vcut)
for(i in 1:(N*N)) {
for (n in 1:Vcut) {
for (k in 1:Vcut) {
b[k] = (V[2]-V[1])*(exp((-1)*abs(V[k])))*exp(abs(V[n]-V[k])/dV)*(C[i,k]/V[k])
}
fNC[i,n] = exp(1*abs(V[n]))*(1/(2*dV))*(sum(b[]))
fNDC[i,n] = DC[i,n]/fNC[i,n]
}
}
}
Arbfunc(0.5)
Since I need to compare the results among the various values of dV's, this code should run at least within few seconds. But the result is
user system elapsed
40.15 0.03 40.24
which is too much slow for enough comparison. I tried several parallelization methods, but the result was not satisfactory (40 -> 25 secs although I used 11 threads in my pc).
Therefore, my guess is that the bottleneck is this for-loop itself, not a non-parallel code. Could you give me some advice to improve this for-loop or hint for parallelization ? Just a short comment would be grateful.

Big thanks to #Mikko Marttila for correcting functions 3 and 4 and providing the idea for function 5.
R is best approached with vectorized options instead of explicit loops. For instance, the inner loop with k:
for (k in 1:Vcut) {
b[k] = (V[2]-V[1])*(exp((-1)*abs(V[k])))*exp(abs(V[n]-V[k])/dV)*(C[i,k]/V[k])
}
That's the same as saying
(V[2]-V[1])*(exp((-1)*abs(V)))*exp(abs(V[n]-V)/dV)*(C[i,]/V)
This small change gives us a 500x performance boost for this part of the function:
Unit: microseconds
expr min lq mean median uq max neval
k_loop 13186.7 13603.2 14605.471 13832.9 14517.8 41935.1 100
k_vectorized 16.4 17.6 25.559 28.8 32.0 52.7 100
Now if we look at the outer loop with the i, we see that there's really no need to loop by each row. We could instead make a matrix for the the sum(b[k]) statement turning this:
(V[2]-V[1])*(exp((-1)*abs(V)))*exp(abs(V[n]-V)/dV)*(C[i,]/V)
Into this:
(V[2]-V[1])*(exp((-1)*abs(V)))*exp(abs(V[n]-V)/dV)*(t(C)/V)
That just saved us N*N*k loops. In your case, that's 646,400 loops.
To put it altogether, we would have:
Arbfunc3 <- function(dV){
for (n in 1:Vcut) {
sum_b = colSums((V[2]-V[1])*(exp((-1)*abs(V)))*exp(abs(V[n]-V)/dV)*(t(C)/V))
fNC[, n] = exp(1*abs(V[n]))*(1/(2*dV))*(sum_b)
fNDC[, n] = DC[,n]/fNC[,n]
}
}
My median time for microbenchmark is 750 milliseconds for this alternative.
To further improve performance, we need to address the V[n] - V. Thankfully, the R has a function - outer(V, V, '-') and this will produce a matrix with all combinations we need.
Arbfunc4 <- function(dV) {
sum_b = apply((V[2]-V[1])*(exp((-1)*abs(V)))*exp(abs(outer(V, V, '-')) / dV) / V, 2, function(x) colSums(x * t(C)))
fNC = exp(1*abs(V))*(1/(2*dV))*t(sum_b)
fNDC= DC/t(fNC)
fNDC
}
Thanks to #Mikko Marttila for a suggestion to get rid of apply with a dot product.
Arbfunc5 <- function(dV) {
a = (V[2] - V[1]) * exp(-abs(V)) * t(C) / V
b = exp(abs(outer(V, V, "-")) / dV) %*% a
fNC = exp(1*abs(V))*(1/(2*dV))*(b)
fNDC= DC/t(fNC)
fNDC
}
Here is the system.time for each solution (Arbfunc2 is the elimination of the k_loop). The optimized solution is 2,600 times faster than the original.
> system.time(Arbfunc(0.5))
user system elapsed
78.03 0.39 79.72
> system.time(Arbfunc2(0.5))
user system elapsed
10.41 0.03 10.46
> system.time(Arbfunc3(0.5))
user system elapsed
0.69 0.13 0.81
> system.time(Arbfunc4(0.5))
user system elapsed
0.43 0.05 0.47
> system.time(Arbfunc5(0.5))
user system elapsed
0.03 0.00 0.03
Final Edit: Here's the complete code that I ran after restarting R and emptying my environment. No errors:
## subsitutes for original data
DC <- matrix(rnorm(10), ncol=101, nrow=6400)
C <- matrix(rnorm(20), ncol=101, nrow=6400)
N <- 80
Vcut <- ncol(DC)
V <- seq(-2.9,2.5,length=Vcut)
# Unneeded for Arbfunc4 adn Arbfunc5
# Corrected from NA to NA_real_ to prevent coercion from logical to numeric
# h/t to #HenrikB
fNC <- matrix(NA_real_, nrow=(N*N), ncol=Vcut)
fNDC <- matrix(NA_real_, nrow=(N*N), ncol=Vcut)
Arbfunc <- function(dV){
b <- matrix(NA, nrow=1, ncol=Vcut)
for(i in 1:(N*N)) {
for (n in 1:Vcut) {
for (k in 1:Vcut) {
b[k] = (V[2]-V[1])*(exp((-1)*abs(V[k])))*exp(abs(V[n]-V[k])/dV)*(C[i,k]/V[k])
}
fNC[i,n] = exp(1*abs(V[n]))*(1/(2*dV))*(sum(b[]))
fNDC[i,n] = DC[i,n]/fNC[i,n]
}
}
fNDC
}
Arbfunc2 <- function(dV){
b <- matrix(NA, nrow=1, ncol=Vcut)
for(i in 1:(N*N)) {
for (n in 1:Vcut) {
sum_b = sum((V[2]-V[1])*(exp((-1)*abs(V)))*exp(abs(V[n]-V)/dV)*(C[i,]/V))
fNC[i,n] = exp(1*abs(V[n]))*(1/(2*dV))*(sum_b)
fNDC[i,n] = DC[i,n]/fNC[i,n]
}
}
fNDC
}
Arbfunc3 <- function(dV){
for (n in 1:Vcut) {
sum_b = colSums((V[2]-V[1])*(exp((-1)*abs(V)))*exp(abs(V[n]-V)/dV)*(t(C)/V))
fNC[, n] = exp(1*abs(V[n]))*(1/(2*dV))*(sum_b)
fNDC[, n] = DC[,n]/fNC[,n]
}
fNDC
}
Arbfunc4 <- function(dV) {
sum_b = apply((V[2]-V[1])*(exp((-1)*abs(V)))*exp(abs(outer(V, V, '-')) / dV) / V, 2, function(x) colSums(x * t(C)))
fNC = exp(1*abs(V))*(1/(2*dV))*t(sum_b)
DC/t(fNC)
}
Arbfunc5 <- function(dV) {
#h/t to Mikko Marttila for dot product
a = (V[2] - V[1]) * exp(-abs(V)) * t(C) / V
b = exp(abs(outer(V, V, "-")) / dV) %*% a
fNC = exp(1*abs(V))*(1/(2*dV))*(b)
DC/t(fNC)
}
#system.time(res <- Arbfunc(0.5))
system.time(res2 <- Arbfunc2(0.5))
system.time(res3 <- Arbfunc3(0.5))
system.time(res4 <- Arbfunc4(0.5))
system.time(res5 <- Arbfunc5(0.5))
all.equal(res2,res3,res4,res5)
As #HenrikB mentions, the fNC and fNDC initialize as logical matrices. That means we get a performance hit when coercing them to real matrices. Doing it the incorrect is a one-time hit of 1 ms for this dataset but if this coercion were in a loop, it could really add up.
mat_NA_real_ <- function() {
mat = matrix(NA_real_, nrow = 6400, ncol = 101)
mat[1,1] = 1
}
mat_NA <- function() {
mat = matrix(NA, nrow = 6400, ncol = 101)
mat[1,1] = 1
}
microbenchmark(mat_NA_real_(), mat_NA())
Unit: microseconds
expr min lq mean median uq max neval
mat_NA_real_() 979.5 992.25 1490.081 998.65 1021.1 7612.5 100
mat_NA() 1865.8 1883.30 3793.119 1911.30 5335.4 53635.2 100

Why isn't this foreach/dopar function using all the cores I asked for?

I'm trying to use doMC with foreach and %dopar%. Here is the function:
doTheMath_MC <- function(st, end, nd) {
print(getDoParWorkers())
if (st > end) stop("end must be larger than st")
# Helper function from stackoverflow.com/a/23158178/633251
tr <- function(x, prec = 0) trunc(x * 10^prec) / 10^prec
# Function to use with foreach
fef <- function(i, j, num, trpi) {
if (num[j] >= num[i]) return(NULL)
val <- num[i]/num[j]
if (!tr(val, nd) == trpi) return(NULL)
return(c(i, j, tr(val, nd)))
}
# Here we go...
nd <- nd - 1
trpi <- tr(pi, nd)
num <- st:end
ni <- length(num)
ans <- foreach(i = 1:ni, .combine = rbind) %:%
foreach(j = 1:ni, .combine = rbind) %dopar% {
fef(i, j, num, trpi)
}
cat("Done computing", paste("EST", st, end, nd+1, sep = "_"), "\n")
if (is.null(ans)) return(NULL)
ans <- as.matrix(na.omit(ans)) # probably not needed in MC version
return(ans) # c("num", "den", "est", "eff")
}
I've previously set up the cores and another function calls the function above (this info posted below, I don't think it is the problem). getDoParWorkers() reports that 7 cores have been assigned as intended. The cat statement verifies that the 2 'loops' are working correctly as far as output goes. However, only 1 core is used. Anyone see why? Mac OSX 10.10.2 and R 3.2 (2015-03-15 r67992). Finally, using doParallel to control everything gives the same result.
The steps which set up everything:
mn <- 1
mx <- 10000
jmp <- 1000
mc <- TRUE
if (mc) {
require("doMC")
registerDoMC(7)
}
st <- seq(mn -1, mx - jmp, jmp) + 1
end <- seq(mn - 1 + jmp, mx, jmp)
nd <- rep(1:15, each = mx/jmp) # watch the recycling
df <- data.frame(st = st, end = end, nd = nd)
for (i in 1:nrow(df)) {
findEsts(df$st[i], df$end[i], df$nd[i], MC = mc)
}

Sorry to answer my own question! I changed the dopar handling so that only the outer loop is made parallel:
ans <- foreach(i = 1:ni, .combine = rbind) %dopar%
for (j in 1:ni) {
fef(i, j, num, trpi)
}
And, I was simply not asking for enough iterations. For testing, I had been using mx = 10000 and jmp = 1000 (see original question). These were not large enough to trigger parallel processing apparently. Increasing each 10x was necessary to get parallel processing going. Thanks to the commenters!
NOTE: While the code above activates the parallel processing, it does not return the answer correctly. That will be the subject of another question.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

parallel nested foreach loops - r

Related

R: performance issues when computing mutual information matrix with NAs

How to minimize the unacceptably long run time of the created R code

Fast random sampling from matrix of cumulative probability mass functions in R

How to boost up a for-loop

Why isn't this foreach/dopar function using all the cores I asked for?

Categories

Resources