Parallel memory duplication/usage in R? - r

I have written a program to generate a very large amount of random multivariate distributed data (25 x 30 x 10 000 000) using mvtnorm, then do some simple calculations and manipulations on the matrices.
I am using the foreach and doParallel packages to run operations in parallel to reduce time. A completely arbitrary example, just to demonstrate the packages is:
foreach (x = matr) %dopar% {
x[time_horizon + 1] <- x[time_horizon]
x <- cbind(100,x)
for (m in 2:(time_horizon + 1)) {
# loop through each row of matrix to apply function
x[,m] <- x[,m-1] + x[,m]
}
return(x)
}
I have created an implicit cluster of cores to run these foreach functions on:
registerDoParallel(4)
The problem
When I run with multiple cores, it appears to multiply or duplicate the memory used when I monitor performance on Task Manager (i.e. 2 cores uses more memory than 1 core, 4 cores uses more memory than 2).
When I run my program for (25 x 30 x 1 000 000), running in parallel helps the speed of execution (i.e. 4 cores is faster than 1 core). However, when I run my program for (25 x 30 x 2 500 000) and above, too much memory is used and that appears to slow it down.
One friend said it could potentially be a page fault and the hard drive must be accessed when I run out of RAM.
Why is the duplication of memory across cores happening? Is it supposed to happen? Can I stop it? Are there other solutions?
Edit (Full Code):
library(mvtnorm)
library(foreach)
library(doParallel)
library(ggplot2)
library(reshape2)
library(plyr)
# Calculate the number of cores
no_cores <- detectCores()
# Create an implicit cluster and regular cluster
registerDoParallel(no_cores)
daily_pnl <- function() {
time_horizon <- 30
paths <- 2500000
asset <- 25
path_split <- 100
corr_mat <- diag(asset)
expected_returns <- runif(asset,0.0, 0.25)
# Create a list of vectors to store pnl information for each asset
foreach(icount(time_horizon), .packages = "mvtnorm") %dopar% {
average_matrix <- matrix(, (paths/path_split), asset)
split_start <- 1
my_day <- rmvnorm(paths, expected_returns, corr_mat, method="chol")
for (n in 1:(paths/path_split)) {
average_matrix[n,] <- colMeans(my_day[split_start:(split_start + path_split - 1),])
split_start <- split_start + path_split
}
return(average_matrix)
}
}
matrix_splitter <- function(matr) {
time_horizon <- 30
paths <- 2500000
path_split <- 100
asset <- 25
alply(array(unlist(daily), c(paths/path_split,time_horizon,asset)),3)
}
cum_returns <- function(matr) {
time_horizon <- 30
paths <- 2500000
asset <- 25
foreach (x = matr) %dopar% {
x[time_horizon + 1] <- x[time_horizon]
x <- cbind(100,x)
for (m in 2:(time_horizon + 1)) {
# loop through each row of matrix to apply function
x[,m] <- x[,m-1] + x[,m]
}
return(x)
}
}
plotting <- function(path_matr) {
security_paths <- as.data.frame(t(path_matr))
security_paths$id <- 1:nrow(security_paths)
plot_paths <- melt(security_paths, id.var="id")
ggplot(plot_paths, aes(x=id, y=value,group=variable,colour=variable)) +
geom_line(aes(lty=variable))
}
system.time(daily <- daily_pnl())
system.time(daily_by_security <- matrix_splitter(daily))
rm(daily)
gc()
system.time(security_paths <- cum_returns(daily_by_security))
rm(daily_by_security)
gc()
plot_list <- foreach(x = security_paths, .packages = c("reshape2", "ggplot2")) %dopar% {
if (nrow(x) > 100) {
plotting(head(x,100))
} else {
plotting(x)
}
}
#Stop implicit cluster and regular cluster
stopImplicitCluster()
gc()

This seems to be a really old problem. I am having a similar issue. I don't need compute parallelization I actually need memory parallelization. (if such a thing can exist)
what works for me is azure do parallel. instead of registering system cores register cores from the cloud using registerDoAzureParallel(cluster)
your json will define the size of the machines (memory) you hire for the job. make sure each worker has enough memory to get a copy of your r environment. This will probably kill your network. You will be sending data to 30 -40 (depending on how many you have asked for) workers from your machine.
more documentation here.
https://github.com/Azure/doAzureParallel
Can we do something with sparklyr to address such issues?

Related

Parallelized `Find` loop in R

There are several packages in R to simplify running code in parallel, like foreach and future. Most of these have constructs which are like lapply or a for loop: they carry on until all the tasks have finished.
Is there a simple parallel version of Find? That is, I would like to run several tasks in parallel. I don't need all of them to finish, I just need to get the first one that finishes (maybe with a particular result). After that the other tasks can be killed, or left to finish on their own.
Conceptual code:
hunt_needle <- function (x, y) x %in% (y-1000):y
x <- sample.int(1000000, 1)
result <- parallel_find(seq(1000, 1000000, 1000), hunt_needle)
# should return the first value for which hunt_needle is true
You can use shared memory so that processes can communicate with one another.
For that, you can use package bigstatsr (disclaimer: I'm the author).
Choose a block size and do:
# devtools::install_github("privefl/bigstatsr")
library(bigstatsr)
# Data example
cond <- logical(1e6)
cond[sample(length(cond), size = 1)] <- TRUE
ind.block <- bigstatsr:::CutBySize(length(cond), block.size = 1000)
cl <- parallel::makeCluster(nb_cores())
doParallel::registerDoParallel(cl)
# This value (in an on-disk matrix) is shared by processes
found_it <- FBM(1, 1, type = "integer", init = 0L)
library(foreach)
res <- foreach(ic = sample(rows_along(ind.block)), .combine = 'c') %dopar% {
if (found_it[1]) return(NULL)
ind <- bigstatsr:::seq2(ind.block[ic, ])
find <- which(cond[ind])
if (length(find)) {
found_it[1] <- 1L
return(ind[find[1]])
} else {
return(NULL)
}
}
parallel::stopCluster(cl)
# Verification
all.equal(res, which(cond))
Basically, when a solution is found, you don't need to do some computations anymore, and others know it because you put a 1 in found_it which is shared between all processes.
As your question is not reproducible and I don't understand everything you need, you may have to adapt this solution a little bit.

Parallel process in chunks giving no performance benefits

I have a very huge list ( huge_list ) . A function (inner_fun) is called for each value of the list. Inner_fun takes around .5 seconds.output of inner_fun is a simple numeric vector of size 3. I am trying to parallelise this approach. After going through many articles , it was mentioned that it is better to divide in chunks when the parallel function is very quick. So i divided it based on cores. But there is no time benefit. I am not able to understand the concept here . Can anyone give few insights on this. My major concern is that if i am doing something wrong with the code. I am not posting exact codes here. but i have tried to replicate as much as possible
few observations :
dummy_fun and dummy_fun2 takes around 10 hrs with parallel kept as
11
with no parallel , this goes around 20 hrs.
with parallel=2 ,it completes in 15 hrs
I am using 12 cores , 60 GB RAM , ubuntu machine
Code to make cluster
no_of_clusters<-detectCores()-1
cl <- makeCluster(no_of_clusters) ; registerDoParallel(cl) ;
clusterExport(cl, varlist=c("arg1","arg2","inner_fun"))
Function without chunks
dummy_fun<-function(arg1,arg2,huge_list){
g <- foreach (i = 1: length(huge_list),.combine=rbind,
.multicombine=TRUE) %dopar% {
inner_fun(i,arg1,arg2,huge_list[i])
}
return(g)
}
**Functions with chunks **
dummy_fun2<-function(arg1,arg2,huge_list){
il<-1:length(huge_list)
il2<-split(il, ceiling(seq_along(il)/(length(il)/(detectCores()-1))))
g <- foreach ( i= il2 , .combine=rbind,.multicombine=TRUE) %dopar% {
ab1<-lapply(i,function(li)
{
inner_fun(i,arg1,arg2,huge_list(i))
}
)
do.call(rbind,ab1)
}
return(g)
}
You got the chunks wrong. It's not about dividing the indices in chunks of length no_of_clusters but rather to divide them in no_of_clusters chunks.
Try this out:
dummy_fun2 <- function(arg1, arg2, huge_list, inner_fun, ncores) {
cl <- parallel::makeCluster(ncores)
doParallel::registerDoParallel(cl)
on.exit(parallel::stopCluster(cl), add = TRUE)
L <- length(huge_list)
inds <- split(seq_len(L), sort(rep_len(seq_len(NCORES), L)))
foreach(l = seq_along(inds), .combine = rbind) %dopar% {
ab1 <- lapply(inds[[l]], function(i) {
inner_fun(i, arg1, arg2, huge_list[i])
})
do.call(rbind, ab1)
}
}
Further remarks:
it's often useless to use more than half of the cores you have on your computer.
the option .multicombine is automatically used with rbind. But the .maxcombine is really important (need more than 100). Here, we use lapply for the sequential part so this remark doesn't matter.
it's useless to have many exports when using foreach, it already exports what is necessary from the environment of dummy_fun2.
are you sure you want to use huge_list[i] (get a list of one element) rather than huge_list[[i]] (get the i-th element of the list)?

Identify if in parallel call

I have a generic chunking function that breaks big calls into smaller pieces and runs them in parallel.
chunk_it <- function(d, n, some_fun) {
# run n chunks of d in parallel
dat <- foreach(...) %doPar% {
some_fun(...)
}
}
I want to make it so that this generic chunking function can identify if it's being called by a process that's already in parallel (chunked in my terminology)
chunked_highlevel <- function(d, n, some_fun) {
# run n chunks of d in parallel
...
chunk_it(lowerlevel_d, n) # do not chunk!
}
What I would like to happen here is that if I have chunked the process at a higher level, that it does not activate the chunking function at the lower level.
Is there a way to identify when you're already inside a parallel process?
So, that we could code like this:
chunk_it <- function(d, n, some_fun) {
# run n chunks of d in parallel
if(!already_parallel) {
dat <- foreach(...) %doPar% {
some_fun(...)
}
} else {
dat <- some_fun()
}
}
I don't think there's an official way of doing this. However, in general there should be code evident in the call stack which makes it obvious whether you're in parallel code. What I've got so far looks like this. It seems to work for doSNOW with either MPI or SOCK, but will probably need adjustment for other packages that implement %dopar%. It's also dependent on some internal details of snow which may be subject to change in future versions.
library(doSNOW)
library(foreach)
my_fn <- function(bit) {
is_parallel <- any(unlist(lapply(sys.calls(), function(cal) {
as.character(cal[[1]]) %in% c("slaveLoop", "%dopar%")
})))
is_parallel
}
foreach(x = 1:2) %do% my_fn(x)
# [[1]]
# [1] FALSE
#
# [[2]]
# [1] FALSE
cl <- makeCluster(2)
registerDoSNOW()
foreach(x = 1:2) %dopar% my_fn(x)
# [[1]]
# [1] TRUE
#
# [[2]]
# [1] TRUE
The future package (I'm the author) has built in support for nested parallelism so that you do not have to worry about it as a developer while still giving the end user full power to control how and where parallelization is taking place.
Here's an example from one of the future vignettes:
library("future")
library("listenv")
x <- listenv()
for (ii in 1:3) {
x[[ii]] %<-% {
y <- listenv()
for (jj in 1:3) {
y[[jj]] %<-% { ii + jj/10 }
}
y
}
}
unlist(x)
## [1] 1.1 1.2 1.3 2.1 2.2 2.3 3.1 3.2 3.3
Note how there are two-layers of future assignments (%<-%). The default is to always process them sequentially unless specificiation says otherwise. For instance, to process the outer loop of future assignments in parallel on your local machine, use:
plan(multiprocess)
This will cause x[[ii]] %<-% { ... } for ii = 1, 2, 3 to run in parallel, while the contained y[[jj]] %<-% { ... } will run sequentially. The equivalent fully explicit setting for this is:
plan(list(multiprocess, sequential))
Now, if you want to run the outer loop of futures (x[[ii]]) sequentially and the inner loop of futures (y[[jj]]) in parallel, you can specify:
plan(list(sequential, multiprocess))
before running the code.
BTW, the number of parallel processes used with multiprocess is future::availableCores(). Think of it as parallel::detectCores() but that is also agile to mc.cores, HPC cluster environments etc. Importantly, future::availableCores() will return 1 if it's already running in parallel ("is a parallel child"). This means that if you do:
plan(list(multiprocess, multiprocess))
the inner layer of futures will actually only see a single core. You can think of this as a built-in automatic protection from creating a huge number of parallel processes by mistake through recursive parallelism.
You can force a different setting though (but not recommended). For instance, say you want the outer layer to run four parallel tasks at the same time, and each of those tasks to run two parallel tasks at the same time (on your local machine), then you can use:
plan(list(
tweak(multiprocess, workers = 4L),
tweak(multiprocess, workers = 2L)
))
That will run at most 4*2 = 8 parallel tasks at the same time (plus the master process).
If you have a set of machines available, you can do:
plan(list(
tweak(cliuster, workers = c("machine1", "machine2", "machine3")),
multiprocess
))
that will distribute the outer layer of futures (x[[ii]]) to those three machines, and the inner layer of futures (y[[ii]]) will run in parallel using all the available cores on those machines.
Note how the code doesn't change - only the settings (= plan() call). This is in the spirit of "write once, run wherever". There are many different future-strategy setups you can use; see the vignettes of the future package.
Now, what if you wanna use foreach()? You can use the doFuture %dopar% adapter that works on top of the future framework. For example,
library("doFuture")
registerDoFuture()
some_fun <- function(j) {
list(j = j, pid.j = Sys.getpid())
}
my_fun <- function(i) {
y <- foreach(j = 1:3) %dopar% { some_fun(j = j) }
list(i = i, pid.i = Sys.getpid(), y = y)
}
x <- foreach(i = 1:3) %dopar% { my_fun(i = i) }
Run the above and look at str(x) and its different PIDs for the different plan():s exemplified above. That'll illustrate what's going on.
Hope this helps

Shared memory in parallel foreach in R

Problem Description:
I have a big matrix c, loaded in RAM memory. My goal is through parallel processing to have read only access to it. However when I create the connections either I use doSNOW, doMPI, big.matrix, etc the amount to ram used increases dramatically.
Is there a way to properly create a shared memory, where all the processes may read from, without creating a local copy of all the data?
Example:
libs<-function(libraries){# Installs missing libraries and then load them
for (lib in libraries){
if( !is.element(lib, .packages(all.available = TRUE)) ) {
install.packages(lib)
}
library(lib,character.only = TRUE)
}
}
libra<-list("foreach","parallel","doSNOW","bigmemory")
libs(libra)
#create a matrix of size 1GB aproximatelly
c<-matrix(runif(10000^2),10000,10000)
#convert it to bigmatrix
x<-as.big.matrix(c)
# get a description of the matrix
mdesc <- describe(x)
# Create the required connections
cl <- makeCluster(detectCores ())
registerDoSNOW(cl)
out<-foreach(linID = 1:10, .combine=c) %dopar% {
#load bigmemory
require(bigmemory)
# attach the matrix via shared memory??
m <- attach.big.matrix(mdesc)
#dummy expression to test data aquisition
c<-m[1,1]
}
closeAllConnections()
RAM:
in the image above, you may find that the memory increases a lot until foreach ends and it is freed.
I think the solution to the problem can be seen from the post of Steve Weston, the author of the foreach package, here. There he states:
The doParallel package will auto-export variables to the workers that are referenced in the foreach loop.
So I think the problem is that in your code your big matrix c is referenced in the assignment c<-m[1,1]. Just try xyz <- m[1,1] instead and see what happens.
Here is an example with a file-backed big.matrix:
#create a matrix of size 1GB aproximatelly
n <- 10000
m <- 10000
c <- matrix(runif(n*m),n,m)
#convert it to bigmatrix
x <- as.big.matrix(x = c, type = "double",
separated = FALSE,
backingfile = "example.bin",
descriptorfile = "example.desc")
# get a description of the matrix
mdesc <- describe(x)
# Create the required connections
cl <- makeCluster(detectCores ())
registerDoSNOW(cl)
## 1) No referencing
out <- foreach(linID = 1:4, .combine=c) %dopar% {
t <- attach.big.matrix("example.desc")
for (i in seq_len(30L)) {
for (j in seq_len(m)) {
y <- t[i,j]
}
}
return(0L)
}
## 2) Referencing
out <- foreach(linID = 1:4, .combine=c) %dopar% {
invisible(c) ## c is referenced and thus exported to workers
t <- attach.big.matrix("example.desc")
for (i in seq_len(30L)) {
for (j in seq_len(m)) {
y <- t[i,j]
}
}
return(0L)
}
closeAllConnections()
Alternatively, if you are on Linux/Mac and you want a CoW shared memory, use forks. First load all your data into the main thread, and then launch working threads (forks) with general function mcparallel from the parallel package.
You can collect their results with mccollect or with the use of truly shared memory using the Rdsm library, like this:
library(parallel)
library(bigmemory) #for shared variables
shared<-bigmemory::big.matrix(nrow = size, ncol = 1, type = 'double')
shared[1]<-1 #Init shared memory with some number
job<-mcparallel({shared[1]<-23}) #...change it in another forked thread
shared[1,1] #...and confirm that it gets changed
# [1] 23
You can confirm, that the value really gets updated in backgruound, if you delay the write:
fn<-function()
{
Sys.sleep(1) #One second delay
shared[1]<-11
}
job<-mcparallel(fn())
shared[1] #Execute immediately after last command
# [1] 23
aaa[1,1] #Execute after one second
# [1] 11
mccollect() #To destroy all forked processes (and possibly collect their output)
To control for concurency and avoid race conditions use locks:
library(synchronicity) #for locks
m<-boost.mutex() #Lets create a mutex "m"
bad.incr<-function() #This function doesn't protect the shared resource with locks:
{
a<-shared[1]
Sys.sleep(1)
shared[1]<-a+1
}
good.incr<-function()
{
lock(m)
a<-shared[1]
Sys.sleep(1)
shared[1]<-a+1
unlock(m)
}
shared[1]<-1
for (i in 1:5) job<-mcparallel(bad.incr())
shared[1] #You can verify, that the value didn't get increased 5 times due to race conditions
mccollect() #To clear all threads, not to get the values
shared[1]<-1
for (i in 1:5) job<-mcparallel(good.incr())
shared[1] #As expected, eventualy after 5 seconds of waiting you get the 6
#[1] 6
mccollect()
Edit:
I simplified dependencies a bit by exchanging Rdsm::mgrmakevar into bigmemory::big.matrix. mgrmakevar internally calls big.matrix anyway, and we don't need anything more.

Parallel RJAGS with convergence testing

I'm modifying an existing model using RJAGS. I'd like to run chains in parallel, and occasionally check the Gelman-Rubin convergence diagnostic to see if I need to keep running. The problem is, if I need to resume running based on the diagnostic value, the recompiled chains restart from the first initialized prior values and not the position in parameter space where the chain stopped. If I do not recompile the model, RJAGS complains. Is there a way to store the positions of the chains when they stop so I can re-initialize from where I left off? Here I'll give a very simplified example.
example1.bug:
model {
for (i in 1:N) {
x[i] ~ dnorm(mu,tau)
}
mu ~ dnorm(0,0.0001)
tau <- pow(sigma,-2)
sigma ~ dunif(0,100)
}
parallel_test.R:
#Make some fake data
N <- 1000
x <- rnorm(N,0,5)
write.table(x,
file='example1.data',
row.names=FALSE,
col.names=FALSE)
library('rjags')
library('doParallel')
library('random')
nchains <- 4
c1 <- makeCluster(nchains)
registerDoParallel(c1)
jags=list()
for (i in 1:getDoParWorkers()){
jags[[i]] <- jags.model('example1.bug',
data=list('x'=x,'N'=N))
}
# Function to combine multiple mcmc lists into a single one
mcmc.combine <- function( ... ){
return( as.mcmc.list( sapply( list( ... ),mcmc ) ) )
}
#Start with some burn-in
jags.parsamples <- foreach( i=1:getDoParWorkers(),
.inorder=FALSE,
.packages=c('rjags','random'),
.combine='mcmc.combine',
.multicombine=TRUE) %dopar%
{
jags[[i]]$recompile()
update(jags[[i]],100)
jags.samples <- coda.samples(jags[[i]],c('mu','tau'),100)
return(jags.samples)
}
#Check the diagnostic output
print(gelman.diag(jags.parsamples[,'mu']))
counter <- 0
#my model doesn't converge so quickly, so let's simulate doing
#this updating 5 times:
#while(gelman.diag(jags.parsamples[,'mu'])[[1]][[2]] > 1.04)
while(counter < 5)
{
counter <- counter + 1
jags.parsamples <- foreach(i=1:getDoParWorkers(),
.inorder=FALSE,
.packages=c('rjags','random'),
.combine='mcmc.combine',
.multicombine=TRUE) %dopar%
{
#Here I lose the progress I've made
jags[[i]]$recompile()
jags.samples <- coda.samples(jags[[i]],c('mu','tau'),100)
return(jags.samples)
}
}
print(gelman.diag(jags.parsamples[,'mu']))
print(summary(jags.parsamples))
stopCluster(c1)
In the output, I see:
Iterations = 1001:2000
where I know there should be > 5000 iterations.
(cross-posted to stats.stackexchange.com, which may be the more appropriate venue)
Every time your JAGS model runs on the worker nodes the coda samples are returned but the state of the model is lost. So next time it recompiles, it restarts from the beginning, as you are seeing. To get around this you need to get and return the state of the model in your function (on the worker nodes) like so:
endstate <- jags[[i]]$state(internal=TRUE)
Then you need to pass this back to the worker node and re-generate the model within the worker function using jags.model() with inits=endstate (for the appropriate chain).
I would actually recommend looking at the runjags package that does all this for you. For example:
library('runjags')
parsamples <- run.jags('example1.bug', data=list('x'=x,'N'=N), monitor=c('mu','tau'), sample=100, method='rjparallel')
summary(parsamples)
newparsamples <- extend.jags(parsamples, sample=100)
summary(parsamples)
# etc
Or even:
parsamples <- autorun.jags('example1.bug', data=list('x'=x,'N'=N), monitor=c('mu','tau'), method='rjparallel')
Version 2 of runjags will hopefully be uploaded to CRAN soon, but for now you can download binaries from: https://sourceforge.net/projects/runjags/files/runjags/
Matt

Resources