I am trying the following code, which uses a foreach loop to normalize the rows of a matrix A:
library(doParallel)
library(tictoc)
A <- matrix(1.0, 5000, 1000)
cl <- makeCluster(2)
registerDoParallel(cl)
gcinfo(TRUE)
tic()
res1 <- foreach(i=1:nrow(A), .combine='rbind') %dopar% (A[i,]/mean(A[i,]))
toc()
gcinfo(FALSE)
stopCluster(cl)
From RStudio, I can see that the size of the matrix A is ~38 MB. But when I run the script above, the garbage collector reports the following values:
36.6 Mbytes of cons cells used (56%)
982.1 Mbytes of vectors used (33%)
What I don't understand is where all the memory went. In fact, the code above runs faster with a single worker (%do%) than with 2 workers (%dopar%). Do you know the reason for the large memory usage of this script?
Thanks
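A possible direction, sketched under assumptions rather than taken from the post: the loop above creates 5000 tiny tasks, each returning a single row that is then stacked with rbind. One alternative is to send each worker one block of rows and let it normalize the block with vectorized code. The sketch below uses only the base parallel package (splitIndices and parLapply) instead of foreach, and a small stand-in matrix; whether it lowers memory use for the original 5000 x 1000 case would need to be measured.

```r
library(parallel)

# Small stand-in for the 5000 x 1000 matrix from the question (assumption)
A <- matrix(runif(200 * 50), 200, 50)

cl <- makeCluster(2)

# One block of row indices per worker
row_blocks <- splitIndices(nrow(A), length(cl))
blocks <- lapply(row_blocks, function(idx) A[idx, , drop = FALSE])

# Each worker divides every row of its block by that row's mean
parts <- parLapply(cl, blocks, function(block) block / rowMeans(block))
stopCluster(cl)

# Only one rbind per worker instead of one per row
res_chunked <- do.call(rbind, parts)

# Serial, fully vectorized reference
res_serial <- A / rowMeans(A)
```

With this layout only two results travel back to the master process, one per worker.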
Related
I have an R script that simulates ARIMA data and checks the ARIMA(p, d, q) order of the simulated data 100 times. I have 2 cores on my system's CPU. How can I tell R to have one core compute iterations 1 to 50 while the second core computes 51 to 100 simultaneously, and then combine the results?
library(forecast)
system.time({
  for (i in 1:100) {
    a <- arima.sim(n = 50, model=list(ar = 0.8), sd = 1)
    # fit the simulated series a and extract its (p, d, q) order
    b <- arimaorder(auto.arima(a, ic = "aicc"))
    #print(b)
  }
})
I am using Windows 10, 64-bit.
I use the foreach and doParallel libraries to divide the for loop into many parts.
I believe it is better to let the computer decide how to divide the iterations between the available cores.
#…
library(parallel)
library(foreach)
library(doParallel)
#detectCores() ### Count number of cores available
numCores <- 2
registerDoParallel(numCores)
#for (i in 1:100) { ### Original For loop
foreach(i = 1:100) %dopar% { ### Replacement parallel foreach loop
#…
}
#…
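To make the 1 to 50 / 51 to 100 split explicit, here is a minimal sketch using the base parallel package. fit_one is a hypothetical stand-in: it simulates the same AR(1) series but selects the order with stats::ar() rather than forecast::auto.arima(), so that the example runs without extra packages. For the real task, swap in the auto.arima call and load forecast on the workers (for example with clusterEvalQ).

```r
library(parallel)

# Hypothetical stand-in for one simulation step: simulate an AR(1) series
# and return the AR order selected by AIC (stats::ar, not auto.arima)
fit_one <- function(i) {
  a <- arima.sim(n = 50, model = list(ar = 0.8), sd = 1)
  as.integer(ar(a, order.max = 5, aic = TRUE)$order)[1]
}

cl <- makeCluster(2)                 # one worker per core
clusterSetRNGStream(cl, 123)         # independent random streams per worker
clusterExport(cl, "fit_one")         # make the function visible on the workers

# Worker 1 gets iterations 1..50, worker 2 gets 51..100
blocks <- list(1:50, 51:100)
res <- parLapply(cl, blocks, function(idx) vapply(idx, fit_one, integer(1)))
stopCluster(cl)

# Combine the two halves into one vector of 100 selected orders
orders <- unlist(res)
```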
Firstly, I would like to say that I am new to this topic.
Secondly, although I have read a lot about parallel processing in R, I'm still not confident about it.
I just made up a simulation in R. Can someone help me use this made-up code to understand parallel processing, so that I can see how it works?
My code is as follows (large random numbers):
SimulateFn<-function(B,n){
M1=list()
for (i in 1:B){
M1[i]=(n^2)}
return(M1)}
SimulateFn(100000000,300000)
Could you please help me?
First of all, parallelization means dividing a task into subtasks that are processed simultaneously by multiple processors or cores. The subtasks can be independent or share some dependency between them; the latter case needs more planning and attention.
Scheduling the subtasks carries some overhead, such as copying data to each processor. For that reason, parallelization rarely pays off for fast computations. In your example, the three main operations are indexing ([), assignment (<-), and a (fast) math operation (^). The overhead of parallelization may be greater than the time needed to execute the subtask, in which case parallelization results in poorer performance!
Despite that, simple parallelization in R is fairly easy. One approach to parallelizing your task is provided below, using the doParallel package. Other approaches include using packages such as parallel.
library(doParallel)
## choose number of processors/cores
cl <- makeCluster(2)
registerDoParallel(cl)
## record the elapsed time to evaluate the code snippet
## %dopar% executes the code in parallel
## foreach returns its results as a list; assigning to M1[i] inside
## %dopar% would only change a copy on the worker
B <- 100000; n <- 300000
ptime <- system.time({
M1 <- foreach(i=1:B) %dopar% {
n^2
}
})
## %do% executes sequentially
stime <- system.time({
M1 <- foreach(i=1:B) %do% {
n^2
}
})
The elapsed times on my computer (2 cores) were 59.472 and 44.932 seconds, respectively. Clearly, there was no improvement from parallelization: indeed, performance was worse!
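The per-task overhead can also be looked at in isolation. The sketch below is an illustration under assumptions (base parallel package, 2 workers, 2000 trivial tasks): clusterApply sends each element to a worker as its own task, so nearly all of the parallel run is communication. It prints the two elapsed times without claiming specific values, since those depend on the machine.

```r
library(parallel)

cl <- makeCluster(2)
x <- 1:2000

# Serial: 2000 trivial squarings in the current process
t_serial <- system.time(r1 <- lapply(x, function(i) i^2))["elapsed"]

# Parallel: the same 2000 squarings, one round trip to a worker per element
t_par <- system.time(r2 <- clusterApply(cl, x, function(i) i^2))["elapsed"]

stopCluster(cl)

# Same results either way; only the cost of getting them differs
cat("serial:", t_serial, "s  parallel:", t_par, "s\n")
```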
A better example is shown below, where the main task is computationally much more expensive:
x <- iris[which(iris[,5] != "setosa"), c(1,5)]
trials <- 10000
ptime <- system.time({
r <- foreach(icount(trials), .combine=cbind) %dopar% {
ind <- sample(100, 100, replace=TRUE)
result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
coefficients(result1)
}
})
stime <- system.time({
r <- foreach(icount(trials), .combine=cbind) %do% {
ind <- sample(100, 100, replace=TRUE)
result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
coefficients(result1)
}
})
## release the workers once both timings are done
stopCluster(cl)
The elapsed times were 24.709 and 34.502 seconds, respectively: a gain of about 28%.
Clusterception
I need to run an expensive algorithm 320 times. I can easily parallelize it over 4 clusters, each running 80 iterations. I have access to a machine with 32 cores, so I want to parallelize the problem further. Parallelizing the groups of 80 iterations is trickier, but possible. My idea is to run 8 sub-clusters on each main cluster, each processing 10 iterations.
To test the idea, I have implemented the following dummy code using the parallel package in R. I have tested it on 4 cores and was surprised by the results.
Normal variables generator
Method 1: Parallelize once over 2 clusters
library(parallel)
library(tictoc)
N <- 6*10^8 # total number of observations to generate
n.threads <- 2
sample.sizes <- rep(round(N/n.threads, 0), n.threads) # Each cluster generates half of the sample
tic()
cl <- makeCluster(n.threads)
list <- parLapply(cl, sample.sizes, rnorm)
stopCluster(cl)
v <- unlist(list)
toc()
rm(list, v); gc()
36 sec exec time; 50% CPU usage
Method 2: Parallelize first over 2 clusters, and each cluster is further parallelized over 2 clusters
library(parallel)
library(tictoc)
N <- 6*10^8 # total number of observations to generate
rnorm.parallel <- function(N.inside){
n.threads.inside <- 2
sample.sizes <- rep(round(N.inside/n.threads.inside, 0), n.threads.inside) # each sub thread generates 1.5*10^8 obs
cl.inside <- makeCluster(n.threads.inside)
list <- parLapply(cl.inside, sample.sizes, rnorm)
stopCluster(cl.inside)
v <- unlist(list)
return(v)
}
n.threads <- 2
sample.sizes <- rep(round(N/n.threads, 0), n.threads) # each main thread generates 3*10^8 obs
tic()
cl <- makeCluster(length(sample.sizes))
clusterEvalQ(cl, library(parallel))
list <- parLapply(cl, sample.sizes, rnorm.parallel)
stopCluster(cl)
v <- unlist(list)
toc()
rm(list, v); gc()
42 sec exec time; 100% CPU usage
My conclusion is that although the technique of running clusters inside a cluster works, the additional overhead makes it less efficient. Is there another package/technique that could help me?
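For comparison, a flat layout is sketched below: one cluster with 4 workers (the number of cores used in the test above), each generating a quarter of the sample, with no clusters started from inside workers. This is only an assumption about what might suit the real problem, N is reduced so that it runs quickly, and timings would need to be measured on the target machine.

```r
library(parallel)

N <- 4e5          # reduced from 6*10^8 so the sketch runs quickly (assumption)
n.workers <- 4    # one worker per core, no nesting

# Every worker generates an equal share of the sample
sample.sizes <- rep(N / n.workers, n.workers)

cl <- makeCluster(n.workers)
clusterSetRNGStream(cl, 42)          # independent random streams per worker
parts <- parLapply(cl, sample.sizes, rnorm)
stopCluster(cl)

v <- unlist(parts)
```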
I am having issues with iterations over nleqslv in R, in that the solver does not appear to clean up memory used in previous iterations. I've isolated the issue in a small sample of code:
library(nleqslv)
cons_ext_test <- function(x){
rows_x <- length(x)/2
x_1 <- x[1:rows_x]
x_2 <- x[(rows_x+1):(rows_x*2)]
eq1<- x_1-100
eq2<-x_2*10-40
return(c(eq1,eq2))
}
model_test <- function()
{
reserves<-(c(0:200)/200)^(2)*2000
lambda <- numeric(NROW(reserves))+5
res_ext <- pmin((reserves*.5),5)
x_test <- c(res_ext,lambda)
#print(x_test)
for(test_iter in c(1:1000))
nleqslv(x_test,cons_ext_test,jacobian=NULL)
i<- sort( sapply(ls(),function(x){object.size(get(x))}))
print(i[(NROW(i)-5):NROW(i)])
}
model_test()
When I run this over 1000 iterations, memory use ramps up to over 2 GB, while running it with 10 iterations uses far less memory, only 92 MB.
Running it once leaves my rsession using 62 MB, so memory allocation grows with the number of iterations.
Even after 1000 iterations, with 2+ GB of memory used by the R session, no large-sized objects are listed.
test_iter lambda res_ext reserves x_test
48 1648 1648 1648 3256
Any help would be much appreciated.
AJL
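One way to narrow this down, offered as a sketch rather than a diagnosis: object.size() over ls() only sees objects in the function's environment, while gc() reports what R's own heap currently holds. The hypothetical helper track_memory below calls a function repeatedly and records the Mb in use after each garbage collection; the workload is a placeholder, to be replaced by the nleqslv call. If these numbers stay flat while the operating system shows the process growing, the memory is probably being held somewhere that R-level objects do not account for.

```r
# Hypothetical helper: run f() `iters` times and record R heap usage in Mb
track_memory <- function(f, iters = 10) {
  used_mb <- numeric(iters)
  for (k in seq_len(iters)) {
    f()
    g <- gc()                  # forces a collection and returns a usage matrix
    used_mb[k] <- sum(g[, 2])  # column 2: Mb currently used (Ncells + Vcells)
  }
  used_mb
}

# Placeholder workload (assumption); for the real test use something like
# function() nleqslv(x_test, cons_ext_test, jacobian = NULL)
workload <- function() sum(sqrt(seq_len(1e5)))

mem <- track_memory(workload, iters = 5)
print(mem)
```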
I'm attempting to "map" a function onto an array. However when trying both simple and complex functions, the parallel version is always slower than the serial version. How can I improve the performance of a parallel computation in R?
Simple parallel example:
library(parallel)
# Number of elements
arrayLength = 100
# Create data
input = 1:arrayLength
# A simple computation
foo = function(x, y) x^y - x^(y-1)
# Add complexity
iterations = 5 * 1000 * 1000
# Perform complex computation on each element
compute = function (x) {
y = x
for (i in 1:iterations) {
x = foo(x, y)
}
return(x)
}
# Parallelized compute
computeParallel = function(x) {
# Create a cluster with 1 fewer cores than are available.
cl <- makeCluster(detectCores() - 1) # 8-1 cores
# Send static vars & funcs to all cores
clusterExport(cl, c('foo', 'iterations'))
# Map
out = parSapply(cl, x, compute)
# Clean up
stopCluster(cl)
return(out)
}
system.time(out <- compute(input)) # 12 seconds using 25% of cpu
system.time(out <- computeParallel(input)) # 160 seconds using 100% of cpu
The problem is that you traded off all of the vectorization for parallelization, and that's a bad trade. You need to keep as much vectorization as possible to have any hope of getting an improvement with parallelization for this kind of problem.
The pvec function in the parallel package can be a good solution to this kind of problem, but it doesn't run in parallel on Windows. A more general solution that works on Windows is to use foreach with the itertools package, which provides functions for iterating over various objects. Here's an example that uses the "isplitVector" function to create one subvector for each worker:
library(doParallel)
library(itertools)
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)
computeChunk <- function(x) {
foreach(xc=isplitVector(x, chunks=getDoParWorkers()),
.export=c('foo', 'iterations', 'compute'),
.combine='c') %dopar% {
compute(xc)
}
}
This still may not compare very well to the pure vector version, but it should get better as the value of "iterations" increases. It may actually help to decrease the number of workers unless the value of "iterations" is very large.
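The same chunking idea can be sketched with the base parallel package alone, which also works on Windows. This is a hedged variant, not the code above: it assumes 2 workers, reuses foo and compute from the question, and lowers iterations so that it finishes quickly.

```r
library(parallel)

foo <- function(x, y) x^y - x^(y - 1)
iterations <- 1000   # reduced from the question's 5 * 1000 * 1000 (assumption)

# Vectorized: works on a whole vector x at once
compute <- function(x) {
  y <- x
  for (i in seq_len(iterations)) {
    x <- foo(x, y)
  }
  x
}

input <- 1:100

cl <- makeCluster(2)
clusterExport(cl, c("foo", "iterations"))

# One subvector per worker, so each worker keeps the vectorized inner loop
chunks <- lapply(splitIndices(length(input), length(cl)),
                 function(idx) input[idx])
out_par <- unlist(parLapply(cl, chunks, compute))
stopCluster(cl)

# Serial reference on the full vector
out_ser <- compute(input)
```

Each worker receives a single task containing half of the vector, instead of 100 tasks of one element each as with parSapply over the elements.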
parSapply will run the function on each element of input separately, which means you are giving up the speed you gained from writing foo and compute in a vectorized fashion.
pvec will run a vectorized function on multiple cores by chunks. Try this:
system.time(out <- pvec(input, compute, mc.cores=4))