Multicore and memory usage in R under Ubuntu - r

I am running R on an Ubuntu workstation with 8 virtual cores and 8 GB of RAM. I was hoping to routinely use the multicore package to run jobs on the 8 cores in parallel; however, I find that the whole R process gets duplicated 8 times.
As R actually seems to use much more memory than gc reports (by a factor of 5, even after gc()), this means that even relatively mild memory usage (one 200 MB object) becomes intractably memory-heavy once duplicated 8 times.
I looked into bigmemory to have the child processes share the same memory space, but it would require major rewriting of my code, as it doesn't deal with data frames.
Is there a way to make R as lean as possible before forking, i.e., have the OS reclaim as much memory as possible?
EDIT:
I think I understand what is going on now. The problem is not where I thought it was: objects that exist in the parent process and are not touched do not get duplicated eight times. Instead, my problem, I believe, came from the nature of the manipulation each child process performs. Each has to manipulate a big factor with hundreds of thousands of levels, and I think this is the memory-heavy part. As a result, the overall memory load is indeed proportional to the number of cores, but not as dramatically as I thought.
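(A rough illustration of why such a factor is heavy, if anyone is curious; exact sizes vary by R version, but the character levels dominate:)
g <- factor(ceiling(seq_len(1e6) / 2))   # 1e6 values, 500,000 levels
print(object.size(g), units = "Mb")      # the levels vector accounts for most of this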
Another lesson I learned is that with 4 physical cores plus hyperthreading, enabling hyperthreading is typically not a good idea for R: the gain is minimal and the memory cost can be non-trivial. So I'll be working on 4 cores from now on.
For those who would like to experiment, this is the type of code I was running:
# Create data
sampdata <- data.frame(id = 1:1000000)
for (letter in letters) {
  sampdata[, letter] <- rnorm(1000000)
}
sampdata$groupid <- ceiling(sampdata$id / 2)

# Enable multicore
library(multicore)
options(cores = 4)  # number of cores to distribute the job to

# Actual job
system.time(do.call("cbind",
  mclapply(subset(sampdata, select = c(a:z)),
           function(x) tapply(x, sampdata$groupid, sum))
))
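(A minimal sketch of restricting the job to physical cores, using the parallel package that has since superseded multicore; detectCores(logical = FALSE) reports physical cores on most platforms but may fall back to logical cores on some:)
library(parallel)
phys <- detectCores(logical = FALSE)          # physical cores, where the OS reports them
options(cores = min(phys, 4, na.rm = TRUE))   # cap at 4 as discussed above
# or pass it directly: mclapply(X, FUN, mc.cores = phys)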

Have you tried data.table?
> system.time(ans1 <- do.call("cbind",
    lapply(subset(sampdata, select = c(a:z)),
           function(x) tapply(x, sampdata$groupid, sum))
  ))
   user  system elapsed
906.157  13.965 928.645

> require(data.table)
> DT <- as.data.table(sampdata)
> setkey(DT, groupid)
> system.time(ans2 <- DT[, lapply(.SD, sum), by = groupid])
   user  system elapsed
186.920   1.056 191.582    # 4.8 times faster

> # massage minor diffs in results...
> ans2$groupid <- NULL
> ans2 <- as.matrix(ans2)
> colnames(ans2) <- letters
> rownames(ans1) <- NULL
> identical(ans1, ans2)
[1] TRUE
Your example is very interesting: it is reasonably large (200 MB), there are many groups (half a million), and each group is very small (2 rows). The 191 s can probably be improved by quite a lot, but at least it's a start. [March 2011]
And now this idiom (i.e. lapply(.SD, ...)) has been improved a lot. With v1.8.2, on a faster computer than the test above, and with the latest version of R, here is the updated comparison:
sampdata <- data.frame(id = 1:1000000)
for (letter in letters) sampdata[, letter] <- rnorm(1000000)
sampdata$groupid <- ceiling(sampdata$id / 2)
dim(sampdata)
# [1] 1000000      28

system.time(ans1 <- do.call("cbind",
  lapply(subset(sampdata, select = c(a:z)),
         function(x) tapply(x, sampdata$groupid, sum))
))
#    user  system elapsed
#  224.57    3.62  228.54

DT <- as.data.table(sampdata)
setkey(DT, groupid)
system.time(ans2 <- DT[, lapply(.SD, sum), by = groupid])
#    user  system elapsed
#   11.23    0.01   11.24    # 20 times faster

# massage minor diffs in results...
ans2[, groupid := NULL]
ans2[, id := NULL]
ans2 <- as.matrix(ans2)
rownames(ans1) <- NULL
identical(ans1, ans2)
# [1] TRUE
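As an aside, with current data.table the id/groupid clean-up above can be avoided by restricting .SD to the letter columns via .SDcols; a small sketch:
ans3 <- DT[, lapply(.SD, sum), by = groupid, .SDcols = letters]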
sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] data.table_1.8.2 RODBC_1.3-6

Things I've tried on 64-bit Ubuntu R, ranked in order of success:
Work with fewer cores, as you are doing.
Split the mclapply jobs into pieces and save the partial results to a database using DBI with append = TRUE (sketched below).
Use rm on large objects, together with gc(), often.
I have tried all of these, and mclapply still creates larger and larger processes as it runs, which leads me to suspect that each process is holding on to some residual memory it really doesn't need.
P.S. I was using data.table, and it seems each child process copies the data.table.
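(A hedged sketch of the second item, splitting the mclapply work into chunks and appending each chunk's results to an SQLite file via DBI; the chunk size, the ids vector and the process_one() worker are made up for illustration, and process_one() is assumed to return a one-row data frame:)
library(parallel)   # mclapply lives here in current R
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), "partial_results.sqlite")
chunks <- split(seq_along(ids), ceiling(seq_along(ids) / 1000))   # ids: your work items

for (ch in chunks) {
  res <- mclapply(ids[ch], process_one, mc.cores = 4)
  dbWriteTable(con, "results", do.call(rbind, res), append = TRUE)
  rm(res); gc()   # release the chunk before starting the next one
}
dbDisconnect(con)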

Related

object smaller than memory limit cannot be allocated [duplicate]

This question already has answers here: R memory management / cannot allocate vector of size n Mb (closed as a duplicate).
In the course of vectorizing some simulation code, I've run into a memory issue. I'm using 32-bit R version 2.15.0 (via RStudio 0.96.122) under Windows XP. My machine has 3.46 GB of RAM.
> sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: i386-pc-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Matrix_1.0-6 lattice_0.20-6 MASS_7.3-18
loaded via a namespace (and not attached):
[1] grid_2.15.0 tools_2.15.0
Here is a minimal example of the problem:
> memory.limit(3000)
[1] 3000
> rm(list = ls())
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 1069761 28.6 1710298 45.7 1710298 45.7
Vcells 901466 6.9 21692001 165.5 173386187 1322.9
> N <- 894993
> library(MASS)
> sims <- mvrnorm(n = N, mu = rep(0, 11), Sigma = diag(nrow = 11))
> sims <- mvrnorm(n = N + 1, mu = rep(0, 11), Sigma = diag(nrow = 11))
Error: cannot allocate vector of size 75.1 Mb
(In my application the covariance matrix Sigma is not diagonal, but I get the same error either way.)
I've spent the afternoon reading about memory allocation issues in R (including here, here and here). From what I've read, I get the impression that it's not a matter of the available RAM per se, but of the available contiguous address space. Still, 75.1 Mb seems pretty small to me.
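(For what it's worth, the 75.1 Mb in the message is simply the size of the matrix mvrnorm is asking for: (N + 1) rows by 11 columns of 8-byte doubles.)
(894993 + 1) * 11 * 8 / 1024^2   # bytes -> Mb
# roughly 75.1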
I'd greatly appreciate any thoughts or suggestions that you might have.
I had the same error using the raster package.
> my_mask[my_mask[] != 1] <- NA
Error: cannot allocate vector of size 5.4 Gb
The solution is really simple and consists of increasing the memory limit available to R (note that memory.limit() only works on Windows and takes a value in MB):
## Check the current limit
> memory.limit()
[1] 8103
## Increase the limit
> memory.limit(size = 56000)
[1] 56000
## This raises the limit to 56000 MB (roughly 56 GB)
Hopefully this helps you solve the problem.
Cheers
R has gotten to the point where the OS cannot allocate it another 75.1 Mb chunk of RAM. That is the size of the memory chunk required for the next sub-operation; it is not a statement about the amount of contiguous RAM required to complete the entire process. By this point all your available RAM is exhausted, you need more memory to continue, and the OS is unable to make more RAM available to R.
Potential solutions to this are manifold. The obvious one is to get hold of a 64-bit machine with more RAM. I forget the details, but IIRC on 32-bit Windows any single process can only use a limited amount of RAM (2 GB?), and regardless Windows retains a chunk of memory for itself, so the RAM available to R will be somewhat less than the 3.46 GB you have. On 64-bit Windows, R will be able to use more RAM, and the maximum amount of RAM you can fit/install is also higher.
If that is not possible, then consider an alternative approach: do your simulations in batches, with the n per batch much smaller than N. That way you can draw a much smaller number of simulations, do whatever you need with them, collect the results, and repeat until you have done sufficient simulations. N is large here (~895,000), so run a smaller batch a number of times to reach N overall.
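(A minimal sketch of the batching idea; colMeans is only a placeholder for whatever you actually compute per batch, and the batch size should be chosen to fit your RAM:)
library(MASS)
N <- 894993
batch_size <- 50000
mu <- rep(0, 11); Sigma <- diag(nrow = 11)

out <- list(); done <- 0
while (done < N) {
  n_b <- min(batch_size, N - done)
  sims <- mvrnorm(n = n_b, mu = mu, Sigma = Sigma)
  out[[length(out) + 1]] <- colMeans(sims)   # placeholder per-batch computation
  done <- done + n_b
  rm(sims); gc()
}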
gc() can help.
Saving your data as .RData, closing and re-opening R, and loading the .RData back in can also help.
See my answer here for more details: https://stackoverflow.com/a/24754706/190791
Does R stop no matter what value of N you use? Try small values to see whether mvrnorm itself is the issue, or simply loop it over subsets, inserting gc() in the loop to free some RAM continuously.

Parallel processing in R done wrong?

I have some code that I am trying to run in parallel using the foreach package. The code works, but on a computer with 4 cores it takes about 26 min, and when I switch to one with 32 cores it still takes 13 min to finish. I am wondering whether I am doing something wrong, since I am using 8 times as many cores but only halve the time. My code looks like this:
no_cores <- detectCores()
cl <- makeCluster(no_cores)
registerDoParallel(cl)

Xenopus_Data <- foreach(b = 1:length(newly_populated_vec),
                        .packages = c("raster", "gdistance", "rgdal", "sp")) %dopar% {
  Xenopus_Walk(altdata = altdata,
               water = water,
               habitat_suitability = habitat_suitability,
               max_range_without_water = max_range_without_water,
               max_range = max_range,
               slope = slope,
               Start_Pt = newly_populated_vec[b])
}

stopCluster(cl)
For the computer with 4 cores I get the following time:
Time_of_Start
[1] "2016-07-12 13:07:23 CEST"
Time_of_end
[1] "2016-07-12 13:33:10 CEST"
And for the one with 32 cores:
Time_of_Start
[1] "2016-07-12 14:35:48 CEST"
Time_of_end
[1] "2016-07-12 14:48:08 CEST"
Is this normal? And if so, does anyone know how to speed it up further, maybe using different packages?
Any kind of help is greatly appreciated!
EDIT: These are the times I get after applying the corrections suggested below. For 32 cores:
   user  system elapsed
   5.99   40.78  243.97
For 4 cores:
   user  system elapsed
   1.91    0.94  991.71
Note that before, the calculation was done multiple times via some loops, which is why the computation times dropped so drastically; but one can still see that the gap between the two computers has widened.
Try this and let me know if your problem is solved:
library(doParallel)
library(foreach)

registerDoParallel(cores = detectCores())

n <- length(newly_populated_vec)
cat("\nN = ", n, " | Parallel workers count = ", getDoParWorkers(), "\n\n", sep = "")

t0 <- proc.time()
Xenopus_Data <- foreach(b = 1:n,
                        .packages = c("raster", "gdistance", "rgdal", "sp"),
                        .combine = rbind) %dopar% {
  Xenopus_Walk(
    water = water,
    altdata = altdata,
    habitat_suitability = habitat_suitability,
    max_range_without_water = max_range_without_water,
    max_range = max_range,
    slope = slope,
    Start_Pt = newly_populated_vec[b])
}
TIME <- proc.time() - t0
Also, monitor the logical cores on your PC/laptop to check whether all of them are actually involved in the computation (Task Manager on Windows, htop on Linux).
Please also be mindful that doubling the number of cores does not necessarily double performance.
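(One common culprit is per-task scheduling overhead when each Xenopus_Walk call is short. A hedged sketch of handing each worker one large chunk of indices instead of many tiny tasks, assuming the same objects as in the question:)
library(doParallel)
library(foreach)

registerDoParallel(cores = parallel::detectCores())

n <- length(newly_populated_vec)
chunks <- split(seq_len(n), cut(seq_len(n), getDoParWorkers(), labels = FALSE))   # one chunk per worker

Xenopus_Data <- foreach(idx = chunks,
                        .packages = c("raster", "gdistance", "rgdal", "sp"),
                        .combine = c) %dopar% {
  lapply(idx, function(b)
    Xenopus_Walk(altdata = altdata, water = water,
                 habitat_suitability = habitat_suitability,
                 max_range_without_water = max_range_without_water,
                 max_range = max_range, slope = slope,
                 Start_Pt = newly_populated_vec[b]))
}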

initializing parallel chains in rjags

I'm doing some quick-and-dirty parallelization of JAGS through rjags.
I've been using the function parallel.seeds to obtain RNG states to initialize the RNGs (example below). However, I don't understand why multiple integers are returned for each RNG. The documentation says that .RNG.state is supposed to be a numeric vector of length one when you initialize.
Furthermore, sometimes when I try to do this, R crashes with no error generated. When I give up and just let it generate the seed for the chain on its own, the model runs fine. Does this mean I am using the wrong .RNG.state? Any insight would be appreciated, as I am planning to scale up this model in the future.
> parallel.seeds("base::BaseRNG", 3)
[[1]]
[[1]]$.RNG.name
[1] "base::Wichmann-Hill"
[[1]]$.RNG.state
[1] 3891 16261 19841
[[2]]
[[2]]$.RNG.name
[1] "base::Marsaglia-Multicarry"
[[2]]$.RNG.state
[1] 408065014 1176110892
[[3]]
[[3]]$.RNG.name
[1] "base::Super-Duper"
[[3]]$.RNG.state
[1] -848274653 175424331
There is a difference between .RNG.seed (which is a vector of length one, and the thing you can specify to jags.model to e.g. ensure MCMC samples are repeatable) and .RNG.state (which is a vector of length depending on the pRNG algorithm). It is possible that these got mixed up in the docs somewhere - can you tell me where you read this so I can make sure it is fixed for JAGS/rjags 4?
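(For example, to make a single chain repeatable you would supply .RNG.name and .RNG.seed, not .RNG.state, in the inits; a sketch, where "model.txt" and dat stand in for your own model file and data list:)
inits <- list(.RNG.name = "base::Mersenne-Twister", .RNG.seed = 42)
m <- jags.model("model.txt", data = dat, inits = inits, n.chains = 1)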
Regarding the crashing - some more details would be needed to help you with that I'm afraid. I assume that it is the JAGS model that crashes, and not your R session that terminates, and after the model has been running for a while? A reproducible example would help a lot.
By the way - when you say 'scale up' - if you are planning to make use of > 4 chains I would strongly recommend you load the lecuyer module (see ?parallel.seeds examples at the bottom).
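(A small sketch of the lecuyer suggestion, along the lines of the ?parallel.seeds example:)
load.module("lecuyer")                             # adds the lecuyer RNG factory
inits <- parallel.seeds("lecuyer::RngStream", 8)   # one independent RNG per chain
inits[[1]]$.RNG.name                               # "lecuyer::RngStream"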
Matt
The documentation is a bit confusing; under ?jags.model we see that .RNG.seed should be a vector of length 1, but parallel.seeds() returns .RNG.state, which is usually longer than 1. The state of the Mersenne Twister algorithm consists of 624 integers, and that is the length of the vector you get when you do
parallel.seeds("base::BaseRNG", 4)
to make sure you see all 4 types of RNG. Similarly, the state of the Wichmann-Hill generator consists of 3 integers, and I'm sure similar digging would reveal that the state spaces of the other two are also longer than 1.
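(A quick way to see the different state lengths for yourself; the order in which the four generators come out may vary:)
sapply(parallel.seeds("base::BaseRNG", 4), function(s) length(s$.RNG.state))
# e.g. 3 (Wichmann-Hill), 2 (Marsaglia-Multicarry), 2 (Super-Duper), 624 (Mersenne-Twister)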
For my own edification I mocked up an example using the LINE data in rjags:
data(LINE)
LINE$model()   ## edit and save to line.r
data <- LINE$data()
line <- jags.model("line.r", data = data)
line.samples <- jags.samples(line, c("alpha", "beta", "sigma"), n.iter = 1000)
line.samples
inits <- parallel.seeds("base::BaseRNG", 3)   # a list of lists
inits[[1]]$tau   <- 1
inits[[1]]$alpha <- 3
inits[[1]]$beta  <- 1
inits[[2]]$tau   <- 0.1
inits[[2]]$alpha <- 0.3
inits[[2]]$beta  <- 0.1
inits[[3]]$tau   <- 10
inits[[3]]$alpha <- 10
inits[[3]]$beta  <- 5

line  <- jags.model("line.r", data = data, inits = inits, n.chains = 3)
line.samples  <- jags.samples(line,  c("alpha", "beta", "sigma"), n.iter = 1000)
line2 <- jags.model("line.r", data = data, inits = inits, n.chains = 3)
line.samples2 <- jags.samples(line2, c("alpha", "beta", "sigma"), n.iter = 1000)
all(abs(line.samples$alpha - line.samples2$alpha) < 1e-8)   ## TRUE
So the results are entirely repeatable, which is cool.
To understand the conditions under which R is crashing, I'd need to know the results of sessionInfo() on your computer, plus more details of the circumstances (e.g. what JAGS model are you running?). I just did:
for (i in 1:100){parallel.seeds("base::BaseRNG",4)}
and my computer didn't crash. For reference:
sessionInfo()
# R version 3.1.3 (2015-03-09)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 7 x64 (build 7601) Service Pack 1
#
# locale:
# [1] LC_COLLATE=English_United States.1252
# [2] LC_CTYPE=English_United States.1252
# [3] LC_MONETARY=English_United States.1252
# [4] LC_NUMERIC=C
# [5] LC_TIME=English_United States.1252
#
# attached base packages:
# [1] stats graphics grDevices utils datasets
# [6] methods base
#
# other attached packages:
# [1] rjags_3-14 coda_0.17-1 mlogit_0.2-4
# [4] maxLik_1.2-4 miscTools_0.6-16 Formula_1.2-1
#
# loaded via a namespace (and not attached):
# [1] grid_3.1.3 lattice_0.20-30 lmtest_0.9-33
# [4] MASS_7.3-39 sandwich_2.3-3 statmod_1.4.21
# [7] tools_3.1.3 zoo_1.7-12
That shows the version of R and rjags that I'm using.
