I'm facing a problem with parallel computing in R regarding memory allocation.
I'm experiencing that the memory usage starts to grow and grow so that, in some point, it runs out.
For examples, please have a look to this code:
library(parallel)
set.seed(100)
data <- runif(100000000)
val <- mclapply(1:length(data), function(x) {data[x]/2}, mc.cores=3)
If I use more than one core, I miss the control over the memory management (in a machine with 8GB of RAM, I get into swap very quickly). Taking into account that data is a shared variable that shouldn't be replicated at each instance, which the best way of accessing to such resource only for reading purposes?
Related
Is there any way to unlimit the CPU usage so my PC puts more effort in finishing a task for rapidly? At the moment the k-means algorithm is estimated to finish in about 10 days, which is something I would like to reduce.
R is single-threaded by default, and runs only on a single thread on the CPU, which is a pity if you have a machine with 16 or 32 cores. By unlimiting the CPU usage, I have to assume you're asking if there's any way to have an R process (let's say part of the k-means algorithm) take advantage of your full CPU power by running the process in-parallel.
Many R packages and processes are not going to be helped by parallel processing though. So the technical solution to your particular problem goes down to the package implementation you're using. Popular packages like caret do support parallelization when that's possible, even though you may need to add an additional allowParallel=T parameter. They work in conjunction with a library such as doMC to allow multi-core processes. In the following sample code, I have my machine use 8 cores through the registerDoMC(8) function, and then set allowParallel=T.
library(doMC)
registerDoMC(8)
system.time({
ctrl_2 <- trainControl(method="cv", number=3, allowParallel=T)
fb_forest_2 <- train(classe ~ ., data=fb_train, method="rf", trControl = ctrl_2)
})
Again, parallel processing doesn't always help - Not all process can be parallelized! The documentation for foreach are a great read so if you can afford the time take a look at it. The specific code solution for your problem also depend on the library implementation you're using.
I apologize in advance since this post will not have any reproducible example.
I am using R x64 3.4.2 to run some cross-validated analyses on quite big matrices (number of columns ~ 80000, number of rows between 40 and 180). The analyses involve several features selection steps (performed with in-house functions or with functions from the CORElearnpackage, which is written in C++), as well as some clustering of the features and the fitting of a SVM model (by means of the package RWeka, that is written in Java).
I am working on a DELL Precision T7910 machine, with 2 processors Intel Xeon E5-2695 v3 2.30 GHz, 192 Gb RAM and Windows 7 x64 operating system.
To speed up the running time of my analysis I thought to use the doParallel package in combination with foreach. I would set up the cluster as follow
cl <- makeCluster(number_of_cores, type='PSOCK')
registerDoParallel(cl)
with number_of_clusterset to various numbers between 2 and 10 (detectCore() tells me that I have 56 cores in total).
My problem is that even if only setting number_of_cluster to 2, I got a protection from stack overflowerror message. The thing is that I monitor the RAM usage while the script is running and not even 20 Gb of my 192 Gb RAM are being used.
If I run the script in a sequential way it takes its sweet time (~ 3 hours with 42 rows and ~ 80000 columns), but it does run until the end.
I have tried (almost) every trick in the book for good memory management in R:
I am loading and removing big variables as needed in order to reduce memory usage
I am breaking down the steps with functions rather than scripting them directly, to take advantage of scoping
I am calling gc()every time I delete a big object in order to prompt R to return memory to the operating system
But I am still unable to run the script in parallel.
Do someone have any suggestion about this ? Should I just give up and wait > 3 hours every time I run the analyses ? And more generally: how is it possible to have a stack overflow problem when having a lot of free RAM ?
UPDATE
I have now tried to "pseudo-parallelize" the work using the same machine: since I am running a 10-fold cross-validation scheme, I am opening 5 different instances of Rgui and running 2 folds in each instances. Proceeding in this way, everything run smoothly, and the process indeed take 10 times less than running it in a single instance of R. What makes me wonder is that if 10 instances of Rgui can run at the same time and get the job done, this means that the machine has the computational resources needed. Hence I can not really get my head around the fact that %dopar% with 10 clusters does not work.
The "protection stack overflow" means that you have run out of the "protection stack", that is too many pointers have been PROTECTed but not (yet) UNPROTECTed. This could be because of a bug or inefficiency in the code you are running (in native code of a package or in native code of R, but not a bug in R source code).
This problem has nothing to do with the amount of available memory on the heap, so calling gc() will have no impact, and it is not important how much physical memory the machine has. Please do not call gc() explicitly at all, even if there was a problem with the heap usage, it just makes the program run slower but does not help: if there is not enough heap space but it could be obtained by garbage collection, the garbage collector will run automatically. As the problem is the protection stack, neither restructuring the R code nor removing dead variables explicitly will help. In principle, structuring the code into (relatively small) functions is a good thing for maintainability/readability and it also indirectly reduces scope of variables, so removing variables explicitly should become unnecessary.
It might help to increase the pointer protection stack size, which can be done at R startup from the command line using --max-ppsize.
I have the following problem/question:
I've written an R functions which is smoothing values from a time series. The time series is defined by a big number of single global raster files, hence each single pixel a series with n timesteps (generally more than 500). Even though I've plenty of RAM, I have to rely on blockwise processing because loading the entire dataset is just too much. So far so good.
I've written (IMHO) a fairly decent code, which leverages parallel processing when possible. I have a processing machine which should be more than well equipped to handle this amount of data and computation. This leads me to believe that most of the time will be spent by reading lots of values from the disk and then, after smoothing, writing lots of values to the disk.
So I've tried running the code with the files being on either a normal HDD or a normal SSD.
Against my expectations, it didn't really matter much.
Then I tried running a test function which reads a file, gets the values and writes them back to disk with the raster being on either the HDD, the SSD or a blazing fast SSD. Again, no significant difference.
I've already done a fair share of profiling to find bottlenecks, as well as a good amount of time googling for efficient solutions. There's bits of info here and there, but I decided to post this question here to get a definitive answer and maybe some pointers for me and others how to efficient manage things.
So without further ado (and for people who skipped the above), here's my question:
In a setting as described above (high data volume, blockwise processing, reading and writing from/to disk), what's the most efficient (and/or fastest) way to do computation on a long raster time series which involves reading and writing values from/to disk? (especially regarding the read write aspect)
Assuming I have a fast SSD, how can I leverage the speed? Is it done automatically?
What are the influencing factors (filesize, filetype, caching) and the most efficient setting of these factors?
I know that in terms of raster, R works the fastest with .grd, but I would like to avoid this format for flexibility, compatibility and diskspace reasons.
Maybe I'm also having a misconception of how the raster package interacts with the files on disk. In that case, should I use different functions than getValues and writeValues ?
-- Some system info and example code: --
Os: Win7 x64
CPU: Xenon E5-1650 # 3.5 GHz
RAM: 128 GB
R-version: 3.2
Raster file format: .rst
Read/write benchmark function:
benchfun <- function(x){
# x ... raster file
xr <- raster(x)
x2 <- raster(xr)
xval <- getValues(xr)
x2 <- setValues(x2,xval)
writeRaster(x2,'testras.tif',overwrite=TRUE)
}
If needed I can also provide a little example code for the time series processing, but for now I don't think it's needed.
Appreciate all tips!
Thanks,
Val
This is a somewhat generic question for which I apologize, but I can't generate a code example that reproduces the behavior. My question is this: I'm scoring a largish data set (~11 million rows with 274 dimensions) by subdividing the data set into a list of data frames and then running a scoring function on 16 cores of a 24 core Linux server using mclapply. Each data frame on the list is allocated to a spawned instance and scored, returning a list of data frames of predictions. While the mclapply is running the various R instances are spending a lot of time in uninterruptable sleep, more than they're spending running. Has anyone else experienced this using mclapply? I'm a Linux neophyte, from an OS perspective does this make any sense? Thanks.
You need to be careful when using mclapply to operate on large data sets. It's easy to create too many workers for the amount of memory on your computer and the amount of memory used by your computation. It's hard to predict the memory requirements due to the complexity of R's memory management, so it's best to monitor memory usage carefully using a tool such as "top" or "htop".
You may be able to decrease the memory usage by splitting your work into more but smaller tasks since that may reduce the memory needed by the computation. I don't think that the choice of prescheduling affects the memory usage much, since mclapply will never fork more than mc.cores workers at a time, regardless of the value of mc.prescheduling.
I just tested an elastic net with and without a parallel backend. The call is:
enetGrid <- data.frame(.lambda=0,.fraction=c(.005))
ctrl <- trainControl( method="repeatedcv", repeats=5 )
enetTune <- train( x, y, method="enet", tuneGrid=enetGrid, trControl=ctrl, preProc=NULL )
I ran it without a parallel backend registered (and got the warning message from %dopar% when the train call was finished), and then again with one registered for 7 cores (of 8). The first run took 529 seconds, the second, 313. But the first took 3.3GB memory max (reported by the Sun cluster system), and the second took 22.9GB. I've got 30GB of ram, and the task only gets more complicated from here.
Questions:
1) Is this a general property of parallel computation? I thought they shared memory....
2) Is there a way around this while still using enet inside train? If doParallel is the problem, are there other architectures that I could use with %dopar%--no, right?
Because I am interested in whether this is the expected result, this is closely related but not the exact same as this question, but I'd be fine closing this and merging my question in to that one (or marking that as duplicate and pointing to this one, since this has more detail) if that's what the concensus is:
Extremely high memory consumption of new doParallel package
In multithreaded programs, threads share lots of memory. It's primarily the stack that isn't shared between threads. But, to quote Dirk Eddelbuettel, "R is, and will remain, single-threaded", so R parallel packages use processes rather than threads, and so there is much less opportunity to share memory.
However, memory is shared between the processes that are forked by mclapply (as long as the processes don't modify it, which triggers a copy of the memory region in the operating system). That is one reason that the memory footprint can be smaller when using the "multicore" API versus the "snow" API with parallel/doParallel.
In other words, using:
registerDoParallel(7)
may be much more memory efficient than using:
cl <- makeCluster(7)
registerDoParallel(cl)
since the former will cause %dopar% to use mclapply on Linux and Mac OS X, while the latter uses clusterApplyLB.
However, the "snow" API allows you to use multiple machines, and that means that your memory size increases with the number of CPUs. This is a great advantage since it can allow programs to scale. Some programs even get super-linear speedup when running in parallel on a cluster since they have access to more memory.
So to answer your second question, I'd say to use the "multicore" API with doParallel if you only have a single machine and are using Linux or Mac OS X, but use the "snow" API with multiple machines if you're using a cluster. I don't think there is any way to use shared memory packages such as Rdsm with the caret package.
There is a minimum number of characters elsewise I would simply have typed: 1) Yes. 2) No, er, maybe. There are packages that use a "shared memory" model for parallel computation, but R's more thoroughly tested packages don't use it.
http://www.stat.berkeley.edu/scf/paciorek-parallelWorkshop.pdf
http://heather.cs.ucdavis.edu/~matloff/158/PLN/ParProcBook.pdf
http://heather.cs.ucdavis.edu/Rdsm/BARUGSlides.pdf