I'm trying to run parallel cv.glmnet Poisson models on a Windows machine with 64 GB of RAM. My data is a 20-million-row x 200-column sparse matrix, around 10 GB in size. I'm using makeCluster and doParallel, and setting parallel = TRUE in cv.glmnet. I currently have two issues with this setup:
Distributing the data to the worker processes takes hours, which significantly reduces the speedup. I know this can be solved by forking on Linux machines, but is there any way to reduce this time on Windows?
I'm running this for multiple models with different data and responses, so the object size changes each time. How can I work out in advance how many cores I can run before hitting an 'out of memory' error? I'm particularly confused about how the data gets distributed: if I run on 4 cores, the first rsession uses 30 GB of memory, while the others are closer to 10 GB. What does that 30 GB go towards, and is there any way of reducing it?
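For reference, a minimal sketch of the setup described above; `x` (the 20M x 200 sparse matrix) and `y` (the Poisson response) are assumed to already exist in the session:

```r
# Sketch of the PSOCK cluster setup used here; on Windows the workers are
# separate processes, so `x` and `y` must be serialized to each of them.
library(Matrix)
library(doParallel)
library(glmnet)

cl <- makeCluster(4)     # PSOCK workers (the only option on Windows)
registerDoParallel(cl)

fit <- cv.glmnet(x, y, family = "poisson", parallel = TRUE)

stopCluster(cl)
```

It is that per-worker serialization step that takes hours with a 10 GB matrix.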
I would like an overview of the options for model comparison in brms when the models are large (brmsfit objects of ~6 GB due to 2,000,000 iterations).
My immediate problem is that add_criterion() won't run after the models finish on my laptop (16 GB memory). I got the error message "vector memory exhausted (limit reached?)", after which I increased the memory cap on R in Renviron to 100 GB (as described here: R on MacOS Error: vector memory exhausted (limit reached?)). Total memory usage goes up to about 90 GB; I get error messages in R when I try to estimate both 'waic' and 'loo', and if I estimate just 'loo', R invariably crashes.
What are my options here and what would be the recommendations?
Use the cluster: local convention is to use a single node, but is this recommendable? (I guess not, as we have 6, 10, and 16 GB cores. Any (link to) advice on parallelising R on a cluster is welcome.)
Is it possible to have a less dense posterior in brms, i.e. sample less during estimation, as in BayesTraits?
Can I parallelise R/RStudio on my own laptop?
...?
Many thanks for your advice!
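On the "less dense posterior" question above, one option is thinning during sampling, which shrinks the resulting brmsfit. A sketch, where `my_formula` and `my_data` are placeholders:

```r
# Thin the chains at sampling time so only every 100th draw is kept,
# making the brmsfit object (and later loo/waic calls) much smaller.
library(brms)

fit <- brm(my_formula, data = my_data,
           iter = 2000000, thin = 100,   # keep 1 in 100 post-warmup draws
           chains = 4)

fit <- add_criterion(fit, criterion = "loo")
```

Whether 1-in-100 thinning is acceptable depends on how many effective draws the comparison needs.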
I am working on some computation using parallel::mclapply to parallelize the process. I have a high-performance computing (HPC) server with 64 GB of memory and a 28-core CPU. The code runs immensely faster after parallelizing, but a lot of memory and CPU cores are being wasted. How can I make it more efficient?
Here is the sample code:
data_sub <- do.call(rbind, mclapply(ds,predict_function,mc.cores=28))
The predict_function contains a small function that creates snaive, naive, or ARIMA models; which one is used is decided before the logic reaches the above line.
Here is what I often see on the log:
The first row indicates that the job wasted 51 GB of RAM and utilized less than half of the CPU allocated. The third row is the same program run with the same data, but it used more than the allocated memory while still under-utilizing the CPU cores.
Three questions currently running in my head:
How does the HPC allocate memory for each job?
Can I split the memory and cores in my R program to run two functions in parallel? Say, run the snaive method on 14 cores and allocate the other 14 to ARIMA?
How can I make my job utilize all the memory and CPU cores to make it faster?
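On the second question, the two model families can in principle be forked side by side with parallel::mcparallel(), each with its own share of cores. A sketch, where `snaive_fun`, `arima_fun`, and the two data splits are placeholders:

```r
# Run two mclapply() jobs concurrently, 14 cores each, and collect both.
library(parallel)

snaive_job <- mcparallel(mclapply(ds_snaive, snaive_fun, mc.cores = 14))
arima_job  <- mcparallel(mclapply(ds_arima,  arima_fun,  mc.cores = 14))

results  <- mccollect(list(snaive_job, arima_job))
data_sub <- do.call(rbind, c(results[[1]], results[[2]]))
```

Note this relies on forking, so it works on the Linux HPC node but not on Windows.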
Thanks in advance
Apologies if this question is too broad.
I'm running a large data set (around 20 GB, on a 64 GB 4-core Linux machine) through xgb.cv in R. I'm currently hitting two issues:
Trying 10-fold CV crashes R (no error from xgboost; the session just terminates).
Trying 5-fold, the code runs but reserves 100 GB of virtual memory and slows to a crawl.
I'm confused as to why the code can do 5-fold but not 10-fold; I would have thought each fold would be treated separately and would just take twice as long. What is xgboost doing across all the folds?
With swap issues, is there any way to better manage the memory to avoid the slowdown? The 5-fold CV is taking more than 10 times as long as a single run with a similar number of trees.
Are there any packages better adapted to large data sets, or do I just need more RAM?
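For context, a minimal sketch of the cross-validation call in question; `dtrain` (an xgb.DMatrix built from the 20 GB data) and the objective are placeholders:

```r
# 5-fold CV: note that xgboost materializes the fold DMatrices and keeps
# one booster per fold alive simultaneously, so memory grows with nfold.
library(xgboost)

cv <- xgb.cv(params = list(objective = "reg:squarederror",
                           max_depth = 6, eta = 0.1),
             data = dtrain, nrounds = 500, nfold = 5,
             early_stopping_rounds = 20, verbose = FALSE)
```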
I apologize in advance since this post will not have any reproducible example.
I am using R x64 3.4.2 to run some cross-validated analyses on quite big matrices (~80,000 columns, between 40 and 180 rows). The analyses involve several feature-selection steps (performed with in-house functions or with functions from the CORElearn package, which is written in C++), as well as some clustering of the features and the fitting of an SVM model (by means of the RWeka package, which is written in Java).
I am working on a DELL Precision T7910 machine with two Intel Xeon E5-2695 v3 2.30 GHz processors, 192 GB of RAM, and the Windows 7 x64 operating system.
To speed up the running time of my analysis I thought I would use the doParallel package in combination with foreach. I would set up the cluster as follows:
cl <- makeCluster(number_of_cores, type='PSOCK')
registerDoParallel(cl)
with number_of_cores set to various numbers between 2 and 10 (detectCores() tells me that I have 56 cores in total).
My problem is that even when setting number_of_cores to just 2, I get a "protection from stack overflow" error message. The thing is that I monitor the RAM usage while the script is running, and not even 20 GB of my 192 GB of RAM are being used.
If I run the script sequentially it takes its sweet time (~3 hours with 42 rows and ~80,000 columns), but it does run to the end.
I have tried (almost) every trick in the book for good memory management in R:
I am loading and removing big variables as needed in order to reduce memory usage
I am breaking down the steps with functions rather than scripting them directly, to take advantage of scoping
I am calling gc() every time I delete a big object in order to prompt R to return memory to the operating system
But I am still unable to run the script in parallel.
Does anyone have any suggestions about this? Should I just give up and wait more than 3 hours every time I run the analyses? And more generally: how is it possible to have a stack overflow problem with so much free RAM?
UPDATE
I have now tried to "pseudo-parallelize" the work on the same machine: since I am running a 10-fold cross-validation scheme, I am opening 5 different instances of Rgui and running 2 folds in each instance. Proceeding this way, everything runs smoothly, and the process indeed takes about 10 times less time than running it in a single instance of R. What makes me wonder is that if 10 instances of Rgui can run at the same time and get the job done, the machine evidently has the computational resources needed. Hence I cannot really get my head around the fact that %dopar% with 10 workers does not work.
The "protection stack overflow" error means that you have run out of the protection stack: too many pointers have been PROTECTed but not (yet) UNPROTECTed. This could be due to a bug or inefficiency in the code you are running (in the native code of a package, or in the native code of R itself, but not in code written in R).
This problem has nothing to do with the amount of available heap memory, so calling gc() will have no impact, and it does not matter how much physical memory the machine has. Please do not call gc() explicitly at all: even if there were a problem with heap usage, it would only make the program run slower without helping, because when heap space is short but reclaimable, the garbage collector runs automatically. As the problem is the protection stack, neither restructuring the R code nor explicitly removing dead variables will help. In principle, structuring the code into (relatively small) functions is a good thing for maintainability/readability, and it also indirectly reduces the scope of variables, so removing variables explicitly should then become unnecessary.
It might help to increase the pointer protection stack size, which can be done at R startup from the command line using --max-ppsize.
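For example (the script name here is hypothetical):

```r
# Start R with a larger pointer protection stack (the default ppsize is
# 50000; 500000 is the maximum), then run the analysis as usual:
#
#   R --max-ppsize=500000 -f my_analysis.R
```

The flag must be passed at startup; the protection stack size cannot be changed from within a running session.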
I'm running cross-validated deep-learning training (nfolds = 4) iteratively for feature selection on H2O through R. Currently I have only 2 layers (i.e. not deep) and between 8 and 50 neurons per layer. There are only 323 inputs and 12 output classes.
Training one model takes on average around 40 seconds on my Intel 4770K (32 GB RAM). During training, H2O is able to max out all CPU cores.
Now, to try to speed up the training, I've set up an EC2 instance in the Amazon cloud. I tried the largest compute-optimized instance (c4.8xlarge), but the speedup was minimal: it took around 24 seconds to train one model with the same settings. Therefore I suspect there's something I've overlooked.
I started the training like this:
localH2O <- h2o.init(ip = 'localhost', port = 54321, max_mem_size = '24G', nthreads=-1)
Just to compare the processors: the 4770K scores 10163 on CPU benchmark, while the Intel Xeon E5-2666 v3 scores 24804 (with 36 vCPUs).
This speedup is quite disappointing, to say the least, and is not worth all the extra work of installing and setting everything up in the Amazon cloud while paying over $2/hour.
Is there something else that needs to be done to get all cores working besides setting nthreads=-1 ?
Do I need to start making several clusters in order to get the training time down, or should I just start on a new deep learning library that supports GPUs?
To directly answer your question, no, H2O is not supposed to be slow. :-) It looks like you have a decent PC, and the Amazon instances (even though they have more vCPUs) are not using the best processors (like what you would find in a gaming PC). The base/max turbo frequency of your PC's processor is 3.5 GHz / 3.9 GHz, while the c4.8xlarge is only 2.9 GHz / 3.5 GHz.
I'm not sure this is necessary, but since the c4.8xlarge instances have 60 GB of RAM, you could increase max_mem_size from '24G' to at least '32G' (since that's what your PC has), or even something bigger. (Memory is not usually the limiting factor here, so this may not change anything, but it may be worth a try.)
Also, if you are concerned about EC2 price, maybe look into spot instances instead. If you require additional real speedup, you should consider using multiple nodes in your EC2 H2O cluster, rather than a single node.
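A sketch of the max_mem_size suggestion above; the '48G' value is illustrative, chosen to leave headroom on a 60 GB instance:

```r
# Restart the local H2O node on the c4.8xlarge with a larger JVM heap.
library(h2o)

h2o.shutdown(prompt = FALSE)   # stop the node started with '24G'
localH2O <- h2o.init(ip = 'localhost', port = 54321,
                     max_mem_size = '48G', nthreads = -1)
```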