I'm running cross-validation deep learning training (nfolds=4) iteratively for feature selection on H2O through R. Currently, I have only 2 layers (i.e. not deep) and between 8 and 50 neurons per layer. There are only 323 inputs, and 12 output classes.
Training one model takes on average around 40 seconds on my Intel 4770K (32 GB RAM). During training, H2O is able to max out all CPU cores.
Now, to try to speed up the training, I've set up an EC2 instance in the Amazon cloud. I tried the largest compute-optimized instance (c4.8xlarge), but the speed-up was minimal: it took around 24 seconds to train one model with the same settings. Therefore, I suspect there's something I've overlooked.
I started the training like this:
localH2O <- h2o.init(ip = 'localhost', port = 54321, max_mem_size = '24G', nthreads=-1)
Just to compare the processors: the 4770K scores 10163 on the CPU benchmark, while the Intel Xeon E5-2666 v3 (36 vCPUs) scores 24804.
This speed-up is quite disappointing, to say the least, and is not worth all the extra work of installing and setting everything up in the Amazon cloud while paying over $2/hour.
Is there something else that needs to be done to get all cores working, besides setting nthreads = -1?
Do I need to start making several clusters in order to get the training time down, or should I just start on a new deep learning library that supports GPUs?
To directly answer your question, no, H2O is not supposed to be slow. :-) It looks like you have a decent PC and the Amazon instances (even though there are more vCPUs) are not using the best processors (like what you would find in a gaming PC). The base / max turbo frequency of your PC's processor is 3.5GHz / 3.9GHz and the c4.8xlarge is only 2.9GHz / 3.5GHz.
I'm not sure this is necessary, but since the c4.8xlarge instances have 60 GB of RAM, you could increase max_mem_size from '24G' to at least '32G' (what your PC has), or even something bigger. Memory is not usually the limiting factor, so this may not change anything, but it's worth a try.
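For example (just a sketch, and '48G' is an arbitrary illustrative value rather than a recommendation), on the 60 GB instance you could start H2O with a larger heap:

library(h2o)
localH2O <- h2o.init(ip = 'localhost', port = 54321, max_mem_size = '48G', nthreads = -1)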
Also, if you are concerned about EC2 price, look into spot instances instead. If you need a further real speed-up, consider using multiple nodes in your EC2 H2O cluster rather than a single node.
Related
I'm trying to run parallel cv.glmnet Poisson models on a Windows machine with 64 GB of RAM. My data is a 20-million-row by 200-column sparse matrix, around 10 GB in size. I'm using makeCluster and doParallel, and setting parallel = TRUE in cv.glmnet (sketched below). I currently have two issues with this setup:
Distributing the data to the different processes takes hours, which reduces the speed-up significantly. I know this can be avoided by forking on Linux machines, but is there any way of reducing this time on Windows?
I'm running this for multiple models with different data and responses, so the object size changes each time. How can I work out in advance how many cores I can run before hitting an 'out of memory' error? I'm particularly confused about how the data gets distributed: if I run on 4 cores, the first R session uses 30 GB of memory, while the others are closer to 10 GB. What does that 30 GB go towards, and is there any way of reducing it?
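For reference, the setup described above looks roughly like this (a simplified sketch with small stand-in data; the real matrix and Poisson responses are as described):

library(Matrix)
library(doParallel)   # also attaches parallel and foreach
library(glmnet)

X <- rsparsematrix(1e5, 200, density = 0.05)   # small stand-in for the 20M x 200 sparse matrix
y <- rpois(nrow(X), lambda = 2)                # stand-in Poisson response

cl <- makeCluster(4)                           # PSOCK workers (the only cluster type on Windows)
registerDoParallel(cl)
fit <- cv.glmnet(X, y, family = "poisson", parallel = TRUE)   # CV folds run on the workers
stopCluster(cl)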
I apologize in advance since this post will not have any reproducible example.
I am using R x64 3.4.2 to run some cross-validated analyses on quite big matrices (about 80000 columns and between 40 and 180 rows). The analyses involve several feature selection steps (performed with in-house functions or with functions from the CORElearn package, which is written in C++), as well as some clustering of the features and the fitting of an SVM model (by means of the RWeka package, which is written in Java).
I am working on a Dell Precision T7910 machine with two Intel Xeon E5-2695 v3 processors at 2.30 GHz, 192 GB of RAM and the Windows 7 x64 operating system.
To speed up the running time of my analysis, I thought of using the doParallel package in combination with foreach. I would set up the cluster as follows:
library(doParallel)
cl <- makeCluster(number_of_cores, type = 'PSOCK')
registerDoParallel(cl)
with number_of_cores set to various values between 2 and 10 (detectCores() tells me that I have 56 cores in total).
My problem is that even when setting number_of_cores to just 2, I get a 'protection stack overflow' error message. The thing is, I monitor the RAM usage while the script is running, and not even 20 GB of my 192 GB of RAM are being used.
If I run the script sequentially it takes its sweet time (~3 hours with 42 rows and ~80000 columns), but it does run to the end.
I have tried (almost) every trick in the book for good memory management in R:
I am loading and removing big variables as needed in order to reduce memory usage
I am breaking down the steps with functions rather than scripting them directly, to take advantage of scoping
I am calling gc() every time I delete a big object, in order to prompt R to return memory to the operating system
But I am still unable to run the script in parallel.
Does anyone have any suggestions about this? Should I just give up and wait > 3 hours every time I run the analyses? And more generally: how is it possible to hit a stack overflow problem when there is so much free RAM?
UPDATE
I have now tried to "pseudo-parallelize" the work on the same machine: since I am running a 10-fold cross-validation scheme, I opened 5 separate instances of Rgui and ran 2 folds in each instance. Proceeding this way, everything runs smoothly, and the process indeed takes about 10 times less time than running it in a single instance of R. What makes me wonder is that if 10 instances of Rgui can run at the same time and get the job done, the machine clearly has the computational resources needed. Hence I cannot really get my head around the fact that %dopar% with 10 workers does not work.
The "protection stack overflow" means that you have run out of the "protection stack", that is too many pointers have been PROTECTed but not (yet) UNPROTECTed. This could be because of a bug or inefficiency in the code you are running (in native code of a package or in native code of R, but not a bug in R source code).
This problem has nothing to do with the amount of available memory on the heap, so calling gc() will have no impact, and it is not important how much physical memory the machine has. Please do not call gc() explicitly at all, even if there was a problem with the heap usage, it just makes the program run slower but does not help: if there is not enough heap space but it could be obtained by garbage collection, the garbage collector will run automatically. As the problem is the protection stack, neither restructuring the R code nor removing dead variables explicitly will help. In principle, structuring the code into (relatively small) functions is a good thing for maintainability/readability and it also indirectly reduces scope of variables, so removing variables explicitly should become unnecessary.
It might help to increase the pointer protection stack size, which can be done at R startup from the command line using --max-ppsize.
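To make that concrete, here is a sketch (500000 is just an illustrative large value, and passing rscript_args to the workers requires a reasonably recent version of the parallel package): start the master session from the shell with R --max-ppsize=500000, and give the PSOCK workers the same flag:

library(doParallel)
# workers are launched with the larger pointer protection stack as well
cl <- makeCluster(number_of_cores, type = 'PSOCK', rscript_args = '--max-ppsize=500000')
registerDoParallel(cl)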
I've been using this code:
library(parallel)
cl <- makeCluster(detectCores() - 1)
clusterCall(cl, function(){library(imager)})
then I have a wrapper function looking something like this:
d <- matrix #Loading a batch of data into a matrix
res <- parApply(cl, d, 1, FUN, ...)
# Upload `res` somewhere
I tested on my notebook, with 8 cores (4 physical cores with hyperthreading). When I ran it on a 50,000-row, 800-column matrix, it took 177.5 s to complete, and for most of that time the 7 worker cores were kept near 100% (according to top); then it sat there for the last 15 or so seconds, which I guess was combining the results. According to system.time(), user time was 14 s, so that matches.
Now I'm running it on EC2, on a 36-core c4.8xlarge, and I'm seeing it spend almost all of its time with just one core at 100%. More precisely: there is an approximately 10-20 second burst where all cores are being used, then about 90 seconds with just one core at 100% (used by R), then about 45 seconds of other stuff (where I save the results and load the next batch of data). I'm doing batches of 40,000 rows by 800 columns.
The long-term load average, according to top, is hovering around 5.00.
Does this seem reasonable? Or is there a point at which R's parallelism spends more time on communication overhead, so that I should limit myself to, say, 16 cores? Are there any rules of thumb here?
Ref: CPU spec. I'm using "Linux 4.4.5-15.26.amzn1.x86_64 (amd64)", R version 3.2.2 (2015-08-14).
UPDATE: I tried with 16 cores. For the smallest data, run-time increased from 13.9s to 18.3s. For the medium-sized data:
With 16 cores:
user system elapsed
30.424 0.580 60.034
With 35 cores:
user system elapsed
30.220 0.604 54.395
That is, the overhead part took the same amount of time, but with fewer cores the parallel part took longer, so the run took longer overall.
I also tried using mclapply(), as suggested in the comments. It did appear to be a bit quicker (something like 330 s vs. 360 s on the particular test data I tried it on), but that was on my notebook, where other processes or overheating could affect the results. So I'm not drawing any conclusions from that yet.
There are no useful rules of thumb: the optimal number of cores for a parallel task is entirely determined by that task. For a more general discussion, see Gustafson's law.
The long single-core stretch that you're seeing probably comes from the end phase of the algorithm (the "join" phase), where the parallel results are collated into a single data structure. Since this far surpasses the parallel computation phase, it may indeed be an indication that fewer cores would be beneficial.
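One way to check this on your own workload (a rough sketch reusing d and FUN from the question; 16 workers is an arbitrary choice) is to time the parallel computation and the serial collation separately:

library(parallel)
cl <- makeCluster(16)
clusterCall(cl, function() library(imager))
# split the rows into one chunk per worker on the master
chunks <- lapply(clusterSplit(cl, seq_len(nrow(d))),
                 function(idx) d[idx, , drop = FALSE])
t_parallel <- system.time(
  parts <- parLapply(cl, chunks, function(m, f) apply(m, 1, f), f = FUN)
)
t_join <- system.time(res <- do.call(c, parts))  # use cbind instead of c if FUN returns a vector
stopCluster(cl)
t_parallel; t_join

If t_join dominates t_parallel, adding more workers will not help and a smaller cluster may well be faster overall.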
I'd add that, in case you are not aware of this wonderful resource for parallel computing in R, you may find Norman Matloff's recent book Parallel Computing for Data Science: With Examples in R, C++ and CUDA a very helpful read. I'd highly recommend it (I learnt a lot, not coming from a CS background).
The book answers your question in depth (Chapter 2 specifically), giving a high-level overview of the causes of overhead that create bottlenecks in parallel programs.
Quoting section 2.1, which implicitly partially answers your question:
There are two main performance issues in parallel programming:
Communications overhead: Typically data must be transferred back and forth between processes. This takes time, which can take quite a toll on performance. In addition, the processes can get in each other's way if they all try to access the same data at once. They can collide when trying to access the same communications channel, the same memory module, and so on. This is another sap on speed. The term granularity is used to refer, roughly, to the ratio of computation to overhead. Large-grained or coarse-grained algorithms involve large enough chunks of computation that the overhead isn't much of a problem. In fine-grained algorithms, we really need to avoid overhead as much as possible.
(Regarding the point above: when overhead is high, using fewer cores for the problem at hand can give a shorter total computation time.)
Load balance: As noted in the last chapter, if we are not careful in the way in which we assign work to processes, we risk assigning much more work to some than to others. This compromises performance, as it leaves some processes unproductive at the end of the run, while there is still work to be done.
When, if ever, should you not use all cores? One example from my personal experience: I run daily cron jobs in R on data that amounts to 100-200 GB in RAM, with multiple cores used to crunch blocks of data, and I've indeed found that running with, say, 6 out of the 32 available cores is faster than using 20-30 of them. A major reason is the memory requirement of the child processes: once a certain number of child processes were running, memory usage got high and things slowed down considerably.
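As a concrete illustration (a sketch only; work_fn and the inputs are placeholders for the real per-block computation, and mclapply forks, so this will not parallelize on Windows), you can simply benchmark a few worker counts and pick the fastest:

library(parallel)

work_fn <- function(i) sum(rnorm(1e6))     # stand-in for processing one block of data
inputs  <- 1:200

for (cores in c(2, 4, 8, 16, 32)) {
  elapsed <- system.time(mclapply(inputs, work_fn, mc.cores = cores))["elapsed"]
  cat(cores, "cores:", round(elapsed, 1), "seconds\n")
}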
I have written an R program that generates a random vector of length 1 million, and I need to simulate it 1 million times. Out of the 1 million simulations, I will be using 50K of the simulated vectors (chosen in some random manner) as samples, so 50K x 1M is the sample size. Is there a way to deal with this in R?
There are a few problems, and some not-so-good solutions.
First, R cannot store such a huge matrix on my machine; it exceeds the available RAM. I looked into packages like bigmemory and ffbase that use hard disk space, but such a huge dataset can run to terabytes, and I only have 200 GB of hard disk available on my machine.
Even if storing it were possible, there is the problem of running time: the code might take more than 100 hours to run!
Can anyone please suggest a way out? Thanks.
This answer stands somewhere between a comment and an answer. The easy way out of your dilemma is not to work with such massive data sets. You can most likely take a reasonably sized, representative subset of that data (say, requiring no more than a few hundred MB) and train your model that way.
If you have to use the model in production on actual data sets with millions of observations, then the problem would no longer be related to R.
If possible use sparse matrix techniques
If possible try leveraging disk storage and chunking the object into parts
If possible try to use Big Data tools such as H2O
Leverage multicore and HPC computing with pbdR, parallel, etc
Consider using a spot instance of a Big Data / HPC cloud VPS on AWS, Azure, DigitalOcean, etc. Most offer distributions with R preinstalled, and a high-RAM multicore instance can be "spun up" (started) and shut down quickly and cheaply
Use sampling and statistical solutions when possible (see the sketch after this list)
Consider doing some of your simulations or pre-simulation steps in a relational database, or in something like Spark + Scala; some of these have R integration nowadays
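To illustrate the sampling and chunking points above (a rough sketch; simulate_one and summarise_one are hypothetical placeholders), the idea is to decide in advance which simulations to keep and to reduce each one as it is generated, so the full 1M x 1M matrix never has to exist:

n_sims  <- 1e6                               # total simulations
vec_len <- 1e6                               # length of each simulated vector
n_keep  <- 5e4                               # simulations to retain

keep_ids <- sort(sample.int(n_sims, n_keep)) # decide up front which runs to keep

simulate_one <- function(id) {
  set.seed(id)                               # reproducible per-simulation stream
  rnorm(vec_len)                             # placeholder for the real simulation
}

summarise_one <- function(v) {
  c(mean = mean(v), sd = sd(v))              # keep a small summary, not the raw vector
}

summaries <- t(vapply(keep_ids,
                      function(id) summarise_one(simulate_one(id)),
                      numeric(2)))
# 'summaries' is only 50K x 2 and fits in memory comfortably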
My data contains 229907 rows and 200 columns. I am training randomForest on it. I know it will take time, but I don't know how much. While running randomForest on this data, R becomes unresponsive: "R Console (64 Bit) (Not Responding)". I just want to know what this means: is R still working, or has it stopped working so that I should close it and start again?
It's common for RGui to be unresponsive during a long calculation. If you wait long enough, it will usually come back.
The running time won't scale linearly with your data size. With the default parameters, more data means both more observations to process and more nodes per tree. Try building some small forests with ntree = 1, different values of the maxnodes parameter and different amounts of data, to get a feel for how long a full run should take. Keep the Windows Task Manager or similar open at the same time so that you can monitor CPU and RAM usage.
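For example (an illustrative sketch; x and y stand for your own predictor matrix and response, and the grid values are arbitrary):

library(randomForest)

grid <- expand.grid(n = c(10000, 50000, 100000), maxnodes = c(16, 64, 256))
grid$seconds <- apply(grid, 1, function(p) {
  idx <- sample(nrow(x), p["n"])             # subsample the training data
  system.time(
    randomForest(x[idx, ], y[idx], ntree = 1, maxnodes = p["maxnodes"])
  )["elapsed"]
})
grid                                         # extrapolate from these timings to the full run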
Another thing you can try is growing several small forests (small values of ntree) and then using the combine function to merge them into one big forest.
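For instance (a sketch; x and y are again placeholders for your data):

library(randomForest)

f1  <- randomForest(x, y, ntree = 50)
f2  <- randomForest(x, y, ntree = 50)
big <- combine(f1, f2)                       # behaves like a single 100-tree forest

The small forests could even be grown in separate R sessions or parallel workers and combined afterwards.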
You should check your CPU usage and memory usage. If the CPU is still showing a high usage with the R process, R is probably still going strong.
Consider switching to 32-bit R. For some reason, it seems more stable for me, even when my system is perfectly capable of 64-bit support.