I am trying to understand what the performance issues with our system are.
The CPU I/O wait metric reads:
Max is 255m and now is 9.5m
What is this "m", and how do I interpret these stats?
I am working on some computation using parallel::mclapply to parallelize the process. I have a high-performance computing (HPC) server with 64 GB of memory and a 28-core CPU. The code runs much faster after parallelizing, but a lot of memory and CPU cores are being wasted. How can I make it more efficient?
Here is the sample code:
data_sub <- do.call(rbind, mclapply(ds, predict_function, mc.cores = 28))
predict_function is a small function that fits snaive, naive or Arima models; which method is used is decided before the logic reaches the line above.
Here is what I often see in the job log: one run wasted 51 GB of RAM while using less than half of the allocated CPU, and another run of the same program on the same data used more than the allocated memory while still under-utilizing the CPU cores.
Three questions are currently running through my head:
How does the HPC allocate memory for each job?
Can I split the memory and cores in my R program to run two functions in parallel? Say, run the snaive method on 14 cores and allocate the remaining 14 to Arima? (A rough sketch of what I mean follows this list.)
How can I make my job utilize all of the allocated memory and CPU cores to make it faster?
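To make question 2 concrete, here is the kind of split I have in mind. This is only a rough sketch with placeholder series lists and fit functions (ds_snaive, ds_arima, fit_snaive and fit_arima are made up), not my actual code:

library(parallel)

## Hypothetical stand-ins for the real series lists and fit functions
ds_snaive  <- split(rnorm(280), rep(1:14, each = 20))
ds_arima   <- split(rnorm(280), rep(1:14, each = 20))
fit_snaive <- function(x) mean(x)    # placeholder for the snaive branch
fit_arima  <- function(x) median(x)  # placeholder for the Arima branch

## Fork two jobs (Unix only), each parallelizing its own half on 14 cores
job1 <- mcparallel(mclapply(ds_snaive, fit_snaive, mc.cores = 14))
job2 <- mcparallel(mclapply(ds_arima,  fit_arima,  mc.cores = 14))

## Wait for both jobs and combine their results
results  <- mccollect(list(job1, job2))
data_sub <- do.call(rbind, c(results[[1]], results[[2]]))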
Thanks in advance
I am trying to train a list of R caret models on Google Cloud Compute Engine (Ubuntu 16.04 LTS). The xgboost models (both xgbLinear and xgbTree) take forever to complete training. In fact, the CPU utilization is always 0 in the GCP status monitoring.
I used the doMC library for parallel execution. It works very well for models like C5.0, glmnet and gbm. However, for xgboost (both xgbLinear and xgbTree), for some reason the CPU seems not to be running because utilization remains at 0. Troubleshooting:
1. Removed doMC and ran with a single core only; the same problem remained.
2. Changed the parallel execution library to doParallel instead of doMC. This time the CPU utilization went up, but it took 5 minutes to complete the training on GCP. The same code finished in just 12 seconds on my local laptop. (I used 24 CPUs on GCP and 4 CPUs on my local laptop.)
3. The doMC parallel execution works well for the other algorithms. Only xgboost has this problem.
Code:
library(caret)
library(doMC)

xgblinear_Grid <- expand.grid(nrounds = c(50, 100),
                              lambda  = c(.05, .5),
                              alpha   = c(.5),
                              eta     = c(.3))

registerDoMC(cores = mc - 1)  # mc is the core count, defined earlier in the script
set.seed(123)
# formula2, train_data, metric and fitControl are defined earlier in the script
xgbLinear_varimp <- train(formula2, data = train_data, method = "xgbLinear",
                          metric = metric, tuneGrid = xgblinear_Grid,
                          trControl = fitControl,
                          preProcess = c("center", "scale", "zv"))
print(xgbLinear_varimp)
No error message is generated; it simply runs endlessly.
I encountered the same problem, and it took a long time to understand the three reasons behind it:
xgbLinear requires more memory than any other machine learning algorithm available in the caret library. For every core, you can assume at least 1 GB of RAM even for tiny datasets of only 1000 x 20 dimensions, and more for bigger datasets.
xgbLinear in combination with parallel execution has a final process that recollects the data from the threads. This process is usually responsible for the 'endless' execution time. Again, RAM is the limiting factor. You might have seen the following error message, which is often caused by allocating too little RAM:
Error in unserialize(socklist[[n]]) : error reading from connection
xgbLinear has its own parallel processing algorithm, which gets mixed up with the doParallel algorithm. Here, the effective solution is to set xgbLinear to single-threaded via an additional parameter in caret::train() - nthread = 1 - and let doParallel do the parallelization.
As an illustration of (1), memory utilization can near 80 GB, and reaches around 235 GB when training a still tiny dataset of 2500 x 14 dimensionality.
As an illustration of (2), the data recollection is the process that takes forever if you don't have enough memory.
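To make point (3) concrete, here is a minimal sketch of passing nthread = 1 through caret::train() while doParallel handles the outer resampling loop. The dataset (iris), the worker count and the resampling settings are placeholders, not values taken from the question:

library(caret)
library(doParallel)

## Register a doParallel backend for caret's outer resampling loop
cl <- makeCluster(4)            # worker count is an arbitrary choice
registerDoParallel(cl)

## nthread = 1 is passed through train()'s ... to xgboost,
## keeping xgboost single-threaded so the two schedulers don't fight
fit <- train(Species ~ ., data = iris,
             method    = "xgbLinear",
             trControl = trainControl(method = "cv", number = 5,
                                      allowParallel = TRUE),
             nthread   = 1)

stopCluster(cl)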
I'm using the parallel package to get better CPU utilization, and I thought it would reduce computation time significantly. But I got the opposite result: while CPU utilization reached almost 100% on the 4 cores I have, the timings show that using parallel produced worse results than not using it. How can that be? Is this a problem with the package? Am I missing something else? My code is big, so I can't present it here.
                        run 1     run 2     run 3     run 4
time without parallel   45 sec    1.04 min  1.5 min   6.14 min
time with parallel      1.3 min   1.7 min   2.3 min   14.5 min
number of variables     78        78        78        870
number of rows          30k       50k       70k       70k
Before going to parallel processing you should try to improve the single core performance. Without seeing your code we cannot give any concrete advice, but the first step should be to profile it (see the short example after these links). Useful resources are
http://adv-r.had.co.nz/Performance.html and
https://csgillespie.github.io/efficientR/.
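As a quick starting point, base R's sampling profiler can show where the time goes; the workload below is only a stand-in for your actual code:

## Profile a stand-in workload with base R's sampling profiler
Rprof("profile.out")
x <- replicate(20, sort(rnorm(1e6)))   # dummy work; replace with your code
Rprof(NULL)
summaryRprof("profile.out")$by.self    # functions ranked by own time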
Once you have achieved good single core performance, you can try parallel processing. As hinted in the comments, it is crucial to keep the communication overhead low. Again, without seeing your code we cannot give any concrete advice, but here is some general advice:
Do not use a sequence of multiple parallelized steps. A single parallelized step which does all the work in sequence will have lower communication overhead.
Use a reasonable chunk size. If you have 10,000 tasks, then don't send them individually but in suitable groups. The parallel package does that by default as long as you do not use load balancing. If you need load balancing for some reason, then you should group the tasks into a smaller number of chunks to be handled by the load-balancing algorithm.
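As a toy illustration of the chunking advice (the task function, worker count and chunk count are all made up):

library(parallel)

cl <- makeCluster(4)                    # 4 workers, chosen arbitrarily
tasks   <- 1:10000                      # 10,000 small tasks
do_task <- function(x) x^2              # placeholder for the real work

## parLapply() already splits the tasks into one chunk per worker,
## so communication happens only a handful of times
res <- parLapply(cl, tasks, do_task)

## If load balancing is needed, hand the balancer a modest number of
## chunks rather than 10,000 individual tasks
chunks <- split(tasks, cut(seq_along(tasks), 40, labels = FALSE))
res_lb <- parLapplyLB(cl, chunks, function(chunk) lapply(chunk, do_task))

stopCluster(cl)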
I'm running cross-validated deep learning training (nfolds=4) iteratively for feature selection on H2O through R. Currently, I have only 2 layers (i.e. not deep) and between 8 and 50 neurons per layer. There are only 323 inputs and 12 output classes.
Training one model takes on average around 40 seconds on my Intel 4770K (32 GB RAM). During training, H2O is able to max out all CPU cores.
Now, to try to speed up the training, I've set up an EC2 instance in the Amazon cloud. I tried the largest compute-optimized instance (c4.8xlarge), but the speed-up was minimal. It took around 24 seconds to train one model with the same settings. Therefore, I suspect there's something I've overlooked.
I started the training like this:
library(h2o)
localH2O <- h2o.init(ip = 'localhost', port = 54321, max_mem_size = '24G', nthreads = -1)
Just to compare the processors: the 4770K scores 10163 on CPU benchmark, while the Intel Xeon E5-2666 v3 scores 24804 (and the instance has 36 vCPUs).
This speed-up is quite disappointing to say the least, and it is not worth all the extra work of installing and setting everything up in the Amazon cloud while paying over $2/hour.
Is there something else that needs to be done to get all cores working, besides setting nthreads=-1?
Do I need to start building multi-node clusters in order to get the training time down, or should I just move to a new deep learning library that supports GPUs?
To directly answer your question: no, H2O is not supposed to be slow. :-) It looks like you have a decent PC, and the Amazon instances (even though they have more vCPUs) are not using the best processors (like what you would find in a gaming PC). The base / max turbo frequency of your PC's processor is 3.5 GHz / 3.9 GHz, while the c4.8xlarge is only 2.9 GHz / 3.5 GHz.
I'm not sure that this is necessary, but since the c4.8xlarge instances have 60 GB of RAM, you could increase max_mem_size from '24G' to at least '32G' (since that's what your PC has), or even something bigger. (That may not change much, since memory is not usually the limiting factor here, but it may be worth a try.)
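For instance, something along these lines on the c4.8xlarge; the '48G' heap size is just an illustrative value below the instance's 60 GB, not an H2O recommendation:

library(h2o)

## Give the H2O JVM a larger heap and let it use all vCPUs;
## h2o.init() prints the cores and memory the cluster actually sees
localH2O <- h2o.init(ip = "localhost", port = 54321,
                     max_mem_size = "48G", nthreads = -1)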
Also, if you are concerned about EC2 price, maybe look into spot instances instead. If you require additional real speedup, you should consider using multiple nodes in your EC2 H2O cluster, rather than a single node.
My data contains 229,907 rows and 200 columns. I am training randomForest on it. I know it will take time, but I don't know how much. While running randomForest on this data, R becomes unresponsive: "R Console (64 Bit) (Not Responding)". I just want to know what this means. Is R still working, or has it stopped and should I close it and start again?
It's common for RGui to be unresponsive during a long calculation. If you wait long enough, it will usually come back.
The running time won't scale linearly with your data size. With the default parameters, more data means both more observations to process and more nodes per tree. Try building some small forests with ntree=1, different values of the maxnodes parameter and different amounts of data, to get a feel for how long it should take. Have the Windows task manager or similar open at the same time so that you can monitor CPU and RAM usage.
Another thing you can try is making some small forests (small values of ntree) and then using the combine function to make a big forest.
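As a rough sketch of both suggestions, using the built-in iris data as a stand-in for your 229,907 x 200 dataset (the ntree and maxnodes values are arbitrary):

library(randomForest)

## Time a single-tree forest to get a feel for the per-tree cost
system.time(rf1 <- randomForest(Species ~ ., data = iris,
                                ntree = 1, maxnodes = 16))

## Grow several small forests, then merge them with combine()
forests <- lapply(1:4, function(i)
  randomForest(Species ~ ., data = iris, ntree = 25, maxnodes = 16))
rf_big <- do.call(combine, forests)
rf_big$ntree   # 100 trees in the merged forest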
You should check your CPU and memory usage. If the R process is still showing high CPU usage, R is probably still going strong.
Consider switching to 32-bit R. For some reason, it seems more stable for me, even when my system is perfectly capable of 64-bit support.