Why can `tf.distribute.MirroredStrategy` alleviate the OOM (out-of-memory) problem?

I'm training my neural network on a GPU, and I always get OOM errors when I change the following `tf.distribute.MirroredStrategy` setup to place the model on a single GPU with `tf.device` instead:
import tensorflow as tf  # gpus and get_model() are defined elsewhere in my code

with tf.distribute.MirroredStrategy(devices=gpus).scope():
    model = get_model()  # The model can be trained smoothly

with tf.device('/device:GPU:0'):
    model = get_model()
    # Memory grows after each epoch, and the training process is killed within three epochs
I would like to know what could be causing this. Are there any suggestions for fixing this problem? Thank you.

Related

How to do memory management, monitoring, and testing in R?

I notice that programming in R often runs into performance problems, especially when the code involves growing a list. It's very unintuitive what is slowing the program down so dramatically during the loop.
Specifically, I'm using caret to train a gbm model. After getting the tuned hyperparameters, I need to do LOOCV to obtain the test error, which requires training the model n times (n = number of samples). All I store in the list is my prediction result, yet appending to the list gets slower and slower as the loop progresses.
Can you offer some general advice for diagnosing memory issues like this in R programs?
First create an empty list/vector of size n, so that R does not have to reallocate it every time it adds one additional value, as in the sketch below.
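For example, a minimal sketch of the difference, where n and the sqrt() call are placeholders for the LOOCV loop and the per-fold prediction in the question:

n <- 10000

# Growing the list: R may repeatedly copy it as it lengthens
preds_grow <- list()
system.time(
  for (i in seq_len(n)) {
    preds_grow[[length(preds_grow) + 1]] <- sqrt(i)  # stand-in for a prediction
  }
)

# Preallocating: create the full-length list once, then fill by index
preds_pre <- vector("list", n)
system.time(
  for (i in seq_len(n)) {
    preds_pre[[i]] <- sqrt(i)
  }
)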

Is there a way to track progress during parallelized Random Forest building?

I'm using R's caret package to do modeling for a Coursera class on machine learning.
I'm currently building a Random Forest with 500 trees on a data set of 11k observations and 40 features.
The single-core implementation took about 3 hours to compute results, and I'm experimenting with a multi-core implementation right now (code below):
library(parallel)
library(caret)
library(doParallel)
library(foreach)

cl <- makePSOCKcluster(4)
clusterEvalQ(cl, library(foreach))
registerDoParallel(cl)

trCtrl <- trainControl(allowParallel = TRUE)
modFit2 <- train(classe ~ ., data = training, trControl = trCtrl,
                 method = "parRF", prox = TRUE, ntree = 500)
Now my question is this: is there a way to view the model-building progress at run time? Is there a package or implementation of parallelized RF that outputs, for example, the number of trees built so far as it runs?
The obvious counter-question is: why do I need to know? Can't I just wait an hour or two for the results? Monitoring won't make it faster, and might even make it slower!
I have a lot of models to build for my class, and I don't want to spend a few hours on each model wondering whether it is running or not. I want to confirm that it is building trees, stop execution, and schedule it for the night, when I will run the full models. I will be running different parameter configurations for RF and also some other time-intensive models, so I would rather spend my daytime writing code and leave my computer at the mercy of the computation, running at full speed while I'm sleeping (my browser is barely working right now :P as both my RAM and CPU are almost at 100%).
You could use getModelInfo to add cat statements to the fit function. Also, there is a verboseIter option in trainControl that you are not using here.
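For instance, a hedged sketch of the getModelInfo route: copy the built-in "rf" definition, wrap its fit function with a message, and hand the modified list to train as a custom method (the wrapper and message text are illustrative, not a tested recipe):

library(caret)

rf_info <- getModelInfo("rf", regex = FALSE)[[1]]
orig_fit <- rf_info$fit
rf_info$fit <- function(...) {
  cat("Starting a randomForest fit at", format(Sys.time()), "\n")
  orig_fit(...)
}

# A custom model list can be passed as the method argument.
# Note: when the resampling loop runs on parallel workers, the cat() output
# may go to the worker consoles instead of your main R session.
# modFit <- train(classe ~ ., data = training, method = rf_info, ntree = 500)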
Probably the problem is that you are combining trainControl(allowParallel = TRUE) with method = "parRF". The former tries to fit the resampling iterations across different cores, and "parRF" then fits each of those forests in parallel as well.
If you register 4 cores on your machine, you have probably spawned 16 workers, which may also mean that you have 17 copies of the data in memory. You are probably better off using method = "rf" with trainControl(allowParallel = TRUE).
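A minimal sketch putting these suggestions together, assuming the training data frame and classe outcome from the question and a 4-core machine:

library(caret)
library(doParallel)

cl <- makePSOCKcluster(4)
registerDoParallel(cl)

# Parallelize only at the resampling level: with plain "rf" each worker grows
# its trees sequentially, so 4 registered cores means 4 workers, not 16.
# verboseIter = TRUE prints a line per resampling iteration, although with a
# PSOCK cluster that output may not reach the main console.
trCtrl <- trainControl(allowParallel = TRUE, verboseIter = TRUE)
modFit <- train(classe ~ ., data = training,
                trControl = trCtrl, method = "rf", ntree = 500)

stopCluster(cl)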

How can I tell if R is still estimating my SVM model or has crashed?

I am using the library e1071. In particular, I'm using the svm function. My dataset has 270 fields and 800,000 rows. I've been running this program for 24+ hours now, and I have no idea if it's hung or still running properly. The command I issued was:
svmmodel <- svm(V260 ~ ., data=traindata);
I'm using Windows, and in Task Manager the status of Rgui.exe is "Not Responding". Has R already crashed? Are there any other tips or tricks to better gauge what's happening inside R or the SVM learning process?
If it helps, here are some additional things I noticed using Resource Monitor (in Windows):
CPU usage is at 13% (stable)
Number of threads is at 3 (stable)
Memory usage is at 10,505.9 MB +/- 1 MB (fluctuates)
As I'm writing this post, I also see the "similar questions" suggestions and am clicking through them. It seems that SVM training is quadratic or cubic in the number of training examples. Still, after 24+ hours: if it's reasonable to wait, I will wait, but if not, I will have to eliminate SVM as a viable predictive model.
As mentioned in the answer to this question, "SVM training can be arbitrarily long" depending on the parameters selected.
If I remember correctly from my ML class, running time is roughly proportional to the square of the number of training examples, so for 800k examples you probably do not want to wait.
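As a rough illustration of that quadratic scaling (the 50k-row, 10-minute baseline is invented for the example, not a measurement):

t_small <- 10        # minutes to train on n_small rows (assumed)
n_small <- 50e3
n_large <- 800e3
t_large <- t_small * (n_large / n_small)^2   # about 2560 minutes
t_large / 60 / 24                            # roughly 1.8 days, before any parameter tuning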
Also, as an anecdote, I once ran e1071 in R for more than two days on a smaller data set than yours. It eventually completed, but the training took too long for my needs.
Keep in mind that most ML algorithms, including SVM, will usually not achieve the desired result out of the box. Therefore, when you are thinking about how fast you need it to run, keep in mind that you will have to pay the running time every time you tweak a tuning parameter.
Of course you can reduce this running time by sampling down to a smaller training set, with the understanding that you will be learning from less data.
By default, the svm function from e1071 uses a radial basis kernel, which makes SVM induction computationally expensive. You might want to consider a linear kernel (argument kernel = "linear") or a specialized library like LiblineaR, which is built for large datasets. But your dataset is really large, and if a linear kernel does not do the trick then, as suggested by others, you can use a subset of your data to generate the model.
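A sketch of both alternatives, assuming the traindata data frame and V260 target from the question and numeric predictors; the LiblineaR solver choice is illustrative:

library(e1071)
library(LiblineaR)

# 1) e1071 with a linear kernel instead of the default radial kernel
svm_linear <- svm(V260 ~ ., data = traindata, kernel = "linear")

# 2) LiblineaR takes a numeric feature matrix and a separate target vector
x <- as.matrix(traindata[, setdiff(names(traindata), "V260")])
y <- traindata$V260
m <- LiblineaR(data = x, target = y, type = 2)  # type 2: L2-regularized L2-loss SVC; see ?LiblineaR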

R becomes unresponsive while running randomForest on huge data. Does this mean it is still running, or has it stopped working?

My data contains 229,907 rows and 200 columns. I am training randomForest on it. I know it will take time, but I do not know how long. While running randomForest on this data, R becomes unresponsive: "R Console (64 Bit) (Not Responding)". I just want to know what this means. Is R still working, or has it stopped working so that I should close it and start again?
It's common for RGui to be unresponsive during a long calculation. If you wait long enough, it will usually come back.
The running time won't scale linearly with your data size. With the default parameters, more data means both more observations to process and more nodes per tree. Try building some small forests with ntree=1, different values of the maxnodes parameter and different amounts of data, to get a feel for how long it should take. Have the Windows task manager or similar open at the same time so that you can monitor CPU and RAM usage.
Another thing you can try is making some small forests (small values of ntree) and then using the combine function to make a big forest.
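A minimal sketch of both ideas, where mydata and the response y are placeholders for the 229,907 x 200 data set:

library(randomForest)

# Time a single tree to get a feel for the per-tree cost
system.time(
  rf1 <- randomForest(y ~ ., data = mydata, ntree = 1)
)

# Grow the forest in small chunks, then merge them into one big forest
chunks <- lapply(1:10, function(i) randomForest(y ~ ., data = mydata, ntree = 50))
rf_big <- do.call(combine, chunks)   # a single forest with 500 trees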
You should check your CPU usage and memory usage. If the CPU still shows high usage for the R process, R is probably still going strong.
Consider switching to 32-bit R. For some reason, it seems more stable for me, even when my system is perfectly capable of 64-bit support.

How to solve out of memory error?

I am doing my project on OCR. I am using an image size of 64x64 because when I tried 32x32 and other sizes, some pixels were lost. I have tried features such as zonal density, Zernike moments, projection histogram, distance profile, and crossings. The main problem is that the feature vector is too big. I have tried combinations of the above features, but whenever I train the neural network I get an "out of memory" error. I tried PCA for dimensionality reduction, but it did not work well and I did not get good accuracy during training. I ran the code on both my PC and my laptop and got the same error on both. My RAM is 2 GB, so I am thinking about reducing the image size. Is there any solution to this problem?
I have one more problem: whenever I train the neural network using the same features, the results vary. How can I solve this as well?
It's not about the size of the image. A 64x64 image is certainly not going to blow your RAM. There must be bugs in your neural network or other algorithms.
And please paste more details about your implementation. We don't even know what language you are using.
