Getting “OutOfMemory Error in GpuMemory: 0” from small CNN and small data-set

My objective is to train a very simple CNN on MNIST using TensorFlow, convert it to TensorRT, and use it to perform inference on the MNIST test set with TensorRT, all on a Jetson Nano, but I am getting several errors and warnings, including “OutOfMemory Error in GpuMemory: 0”. To try to reduce the memory footprint, I also created a script which simply loads the TensorRT model (already converted and saved by the previous script) and uses it to perform inference on a small subset of the MNIST test set (100 test points), but I am still getting the same out-of-memory error. The entire directory containing the TensorRT model is only 488 KB, and the 100 test points can’t be taking up very much memory, so I am confused about why GPU memory is running out. What could be the reason for this, and how can I solve it?
Another thing which seems suspicious is that some of the TensorFlow logging info messages are printed multiple times, e.g. “Successfully opened dynamic library libcudart”, “Successfully opened dynamic library libcublas”, “ARM64 does not support NUMA - returning NUMA node zero”. What could be the reason for this (e.g. dynamic libraries being opened over and over again), and could it have something to do with why the GPU memory keeps running out?
Shown below are the two Python scripts; the console output from each one is too long to post on Stack Overflow, but it can be seen attached to this Gist: https://gist.github.com/jakelevi1996/8a86f2c2257001afc939343891ee5de7
"""
Example script which trains a simple CNN for 1 epoch on a subset of MNIST, and
converts the model to TensorRT format, for enhanced performance which fully
utilises the NVIDIA GPU, and then performs inference.
Useful resources:
- https://stackoverflow.com/questions/58846828/how-to-convert-tensorflow-2-0-savedmodel-to-tensorrt
- https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#worflow-with-savedmodel
- https://www.tensorflow.org/api_docs/python/tf/experimental/tensorrt/Converter
- https://github.com/tensorflow/tensorflow/issues/34339
- https://github.com/tensorflow/tensorrt/blob/master/tftrt/examples/image-classification/image_classification.py
Tested on the NVIDIA Jetson Nano, Python 3.6.9, tensorflow 2.1.0+nv20.4, numpy
1.16.1
"""
import os
from time import perf_counter
import numpy as np
t0 = perf_counter()
import tensorflow as tf
from tensorflow.keras import datasets, layers, models, Input
from tensorflow.python.compiler.tensorrt import trt_convert as trt
from tensorflow.python.saved_model import signature_constants
from tensorflow.python.saved_model import tag_constants
from tensorflow.python.framework import convert_to_constants
tf.compat.v1.enable_eager_execution() # see github issue above
# Get training and test data
(x_train, y_train), (x_test, y_test) = datasets.mnist.load_data()
x_train = np.expand_dims(x_train, -1) / 255.0
x_test = np.expand_dims(x_test, -1) / 255.0
# Create model
model = models.Sequential()
# model.add(Input(shape=x_train.shape[1:], batch_size=batch_size))
model.add(layers.Conv2D(10, (5, 5), activation='relu', padding="same"))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Flatten())
model.add(layers.Dense(10))
# Compile and train model
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
model.fit(
    x_train[:10000], y_train[:10000], validation_data=(x_test, y_test),
    batch_size=100, epochs=1,
)
# Save model
print("Saving model...")
current_dir = os.path.dirname(os.path.abspath(__file__))
model_dir = os.path.join(current_dir, "CNN_MNIST")
if not os.path.isdir(model_dir): os.makedirs(model_dir)
# model.save(model_dir)
tf.saved_model.save(model, model_dir)
# Convert to TRT format
trt_model_dir = os.path.join(current_dir, "CNN_MNIST_TRT")
converter = trt.TrtGraphConverterV2(input_saved_model_dir=model_dir)
converter.convert()
converter.save(trt_model_dir)
t1 = perf_counter()
print("Finished TRT conversion; time taken = {:.3f} s".format(t1 - t0))
# Make predictions using saved model, and print the results (NB using an alias
# for tf.saved_model.load, because the normal way of calling this function
# throws an error because for some reason it is expecting a sess)
saved_model_loaded = tf.compat.v1.saved_model.load_v2(
    export_dir=trt_model_dir, tags=[tag_constants.SERVING])
graph_func = saved_model_loaded.signatures[
    signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY]
graph_func = convert_to_constants.convert_variables_to_constants_v2(graph_func)
x_test_tensor = tf.convert_to_tensor(x_test, dtype=tf.float32)
preds = graph_func(x_test_tensor)[0].numpy()
print(preds.shape, y_test.shape)
accuracy = list(preds.argmax(axis=1) == y_test).count(True) / y_test.size
print("Accuracy of predictions = {:.2f} %".format(accuracy * 100))
"""
Example script which trains a simple CNN for 1 epoch on a subset of MNIST, and
converts the model to TensorRT format, for enhanced performance which fully
utilises the NVIDIA GPU.
Useful resources:
- https://stackoverflow.com/questions/58846828/how-to-convert-tensorflow-2-0-savedmodel-to-tensorrt
- https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#worflow-with-savedmodel
- https://www.tensorflow.org/api_docs/python/tf/experimental/tensorrt/Converter
- https://github.com/tensorflow/tensorflow/issues/34339
- https://github.com/tensorflow/tensorrt/blob/master/tftrt/examples/image-classification/image_classification.py
Tested on the NVIDIA Jetson Nano, Python 3.6.9, tensorflow 2.1.0+nv20.4, numpy
1.16.1
"""
import os
from time import perf_counter
import numpy as np
t0 = perf_counter()
import tensorflow as tf
from tensorflow.keras import datasets
from tensorflow.python.saved_model import signature_constants
from tensorflow.python.saved_model import tag_constants
from tensorflow.python.framework import convert_to_constants
tf.compat.v1.enable_eager_execution() # see github issue above
# Get training and test data
(x_train, y_train), (x_test, y_test) = datasets.mnist.load_data()
x_train = np.expand_dims(x_train, -1) / 255.0
x_test = np.expand_dims(x_test, -1) / 255.0
# TEMPORARY: just use 100 test points to minimise GPU memory
num_points = 100
x_test, y_test = x_test[:num_points], y_test[:num_points]
current_dir = os.path.dirname(os.path.abspath(__file__))
trt_model_dir = os.path.join(current_dir, "CNN_MNIST_TRT")
# Make predictions using saved model, and print the results (NB using an alias
# for tf.saved_model.load, because the normal way of calling this function
# throws an error because for some reason it is expecting a sess)
saved_model_loaded = tf.compat.v1.saved_model.load_v2(
    export_dir=trt_model_dir, tags=[tag_constants.SERVING])
graph_func = saved_model_loaded.signatures[
    signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY]
graph_func = convert_to_constants.convert_variables_to_constants_v2(graph_func)
x_test_tensor = tf.convert_to_tensor(x_test, dtype=tf.float32)
preds = graph_func(x_test_tensor)[0].numpy()
print(preds.shape, y_test.shape)
accuracy = list(preds.argmax(axis=1) == y_test).count(True) / y_test.size
print("Accuracy of predictions = {:.2f} %".format(accuracy * 100))
t1 = perf_counter()
print("Finished inference; time taken = {:.3f} s".format(t1 - t0))

I had the same error on a Jetson TX2. I think it comes from the memory shared between the GPU and the CPU: either TensorFlow doesn't allow enough memory, or the OS limits the allocation.
To fix this, you can allow memory growth:
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)
Or you can explicitly limit the amount of memory TensorFlow is allowed to allocate:
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Restrict TensorFlow to only allocate 2 GB of memory on the first GPU
    try:
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=2048)])
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)
These examples come from https://www.tensorflow.org/guide/gpu
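In your case, the natural place to apply this in the second (inference-only) script is right after importing TensorFlow and before the SavedModel is loaded, since the setting has to be applied before the GPU is initialised. A minimal sketch, assuming the same tf.config.experimental API that ships with TF 2.1:

import tensorflow as tf
from tensorflow.python.saved_model import tag_constants

# Must run before any op touches the GPU, i.e. before the model is loaded
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        print(e)  # raised if the GPU has already been initialised

# Hypothetical path, matching the directory name used in the question
trt_model_dir = "CNN_MNIST_TRT"
saved_model_loaded = tf.compat.v1.saved_model.load_v2(
    export_dir=trt_model_dir, tags=[tag_constants.SERVING])

The same ordering applies to the memory-limit variant above: it only takes effect if it runs before TensorFlow initialises the GPU.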

I can see in the logs that TensorFlow created the GPU device with only ~640 MB:
Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 638 MB memory)
And then it tried to allocate 1 GiB:
Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.00GiB (rounded to 1073742336).
It is also clear that the GPU device has more memory than 638 MB, as visible here in the logs:
2020-06-23 23:06:36.463934: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:00:00.0 name: NVIDIA Tegra X1 computeCapability: 5.3
coreClock: 0.9216GHz coreCount: 1 deviceMemorySize: 3.87GiB deviceMemoryBandwidth: 194.55MiB/s
So maybe your GPU is running some other calculation?
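A related possibility (an assumption on my part, not something your logs prove): the 1 GiB request is the same size as TF-TRT's default engine-build workspace, since max_workspace_size_bytes defaults to 1 GiB, and with TrtGraphConverterV2 the TensorRT engines are typically built lazily at inference time. Shrinking the workspace during conversion might therefore let the engine build fit into the Nano's free memory. A sketch, assuming the TF 2.1 TF-TRT API and the directory names from the question:

from tensorflow.python.compiler.tensorrt import trt_convert as trt

model_dir = "CNN_MNIST"          # SavedModel written by the training script
trt_model_dir = "CNN_MNIST_TRT"  # where the converted model is saved

# Reduce the TensorRT build workspace from the ~1 GiB default to 256 MiB
conversion_params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(
    max_workspace_size_bytes=1 << 28)
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir=model_dir, conversion_params=conversion_params)
converter.convert()
converter.save(trt_model_dir)

If the allocation failure persists even with a smaller workspace, that would support the idea that something else is already holding most of the GPU memory.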

Related

Q: R Torch memory usage on GPU for convolutional LSTM tutorial

The convolution LSTM tutorial linked from the Torch for R frontpage is written to run on CPU by default -- no calls to a GPU device. I can set it up and run it just fine.
When I make a GPU modification, as follows in the code below,
model <- convlstm(input_dim = 1, hidden_dims = c(64, 1), kernel_sizes = c(3,3), n_layers = 2)
#---CUDA modification
device <- if (cuda_is_available()) torch_device("cuda:0") else "cpu"
model <- model$to(device = device)
#---
optimizer <- optim_adam(model$parameters)
num_epochs <- 100
for (epoch in 1:num_epochs) {
  model$train()
  batch_losses <- c()
  for (b in enumerate(dl)) {
    optimizer$zero_grad()
    preds <- model(b$x$to(device = device))[[2]][[2]][[1]]  # last time step output from the last layer
    loss <- nnf_mse_loss(preds, b$y$to(dtype = torch_float(), device = device))
    batch_losses <- c(batch_losses, loss$item())
    loss$backward()
    optimizer$step()
  }
  if (epoch %% 10 == 0)
    cat(sprintf("\nEpoch %d, training loss: %3f\n", epoch, mean(batch_losses)))
}
I get 40-50 epochs in before encountering an obvious "your GPU is out of memory" error:
Error in (function (self, gradient, retain_graph, create_graph) :
CUDA out of memory. Tried to allocate 72.00 MiB (GPU 0; 6.00 GiB total capacity; 4.40 GiB already allocated; 10.19 MiB free; 4.74 GiB reserved in total by PyTorch)
Exception raised from malloc at ..\..\c10\cuda\CUDACachingAllocator.cpp:272 (most recent call first):
00007FFF51AFA7B200007FFF51AFA750 c10.dll!c10::Error::Error [<unknown file> # <unknown line number>]
My GPU is pretty dated, a 2014-2015 GeForce GTX 970M with 3 GB of native memory. But the tensors in this example are not particularly large: they're dim(100, 6, 1, 24, 24) synthetic tensors, although admittedly this structure preserves all of the hidden and cell states. I don't have the background to calculate what this 'should' be using, but my intuition is that something about this setup (or the current R torch implementation) isn't cleaning up after itself, particularly across training epochs.
Is anyone able to either reproduce or easily run a GPU modification of my example, and is there a straightforward solution here or is there simply a fundamental limit to the capacity of my GPU in this case?

How to bootstrap using large datasets?

I would like to use the boot() and boot.ci() functions from library("boot") for a large data set (~20,000) with type="bca".
If R (the number of bootstrap replicates) is too small (I have tried 1k - 10k), then I get the following error:
Error in bca.ci(boot.out, conf, index[1L], L = L, t = t.o, t0 = t0.o, :
estimated adjustment 'a' is NA
However, if I do 15k - 20+k bootstraps, then I get:
Cannot allocate vector size # GB
(usually ranging from 1.7 to 6.4 GB, depending on the dataset and the number of bootstraps).
I read that I needed more RAM, but I have a Windows desktop with 16 GB of RAM and I'm using 64-bit R, so my computer should be able to handle this.
How can I use bootstrapping methods on larger datasets if too few bootstraps cannot produce estimates and sufficient bootstraps results in insufficient memory?
My code:
multRegress <- function(mydata) {
  numVar <<- NCOL(mydata)
  Variables <<- names(mydata)[2:numVar]
  mydata <- cor(mydata, use = "pairwise.complete.obs")
  RXX <- mydata[2:numVar, 2:numVar]
  RXY <- mydata[2:numVar, 1]
  RXX.eigen <- eigen(RXX)
  D <- diag(RXX.eigen$val)
  delta <- sqrt(D)
  lambda <- RXX.eigen$vec %*% delta %*% t(RXX.eigen$vec)
  lambdasq <- lambda^2
  beta <- solve(lambda) %*% RXY
  rsquare <<- sum(beta^2)
  RawWgt <- lambdasq %*% beta^2
  import <- (RawWgt / rsquare) * 100
  result <<- data.frame(Variables, Raw.RelWeight = RawWgt,
                        Rescaled.RelWeight = import)
}
# function passed to boot
multBootstrap <- function(mydata, indices) {
  mydata <- mydata[indices, ]
  multWeights <- multRegress(mydata)
  return(multWeights$Raw.RelWeight)
}
# call boot
multBoot <- boot(thedata, multBootstrap, 15000)
multci <- boot.ci(multBoot, conf = 0.95, type = "bca")

Running Random Search in mlr R package on Ubuntu 18.04 takes too long

I have a problem when I search for optimal hyperparameters of xgboost using mlr package in R, using Random Search method, on Ubuntu 18.04. This is the setup code for the search:
eta_value <- 0.05
set.seed(12345)
# 2. Create tasks
train.both$y <- as.factor(train.both$y) # altering y in train.both!
traintask <- makeClassifTask(data = train.both,target = "y")
# 3. Create learner
lrn <- makeLearner("classif.xgboost",predict.type = "prob")
lrn$par.vals <- list(
  objective = "binary:logistic",
  booster = "gbtree",
  eval_metric = "auc",
  early_stopping_rounds = 10,
  nrounds = xgbcv$best_iteration,
  eta = eta_value,
  weight = train_data$weights
)
# 4. Set parameter space
params <- makeParamSet(
  makeDiscreteParam("max_depth", values = c(4, 6, 8, 10)),
  makeNumericParam("min_child_weight", lower = 1L, upper = 10L),
  makeDiscreteParam("subsample", values = c(0.5, 0.75, 1)),
  makeDiscreteParam("colsample_bytree", values = c(0.4, 0.6, 0.8, 1)),
  makeNumericParam("gamma", lower = 0L, upper = 7L)
)
# 5. Set resampling strategy
rdesc <- makeResampleDesc("CV",stratify = T,iters=10L)
# 6. Search strategy
ctrl <- makeTuneControlRandom(maxit = 60L, tune.threshold = F)
# Set parallel backend and tune parameters
parallelStartMulticore(cpus = detectCores())
# 7. Parameter tuning
timer <- proc.time()
mytune <- tuneParams(learner = lrn,
                     task = traintask,
                     resampling = rdesc,
                     measures = auc,
                     par.set = params,
                     control = ctrl,
                     show.info = T)
proc.time() - timer
parallelStop
As you can see, I distribute the search task among all my CPU cores. The problem is that it has been over 5 days and the task is still running - this is the mlr output for the task (displayed while the task is running):
[Tune] Started tuning learner classif.xgboost for parameter set:
                     Type len Def        Constr Req Tunable Trafo
max_depth        discrete   -   -      4,6,8,10   -    TRUE     -
min_child_weight  numeric   -   -       1 to 10   -    TRUE     -
subsample        discrete   -   -    0.5,0.75,1   -    TRUE     -
colsample_bytree discrete   -   - 0.4,0.6,0.8,1   -    TRUE     -
gamma             numeric   -   -        0 to 7   -    TRUE     -
With control class: TuneControlRandom
Imputation value: -0
Mapping in parallel: mode = multicore; level = mlr.tuneParams; cpus = 16; elements = 60.
I used to run this on my MacBook Pro laptop and it finished within approximately 8 hours. The laptop was a 15-inch 2018 model with a 2.6 GHz Intel Core i7 (6 cores) and 32 GB of DDR4 memory.
Now I run it on a much stronger computer - the only thing that has changed is that it runs Ubuntu. The machine I'm having this problem on is a desktop with an Intel i9-9900K CPU @ 3.60GHz (16 cores). The desktop environment is GNOME 3.28.2, the OS type is 64-bit, and it has 64 GB of RAM.
I have attached a screenshot which I took while the mlr search task was running - it shows that not all the CPU cores are engaged, the opposite of what I saw when I ran this on the MacBook Pro.
What is the problem here? Is it something that has to do with the Ubuntu system and its capabilities of parallelization?
I have found a somewhat-similar question here but there was no apparent solution there as well.
When I try to run this from the terminal instead of from RStudio, it still seems that the cores are not engaged:
There is nothing running at all according to your screenshot. Based on your setup, all cores should be at 100%.
Your issue has nothing to do with your operating system per se. In fact, Linux is most often the best choice when it comes to parallelization.
There are sometimes problems when combining the "multicore" mode with xgboost; see for example https://github.com/berndbischl/parallelMap/issues/72.
You can simply try again. If that does not work, try switching the parallelization mode to "socket".
It is hard to pinpoint the real root of your problem, since there are multiple players involved (ports, conflicts with OpenMP, etc.).

BoundsError in Julia MXNet when using small batch size

I'm trying to reproduce some Python MXNet code in Julia 0.6.0, and I'm getting a BoundsError if I try to use a batch size that is smaller than the dimension of the output. If I use a larger batch size in a toy example, things work properly and the network converges to the correct solution, but in my application the output dimension is large so this isn't practical.
Here's a linear regression example that gives this error:
using MXNet
net = mx.Variable(:data)
net = mx.FullyConnected(net, name=:fc0, num_hidden=5)
net = mx.LinearRegressionOutput(net, name=:output)
mod = mx.FeedForward(net, context=mx.cpu(0))
batch_size = 4 # works for batch_size > 4
A = randn(5,100)
train_in = randn(100,1000)
train_out = A*train_in + .1*randn(5,1000)
train_provider = mx.ArrayDataProvider(:data=>train_in,
                                      :output_label=>train_out,
                                      shuffle=true,
                                      batch_size=batch_size)
optimizer = mx.SGD(lr=0.001, momentum=0.9, weight_decay=0.00001)
mx.fit(mod, optimizer, train_provider)
This produces
INFO: Start training on MXNet.mx.Context[CPU0]
INFO: Initializing parameters...
INFO: Creating KVStore...
INFO: TempSpace: Total 0 MB allocated on CPU0
INFO: Start training...
ERROR: LoadError: BoundsError: attempt to access 5×4 Array{Float32,2} at index [Base.Slice(Base.OneTo(5)), 5]
If I increase the batch size to 5 or greater, it works as expected. What am I missing?
You can track the resolution of this bug here:
https://github.com/dmlc/MXNet.jl/issues/264
I tested it two weeks ago and unfortunately it is still happening.

makeCluster with parallelSVM in R takes up all Memory and swap

I'm trying to train an SVM model on a large dataset (~110k training points). This is a sample of the code, where I use the parallelSVM package to parallelize the training step on a subset of the training data on my 4-core Linux machine.
numcore = 4
train.time = c()
for (i in 1:5)
{
  cl = makeCluster(4)
  registerDoParallel(cores = numCore)
  getDoParWorkers()
  dummy = train_train[1:10000*i, ]
  begin = Sys.time()
  model.svm = parallelSVM(as.factor(target) ~ ., data = dummy,
                          numberCores = detectCores(), probability = T)
  end = Sys.time() - begin
  train.time = c(train.time, end)
  stopCluster(cl)
  registerDoSEQ()
}
The idea of this snippet is to estimate the time it will take to train the model on the entire dataset by gradually increasing the size of the dummy training set. After running the code above for 10,000 and 20,000 training samples, this is the memory and swap usage history from the System Monitor. After 4 runs of the for loop, both memory and swap usage are at about 95%, and I get the following error:
Error in summary.connection(connection) : invalid connection
Any ideas on how to manage this problem? Is there a way to deallocate the memory used by a cluster after calling stopCluster()?
Please take into consideration the fact that I am an absolute beginner in this field. A short explanation of the proposed solutions will be greatly appreciated. Thank you.
Your line
registerDoParallel(cores=numCore)
creates a new cluster with a number of nodes equal to numCore (which you haven't defined; note that R is case-sensitive, so this is not the same variable as your numcore). This cluster is never destroyed, so with each iteration of the loop you're starting more new R processes. Since you're already creating a cluster with cl = makeCluster(4), you should use
registerDoParallel(cl)
instead.
(And move the makeCluster, registerDoParallel, stopCluster and registerDoSEQ calls outside the loop.)
