BoundsError in Julia MXNet when using small batch size

I'm trying to reproduce some Python MXNet code in Julia 0.6.0, and I'm getting a BoundsError if I try to use a batch size that is smaller than the dimension of the output. If I use a larger batch size in a toy example, things work properly and the network converges to the correct solution, but in my application the output dimension is large so this isn't practical.
Here's a linear regression example that gives this error:
using MXNet
net = mx.Variable(:data)
net = mx.FullyConnected(net, name=:fc0, num_hidden=5)
net = mx.LinearRegressionOutput(net, name=:output)
mod = mx.FeedForward(net, context=mx.cpu(0))
batch_size = 4 # works for batch_size > 4
A = randn(5,100)
train_in = randn(100,1000)
train_out = A*train_in + .1*randn(5,1000)
train_provider = mx.ArrayDataProvider(:data=>train_in,
                                      :output_label=>train_out,
                                      shuffle=true,
                                      batch_size=batch_size)
optimizer = mx.SGD(lr=0.001, momentum=0.9, weight_decay=0.00001)
mx.fit(mod, optimizer, train_provider)
This produces
INFO: Start training on MXNet.mx.Context[CPU0]
INFO: Initializing parameters...
INFO: Creating KVStore...
INFO: TempSpace: Total 0 MB allocated on CPU0
INFO: Start training...
ERROR: LoadError: BoundsError: attempt to access 5×4 Array{Float32,2} at index [Base.Slice(Base.OneTo(5)), 5]
If I increase the batch size to 5 or greater, it works as expected. What am I missing?

You can track the resolution of this bug here:
https://github.com/dmlc/MXNet.jl/issues/264
I tested it two weeks ago and unfortunately it is still happening.

Related

How do I fix "Newton failed to find minimum" in R?

I have a series of Argos data from various individual animals. I am using the R package aniMotum (previously foieGras) to create move persistence maps like the example output in the package documentation. My goal is to be able to create one of these maps for each animal in my dataframe.
When I adapt the example code to my tracking data, it fails on some tracks and not others. When it fails, I receive a series of errors like these
Error in newton(par = c(X = 4205.28883097774, X = -5904.86998464547, X = 4204.95187507608, :
Newton failed to find minimum.
And a final warning:
Warning message:
The optimiser failed. Try simplifying the model with the following argument:
map = list(rho_o = factor(NA))
Here is a sample of the code I used
library(aniMotum)
library(tidyverse)
movePersistence <- readRDS("newton_error.rds")
x <- fit_ssm(movePersistence,
             vmax = 3,
             model = "mp",
             time.step = 2,
             control = ssm_control(verbose = 2)
)
aniMotum::map(x, what = "p", normalise = TRUE, silent = TRUE)
From what I can tell, the issue originates in the newton() function in the TMB package (a dependency of aniMotum). I have noticed that changing the time.step value changes the number of errors I get. For example, on track 14 I get 13 Newton errors with a step of 2 but none with a step of 12; I am using 2 as it gives a better approximation than 12. I did try my whole dataset at various steps and each time different tracks fail. I also tried changing the tolerance and number of iterations in ssm_control(), but that was more of a Hail Mary and did not work.
Here is a dput() of track 14 to reproduce the errors on a smaller scale using the above code: https://pastebin.com/xMXzdPDh. If anyone has recommendations I would greatly appreciate them, as I do not understand the inconsistent results.
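The only other idea I have is the warning's own suggestion. A minimal sketch of that (assuming the version of fit_ssm() I am running exposes a map argument that is passed through to TMB, as the warning seems to imply):
# Hedged sketch: fix rho_o at its starting value, as the optimiser warning
# suggests; whether fit_ssm() accepts map like this is an assumption on my part.
x <- fit_ssm(movePersistence,
             vmax = 3,
             model = "mp",
             time.step = 2,
             map = list(rho_o = factor(NA)),
             control = ssm_control(verbose = 2)
)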

Why does running the tutorial example fail with RStan?

I installed RStan successfully and the library loads. I try to run a tutorial example (https://github.com/stan-dev/rstan/wiki/RStan-Getting-Started#example-1-eight-schools). After applying the stan function
fit1 <- stan(
  file = "schools.stan",    # Stan program
  data = schools_data,      # named list of data
  chains = 4,               # number of Markov chains
  warmup = 1000,            # number of warmup iterations per chain
  iter = 2000,              # total number of iterations per chain
  cores = 1,                # number of cores (could use one per chain)
  refresh = 0               # no progress shown
)
I get the following error:
Error in compileCode(f, code, language = language, verbose = verbose) :
C:\rtools42\x86_64-w64-mingw32.static.posix\bin/ld.exe: file1c34165764a4.o:file1c34165764a4.cpp:(.text$_ZN3tbb8internal26task_scheduler_observer_v3D0Ev[_ZN3tbb8internal26task_scheduler_observer_v3D0Ev]+0x1d): undefined reference to `tbb::internal::task_scheduler_observer_v3::observe(bool)'C:\rtools42\x86_64-w64-mingw32.static.posix\bin/ld.exe: file1c34165764a4.o:file1c34165764a4.cpp:(.text$_ZN3tbb10interface623task_scheduler_observerD1Ev[_ZN3tbb10interface623task_scheduler_observerD1Ev]+0x1d): undefined reference to `tbb::internal::task_scheduler_observer_v3::observe(bool)'C:\rtools42\x86_64-w64-mingw32.static.posix\bin/ld.exe: file1c34165764a4.o:file1c34165764a4.cpp:(.text$_ZN3tbb10interface623task_scheduler_observerD1Ev[_ZN3tbb10interface623task_scheduler_observerD1Ev]+0x3a): undefined reference to `tbb::internal::task_scheduler_observer_v3::observe(bool)'C:\rtools42\x86_64-w64-mingw32.static.posix\bin/ld.exe: file1c34165764a4.o:file1c34165764a4.cpp:(.text$_ZN3tbb10interface
Error in sink(type = "output") : invalid connection
Simply running example(stan_model, run.dontrun=T) gives the same error.
What does this error mean?
Is Rtools wrongly installed? My PATH seems to contain the correct folder C:\rtools42\x86_64-w64-mingw32.static.posix\bin;. Is something wrong with the version of my RStan package? I am struggling to interpret this error.
What should I try to solve it?
Apparently Rtools42 is incompatible with the current version of RStan on CRAN. The solution that worked for me is described here: https://github.com/stan-dev/rstan/wiki/Configuring-C---Toolchain-for-Windows#r-42
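For reference, the gist of that page when I followed it (the wiki may have been updated since) was to replace the CRAN builds of StanHeaders and rstan with the preview builds from the Stan repository, roughly:
# Sketch of the workaround as I applied it; exact packages/repository may have
# changed since, so treat the linked wiki page as the source of truth.
remove.packages(c("StanHeaders", "rstan"))
install.packages(c("StanHeaders", "rstan"),
                 repos = c("https://mc-stan.org/r-packages/", getOption("repos")))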

Octave Error: out of memory or dimension too large for Octave's index type

I am trying to run the following code in Octave. The variable "data" consists of 864 rows and 25333 columns.
clc; clear all; close all;
pkg load statistics
GEO = load("GSE59739.mat");
GEOT = tabulate(GEO.class)
data = GEO.data;
clear GEO
idx = kmeans(data,3,'Distance','cosine');
test1 = silhouette(data, idx, 'cosine');
xlabel('Silhouette Value')
ylabel('Cluster')
This is the error I get when trying to run the silhouette function:
"error: out of memory or dimension too large for Octave's index type". Any idea on how I can fix it?
It appears the problem is not necessarily with your data but with the way Octave's statistics package has implemented pdist. It uses an expansion that results in an array whose dimensions exceed the system limits, just as the error message says.
Running through your example with some dummy data of the same size, on Octave 6.4.0 and statistics 1.4.3, I get:
pkg load statistics
data = rand(864,25333);
idx = kmeans(data,3,'Distance','cosine');
test1 = silhouette(data, idx, 'cosine');
error: out of memory or dimension too large for Octave's index type
error: called from
pdist at line 164 column 14
silhouette at line 125 column 16
pdist is a function that calculates the "distance" between any two rows of a matrix, using one of several methods. silhouette is called with the cosine metric, and the error occurs in that calculation section:
pdist, lines 163-166 cosine block:
case "cosine"
prod = X(:,Xi) .* X(:,Yi);
weights = sumsq (X(:,Xi), 1) .* sumsq (X(:,Yi), 1);
y = 1 - sum (prod, 1) ./ sqrt (weights);
The first line calculating prod causes the error: X = data' is 25333x864, and Xi and Yi are each 372816x1, formed by running nchoosek(1:rows(data),2) (producing 372816 sets of all 2-element combinations of 1:864).
X(:,Xi) and X(:,Yi) each request creation of a rows(X) x rows(Xi) array, or 25333x372816, or 9,444,547,728 elements, which for double precision data requires 75,556,381,824 Bytes or 75.6GB. Odds are your machine can't handle this.
Just checking with Matlab 2022a: it is able to run those lines without any out-of-memory errors in a few seconds, and the test1 output is only 864x1. So it appears this excessive memory overhead is an issue specific to Octave's implementation and not inherent to the technique.
I've filed a bug report regarding this behavior at https://savannah.gnu.org/bugs/index.php?62495, but for now the answer appears to be that the 'cosine' metric, and perhaps others as well, simply cannot be used with input data of this size.
Update: as of 19 JUN 2022, a fix for this pdist memory problem has been pushed to the statistics package repository, and will be included in the next major package release. In the meantime the updated function can be found at https://github.com/gnu-octave/statistics/blob/main/inst/pdist.m

R Torch memory usage on GPU for convolutional LSTM tutorial

The convolutional LSTM tutorial linked from the Torch for R front page is written to run on the CPU by default, with no calls to a GPU device. I can set it up and run it just fine.
When I make a GPU modification, as in the code below,
model <- convlstm(input_dim = 1, hidden_dims = c(64, 1), kernel_sizes = c(3,3), n_layers = 2)
#---CUDA modification
device <- if (cuda_is_available()) torch_device("cuda:0") else "cpu"
model <- model$to(device = device)
#---
optimizer <- optim_adam(model$parameters)
num_epochs <- 100
for (epoch in 1:num_epochs) {
  model$train()
  batch_losses <- c()
  for (b in enumerate(dl)) {
    optimizer$zero_grad()
    preds <- model(b$x$to(device = device))[[2]][[2]][[1]] # last time step output from the last layer
    loss <- nnf_mse_loss(preds, b$y$to(dtype = torch_float(), device = device))
    batch_losses <- c(batch_losses, loss$item())
    loss$backward()
    optimizer$step()
  }
  if (epoch %% 10 == 0)
    cat(sprintf("\nEpoch %d, training loss: %3f\n", epoch, mean(batch_losses)))
}
I get 40-50 epochs in before encountering an obvious "your GPU is out of memory" error:
Error in (function (self, gradient, retain_graph, create_graph) :
CUDA out of memory. Tried to allocate 72.00 MiB (GPU 0; 6.00 GiB total capacity; 4.40 GiB already allocated; 10.19 MiB free; 4.74 GiB reserved in total by PyTorch)
Exception raised from malloc at ..\..\c10\cuda\CUDACachingAllocator.cpp:272 (most recent call first):
00007FFF51AFA7B200007FFF51AFA750 c10.dll!c10::Error::Error [<unknown file> # <unknown line number>]
My GPU is pretty dated, a 2014-2015 GeForce GTX 970M with 3GB of native memory. But the tensors in this example are not particularly large: they're dim(100, 6, 1, 24, 24) synthetic tensors, although admittedly this structure preserves all of the hidden and cell states. I don't have the background to calculate what this 'should' be using, but my intuition is that something about this setup (or the current R torch implementation) isn't cleaning up after itself, particularly through training epochs.
Is anyone able to either reproduce or easily run a GPU modification of my example, and is there a straightforward solution here or is there simply a fundamental limit to the capacity of my GPU in this case?
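For what it's worth, the only mitigation I can think of is to force cleanup between epochs. A minimal sketch (assuming base R's gc() and the torch package's cuda_empty_cache() release memory the way their documentation describes; I have not verified that this prevents the OOM in my example):
for (epoch in 1:num_epochs) {
  model$train()
  batch_losses <- c()
  for (b in enumerate(dl)) {
    optimizer$zero_grad()
    preds <- model(b$x$to(device = device))[[2]][[2]][[1]]
    loss <- nnf_mse_loss(preds, b$y$to(dtype = torch_float(), device = device))
    batch_losses <- c(batch_losses, loss$item())
    loss$backward()
    optimizer$step()
  }
  rm(preds, loss)     # drop references to the last batch's tensors
  gc(full = TRUE)     # let R finalise the now-unreferenced tensor pointers
  cuda_empty_cache()  # release torch's cached, unused GPU memory
  if (epoch %% 10 == 0)
    cat(sprintf("\nEpoch %d, training loss: %3f\n", epoch, mean(batch_losses)))
}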

Getting “OutOfMemory Error in GpuMemory: 0” from small CNN and small data-set

My objective is to train a very simple CNN on MNIST using Tensorflow, convert it to TensorRT, and use it to perform inference on the MNIST test set using TensorRT, all on a Jetson Nano, but I am getting several errors and warnings, including “OutOfMemory Error in GpuMemory: 0”. To try and reduce memory footprint, I tried also creating a script where I simply load the TensorRT model (that had already been converted and saved in the previous script) and use it to perform inference on a small subset of the MNIST test set (100 floating point values), but I am still getting the same out of memory error. The entire directory containing the TensorRT model is only 488 KB, and the 100 test points can’t be taking up very much memory, so I am confused about why GPU memory is running out. What could be the reason for this, and how can I solve it?
Another thing which seems suspicious is that some of the Tensorflow logging info messages are being printed multiple times, EG “Successfully opened dynamic library libcudart”, “Successfully opened dynamic library libcublas”, “ARM64 does not support NUMA - returning NUMA node zero”. What could be the reason for this (EG dynamic libraries being opened over and over again), and could this have something to do with why the GPU memory keeps running out?
Shown below are the 2 Python scripts; the console output from each one is too long to post on Stack Overflow, but they can be seen attached to this Gist: https://gist.github.com/jakelevi1996/8a86f2c2257001afc939343891ee5de7
"""
Example script which trains a simple CNN for 1 epoch on a subset of MNIST, and
converts the model to TensorRT format, for enhanced performance which fully
utilises the NVIDIA GPU, and then performs inference.
Useful resources:
- https://stackoverflow.com/questions/58846828/how-to-convert-tensorflow-2-0-savedmodel-to-tensorrt
- https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#worflow-with-savedmodel
- https://www.tensorflow.org/api_docs/python/tf/experimental/tensorrt/Converter
- https://github.com/tensorflow/tensorflow/issues/34339
- https://github.com/tensorflow/tensorrt/blob/master/tftrt/examples/image-classification/image_classification.py
Tested on the NVIDIA Jetson Nano, Python 3.6.9, tensorflow 2.1.0+nv20.4, numpy
1.16.1
"""
import os
from time import perf_counter
import numpy as np
t0 = perf_counter()
import tensorflow as tf
from tensorflow.keras import datasets, layers, models, Input
from tensorflow.python.compiler.tensorrt import trt_convert as trt
from tensorflow.python.saved_model import signature_constants
from tensorflow.python.saved_model import tag_constants
from tensorflow.python.framework import convert_to_constants
tf.compat.v1.enable_eager_execution() # see github issue above
# Get training and test data
(x_train, y_train), (x_test, y_test) = datasets.mnist.load_data()
x_train = np.expand_dims(x_train, -1) / 255.0
x_test = np.expand_dims(x_test, -1) / 255.0
# Create model
model = models.Sequential()
# model.add(Input(shape=x_train.shape[1:], batch_size=batch_size))
model.add(layers.Conv2D(10, (5, 5), activation='relu', padding="same"))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Flatten())
model.add(layers.Dense(10))
# Compile and train model
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
model.fit(
    x_train[:10000], y_train[:10000], validation_data=(x_test, y_test),
    batch_size=100, epochs=1,
)
# Save model
print("Saving model...")
current_dir = os.path.dirname(os.path.abspath(__file__))
model_dir = os.path.join(current_dir, "CNN_MNIST")
if not os.path.isdir(model_dir): os.makedirs(model_dir)
# model.save(model_dir)
tf.saved_model.save(model, model_dir)
# Convert to TRT format
trt_model_dir = os.path.join(current_dir, "CNN_MNIST_TRT")
converter = trt.TrtGraphConverterV2(input_saved_model_dir=model_dir)
converter.convert()
converter.save(trt_model_dir)
t1 = perf_counter()
print("Finished TRT conversion; time taken = {:.3f} s".format(t1 - t0))
# Make predictions using saved model, and print the results (NB using an alias
# for tf.saved_model.load, because the normal way of calling this function
# throws an error because for some reason it is expecting a sess)
saved_model_loaded = tf.compat.v1.saved_model.load_v2(
    export_dir=trt_model_dir, tags=[tag_constants.SERVING])
graph_func = saved_model_loaded.signatures[
    signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY]
graph_func = convert_to_constants.convert_variables_to_constants_v2(graph_func)
x_test_tensor = tf.convert_to_tensor(x_test, dtype=tf.float32)
preds = graph_func(x_test_tensor)[0].numpy()
print(preds.shape, y_test.shape)
accuracy = list(preds.argmax(axis=1) == y_test).count(True) / y_test.size
print("Accuracy of predictions = {:.2f} %".format(accuracy * 100))
"""
Example script which trains a simple CNN for 1 epoch on a subset of MNIST, and
converts the model to TensorRT format, for enhanced performance which fully
utilises the NVIDIA GPU.
Useful resources:
- https://stackoverflow.com/questions/58846828/how-to-convert-tensorflow-2-0-savedmodel-to-tensorrt
- https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#worflow-with-savedmodel
- https://www.tensorflow.org/api_docs/python/tf/experimental/tensorrt/Converter
- https://github.com/tensorflow/tensorflow/issues/34339
- https://github.com/tensorflow/tensorrt/blob/master/tftrt/examples/image-classification/image_classification.py
Tested on the NVIDIA Jetson Nano, Python 3.6.9, tensorflow 2.1.0+nv20.4, numpy
1.16.1
"""
import os
from time import perf_counter
import numpy as np
t0 = perf_counter()
import tensorflow as tf
from tensorflow.keras import datasets
from tensorflow.python.saved_model import signature_constants
from tensorflow.python.saved_model import tag_constants
from tensorflow.python.framework import convert_to_constants
tf.compat.v1.enable_eager_execution() # see github issue above
# Get training and test data
(x_train, y_train), (x_test, y_test) = datasets.mnist.load_data()
x_train = np.expand_dims(x_train, -1) / 255.0
x_test = np.expand_dims(x_test, -1) / 255.0
# TEMPORARY: just use 100 test points to minimise GPU memory
num_points = 100
x_test, y_test = x_test[:num_points], y_test[:num_points]
current_dir = os.path.dirname(os.path.abspath(__file__))
trt_model_dir = os.path.join(current_dir, "CNN_MNIST_TRT")
# Make predictions using saved model, and print the results (NB using an alias
# for tf.saved_model.load, because the normal way of calling this function
# throws an error because for some reason it is expecting a sess)
saved_model_loaded = tf.compat.v1.saved_model.load_v2(
    export_dir=trt_model_dir, tags=[tag_constants.SERVING])
graph_func = saved_model_loaded.signatures[
    signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY]
graph_func = convert_to_constants.convert_variables_to_constants_v2(graph_func)
x_test_tensor = tf.convert_to_tensor(x_test, dtype=tf.float32)
preds = graph_func(x_test_tensor)[0].numpy()
print(preds.shape, y_test.shape)
accuracy = list(preds.argmax(axis=1) == y_test).count(True) / y_test.size
print("Accuracy of predictions = {:.2f} %".format(accuracy * 100))
t1 = perf_counter()
print("Finished inference; time taken = {:.3f} s".format(t1 - t0))
I had the same error on a Jetson TX2. I think it comes from the shared memory between the GPU and the CPU: either TensorFlow doesn't allow enough memory or the OS limits the allocation.
To fix this, you can allow memory growth:
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)
Or you can force tensorflow to allocate enough memory:
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Restrict TensorFlow to only allocate 2GB of memory on the first GPU
    try:
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=2048)])
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)
These examples come from https://www.tensorflow.org/guide/gpu.
I see in the logs that it created a GPU device with roughly 600 MB:
Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 638 MB memory)
And then it tried to allocate 1 GiB:
Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.00GiB (rounded to 1073742336).
Also, it's clear that the GPU device has more memory than 600 MB; that's visible here in the logs:
2020-06-23 23:06:36.463934: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:00:00.0 name: NVIDIA Tegra X1 computeCapability: 5.3
coreClock: 0.9216GHz coreCount: 1 deviceMemorySize: 3.87GiB deviceMemoryBandwidth: 194.55MiB/s
So maybe your GPU is running some other calculation?
