Q: R Torch memory usage on GPU for convolutional LSTM tutorial

The convolutional LSTM tutorial linked from the Torch for R front page is written to run on the CPU by default -- no calls to a GPU device. I can set it up and run it just fine.
When I modify it to run on the GPU, as in the code below,
model <- convlstm(input_dim = 1, hidden_dims = c(64, 1), kernel_sizes = c(3,3), n_layers = 2)
#---CUDA modification
device <- if (cuda_is_available()) torch_device("cuda:0") else "cpu"
model <- model$to(device = device)
#---
optimizer <- optim_adam(model$parameters)
num_epochs <- 100
for (epoch in 1:num_epochs) {
  model$train()
  batch_losses <- c()
  for (b in enumerate(dl)) {
    optimizer$zero_grad()
    preds <- model(b$x$to(device = device))[[2]][[2]][[1]] # last time step output from the last layer
    loss <- nnf_mse_loss(preds, b$y$to(dtype = torch_float(), device = device))
    batch_losses <- c(batch_losses, loss$item())
    loss$backward()
    optimizer$step()
  }
  if (epoch %% 10 == 0)
    cat(sprintf("\nEpoch %d, training loss: %3f\n", epoch, mean(batch_losses)))
}
I get 40-50 epochs in before hitting an obvious "your GPU is out of memory" error:
Error in (function (self, gradient, retain_graph, create_graph) :
CUDA out of memory. Tried to allocate 72.00 MiB (GPU 0; 6.00 GiB total capacity; 4.40 GiB already allocated; 10.19 MiB free; 4.74 GiB reserved in total by PyTorch)
Exception raised from malloc at ..\..\c10\cuda\CUDACachingAllocator.cpp:272 (most recent call first):
00007FFF51AFA7B200007FFF51AFA750 c10.dll!c10::Error::Error [<unknown file> # <unknown line number>]
My GPU is pretty dated, a 2014-2015 GeForce GTX 970M with 3 GB of native memory. But the tensors in this example are not particularly large: they're synthetic tensors of dim(100, 6, 1, 24, 24), although admittedly this structure preserves all of the hidden and cell states. I don't have the background to calculate what this 'should' be using, but my intuition is that something about this setup (or the current R torch implementation) isn't cleaning up after itself, particularly across training epochs.
Is anyone able to either reproduce or easily run a GPU modification of my example? And is there a straightforward solution here, or is there simply a fundamental limit to the capacity of my GPU in this case?
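One thing worth trying before concluding the card is simply too small (a sketch, not part of the tutorial, and assuming your installed {torch} version exports cuda_empty_cache()): explicitly drop the tensor references and force garbage collection at the end of each epoch, so tensors kept alive only by R's lazy garbage collector are released back to the CUDA caching allocator.
for (epoch in 1:num_epochs) {
  model$train()
  batch_losses <- c()
  for (b in enumerate(dl)) {
    optimizer$zero_grad()
    preds <- model(b$x$to(device = device))[[2]][[2]][[1]]
    loss <- nnf_mse_loss(preds, b$y$to(dtype = torch_float(), device = device))
    batch_losses <- c(batch_losses, loss$item())
    loss$backward()
    optimizer$step()
  }
  # drop tensor references and let R/CUDA reclaim memory between epochs
  rm(preds, loss)
  gc()                 # run R's garbage collector so tensor finalizers fire
  cuda_empty_cache()   # then release cached blocks back to the device
  if (epoch %% 10 == 0)
    cat(sprintf("\nEpoch %d, training loss: %3f\n", epoch, mean(batch_losses)))
}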

Related

Getting “OutOfMemory Error in GpuMemory: 0” from small CNN and small data-set

My objective is to train a very simple CNN on MNIST using Tensorflow, convert it to TensorRT, and use it to perform inference on the MNIST test set using TensorRT, all on a Jetson Nano, but I am getting several errors and warnings, including “OutOfMemory Error in GpuMemory: 0”. To try and reduce memory footprint, I tried also creating a script where I simply load the TensorRT model (that had already been converted and saved in the previous script) and use it to perform inference on a small subset of the MNIST test set (100 floating point values), but I am still getting the same out of memory error. The entire directory containing the TensorRT model is only 488 KB, and the 100 test points can’t be taking up very much memory, so I am confused about why GPU memory is running out. What could be the reason for this, and how can I solve it?
Another thing which seems suspicious is that some of the TensorFlow logging info messages are being printed multiple times, e.g. "Successfully opened dynamic library libcudart", "Successfully opened dynamic library libcublas", "ARM64 does not support NUMA - returning NUMA node zero". What could be the reason for this (e.g. dynamic libraries being opened over and over again), and could this have something to do with why the GPU memory keeps running out?
Shown below are the 2 Python scripts; the console output from each one is too long to post on Stack Overflow, but they can be seen attached to this Gist: https://gist.github.com/jakelevi1996/8a86f2c2257001afc939343891ee5de7
"""
Example script which trains a simple CNN for 1 epoch on a subset of MNIST, and
converts the model to TensorRT format, for enhanced performance which fully
utilises the NVIDIA GPU, and then performs inference.
Useful resources:
- https://stackoverflow.com/questions/58846828/how-to-convert-tensorflow-2-0-savedmodel-to-tensorrt
- https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#worflow-with-savedmodel
- https://www.tensorflow.org/api_docs/python/tf/experimental/tensorrt/Converter
- https://github.com/tensorflow/tensorflow/issues/34339
- https://github.com/tensorflow/tensorrt/blob/master/tftrt/examples/image-classification/image_classification.py
Tested on the NVIDIA Jetson Nano, Python 3.6.9, tensorflow 2.1.0+nv20.4, numpy
1.16.1
"""
import os
from time import perf_counter
import numpy as np
t0 = perf_counter()
import tensorflow as tf
from tensorflow.keras import datasets, layers, models, Input
from tensorflow.python.compiler.tensorrt import trt_convert as trt
from tensorflow.python.saved_model import signature_constants
from tensorflow.python.saved_model import tag_constants
from tensorflow.python.framework import convert_to_constants
tf.compat.v1.enable_eager_execution() # see github issue above
# Get training and test data
(x_train, y_train), (x_test, y_test) = datasets.mnist.load_data()
x_train = np.expand_dims(x_train, -1) / 255.0
x_test = np.expand_dims(x_test, -1) / 255.0
# Create model
model = models.Sequential()
# model.add(Input(shape=x_train.shape[1:], batch_size=batch_size))
model.add(layers.Conv2D(10, (5, 5), activation='relu', padding="same"))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Flatten())
model.add(layers.Dense(10))
# Compile and train model
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
model.fit(
    x_train[:10000], y_train[:10000], validation_data=(x_test, y_test),
    batch_size=100, epochs=1,
)
# Save model
print("Saving model...")
current_dir = os.path.dirname(os.path.abspath(__file__))
model_dir = os.path.join(current_dir, "CNN_MNIST")
if not os.path.isdir(model_dir): os.makedirs(model_dir)
# model.save(model_dir)
tf.saved_model.save(model, model_dir)
# Convert to TRT format
trt_model_dir = os.path.join(current_dir, "CNN_MNIST_TRT")
converter = trt.TrtGraphConverterV2(input_saved_model_dir=model_dir)
converter.convert()
converter.save(trt_model_dir)
t1 = perf_counter()
print("Finished TRT conversion; time taken = {:.3f} s".format(t1 - t0))
# Make predictions using saved model, and print the results (NB using an alias
# for tf.saved_model.load, because the normal way of calling this function
# throws an error because for some reason it is expecting a sess)
saved_model_loaded = tf.compat.v1.saved_model.load_v2(
    export_dir=trt_model_dir, tags=[tag_constants.SERVING])
graph_func = saved_model_loaded.signatures[
    signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY]
graph_func = convert_to_constants.convert_variables_to_constants_v2(graph_func)
x_test_tensor = tf.convert_to_tensor(x_test, dtype=tf.float32)
preds = graph_func(x_test_tensor)[0].numpy()
print(preds.shape, y_test.shape)
accuracy = list(preds.argmax(axis=1) == y_test).count(True) / y_test.size
print("Accuracy of predictions = {:.2f} %".format(accuracy * 100))
"""
Example script which trains a simple CNN for 1 epoch on a subset of MNIST, and
converts the model to TensorRT format, for enhanced performance which fully
utilises the NVIDIA GPU.
Useful resources:
- https://stackoverflow.com/questions/58846828/how-to-convert-tensorflow-2-0-savedmodel-to-tensorrt
- https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#worflow-with-savedmodel
- https://www.tensorflow.org/api_docs/python/tf/experimental/tensorrt/Converter
- https://github.com/tensorflow/tensorflow/issues/34339
- https://github.com/tensorflow/tensorrt/blob/master/tftrt/examples/image-classification/image_classification.py
Tested on the NVIDIA Jetson Nano, Python 3.6.9, tensorflow 2.1.0+nv20.4, numpy
1.16.1
"""
import os
from time import perf_counter
import numpy as np
t0 = perf_counter()
import tensorflow as tf
from tensorflow.keras import datasets
from tensorflow.python.saved_model import signature_constants
from tensorflow.python.saved_model import tag_constants
from tensorflow.python.framework import convert_to_constants
tf.compat.v1.enable_eager_execution() # see github issue above
# Get training and test data
(x_train, y_train), (x_test, y_test) = datasets.mnist.load_data()
x_train = np.expand_dims(x_train, -1) / 255.0
x_test = np.expand_dims(x_test, -1) / 255.0
# TEMPORARY: just use 100 test points to minimise GPU memory
num_points = 100
x_test, y_test = x_test[:num_points], y_test[:num_points]
current_dir = os.path.dirname(os.path.abspath(__file__))
trt_model_dir = os.path.join(current_dir, "CNN_MNIST_TRT")
# Make predictions using saved model, and print the results (NB using an alias
# for tf.saved_model.load, because the normal way of calling this function
# throws an error because for some reason it is expecting a sess)
saved_model_loaded = tf.compat.v1.saved_model.load_v2(
    export_dir=trt_model_dir, tags=[tag_constants.SERVING])
graph_func = saved_model_loaded.signatures[
    signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY]
graph_func = convert_to_constants.convert_variables_to_constants_v2(graph_func)
x_test_tensor = tf.convert_to_tensor(x_test, dtype=tf.float32)
preds = graph_func(x_test_tensor)[0].numpy()
print(preds.shape, y_test.shape)
accuracy = list(preds.argmax(axis=1) == y_test).count(True) / y_test.size
print("Accuracy of predictions = {:.2f} %".format(accuracy * 100))
t1 = perf_counter()
print("Finished inference; time taken = {:.3f} s".format(t1 - t0))
I had the same error on a Jetson TX2. I think it comes from the memory shared between the GPU and the CPU: either TensorFlow doesn't allow itself enough memory, or the OS limits the allocation.
To fix this, you can allow memory growth:
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)
Or you can explicitly set a fixed memory limit for TensorFlow on the GPU:
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Restrict TensorFlow to only allocate 2GB of memory on the first GPU
    try:
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=2048)])
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)
These examples come from https://www.tensorflow.org/guide/gpu
I see in the logs that it created a GPU device with roughly 600 MB:
Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 638 MB memory)
And then it tried to allocate 1 GiB:
Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.00GiB (rounded to 1073742336).
It's also clear that the GPU device has more memory than 600 MB, as visible here in the logs:
2020-06-23 23:06:36.463934: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:00:00.0 name: NVIDIA Tegra X1 computeCapability: 5.3
coreClock: 0.9216GHz coreCount: 1 deviceMemorySize: 3.87GiB deviceMemoryBandwidth: 194.55MiB/s
So maybe your GPU is running some other calculation?

How to bootstrap using large datasets?

I would like to use the boot() and boot.ci() functions from library("boot") on a large data set (~20,000 rows) with type="bca".
If R (the number of bootstrap replicates) is too small (I have tried 1k-10k), then I get the following error:
Error in bca.ci(boot.out, conf, index[1L], L = L, t = t.o, t0 = t0.o, :
estimated adjustment 'a' is NA
However, if I do 15k - 20+k bootstraps, then I get:
Cannot allocate vector size # GB
(usually ranging from 1.7 to 6.4 GB, depending on the dataset and the number of bootstraps).
I read that I need more RAM, but I have a Windows desktop with 16 GB of RAM and I'm using 64-bit R, so my computer should be able to handle this.
How can I use bootstrapping methods on larger datasets if too few bootstraps cannot produce estimates and sufficient bootstraps results in insufficient memory?
My code:
multRegress <- function(mydata){
  numVar <<- NCOL(mydata)
  Variables <<- names(mydata)[2:numVar]
  mydata <- cor(mydata, use = "pairwise.complete.obs")
  RXX <- mydata[2:numVar, 2:numVar]
  RXY <- mydata[2:numVar, 1]
  RXX.eigen <- eigen(RXX)
  D <- diag(RXX.eigen$val)
  delta <- sqrt(D)
  lambda <- RXX.eigen$vec %*% delta %*% t(RXX.eigen$vec)
  lambdasq <- lambda^2
  beta <- solve(lambda) %*% RXY
  rsquare <<- sum(beta^2)
  RawWgt <- lambdasq %*% beta^2
  import <- (RawWgt / rsquare) * 100
  result <<- data.frame(Variables, Raw.RelWeight = RawWgt,
                        Rescaled.RelWeight = import)
}
# function passed to boot
multBootstrap <- function(mydata, indices){
  mydata <- mydata[indices, ]
  multWeights <- multRegress(mydata)
  return(multWeights$Raw.RelWeight)
}
# call boot
multBoot <- boot(thedata, multBootstrap, 15000)
multci <- boot.ci(multBoot, conf = 0.95, type = "bca")
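One memory-related option worth trying (a sketch, not verified on this data, and worth double-checking that boot.ci() can still compute the BCa interval when it is used): boot() has a simple = TRUE argument that generates resampling indices one replicate at a time instead of pre-allocating the full R-by-n index matrix, which for R = 15,000 and n ≈ 20,000 is itself more than a gigabyte.
library(boot)
# Same call as above, but avoid building the 15000 x nrow(thedata) index matrix up front
multBoot <- boot(thedata, multBootstrap, R = 15000, simple = TRUE)
multci <- boot.ci(multBoot, conf = 0.95, type = "bca")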

Using parallel processing in vegan functions?

I am interested in executing the R function adonis from the vegan package in parallel. However, it isn't clear to me how exactly to make it run in parallel. Regardless of how I try to initialize it, it seems to take the same amount of time to execute. Can someone explain what I am doing wrong?
require(vegan)
require(parallel)
data(dune)
data(dune.env)
#This:
system.time(adonis(dune ~ Management * A1, dune.env, perm = 99999))
#Runs faster (4.49 s) than this (6.7 s):
system.time(adonis(dune ~ Management * A1, dune.env, perm = 99999, parallel=3))
#or this (6.7 s)
cl <- makeCluster(3)
system.time(adonis(dune ~ Management * A1, dune.env, perm = 99999, parallel=cl))
stopCluster(cl)
Computer details:
R V4.0
Win 10x64
i5-8350 4 cores
I'm not sure how helpful this answer will really be, but I'll share a few of my own observations and things I've slowly pieced together. I don't pretend to be an expert on this, so take my answer with the understanding that there may be some inaccuracies in here. I'm a biologist first.
Some of these parallel libraries seem to reload the R environment and run any startup files (e.g. .Rprofile) on each core. So there is an inherent time cost to using the parallel libraries, which means you will only see a benefit from parallel functions if the computation is large enough to be worth parallelizing (in your example, the dune dataset is really small; I'll share my own benchmarks below). That said, there are a few things that seem to help.
Using the doParallel library, you can pass arguments so the workers don't load unnecessary information into their sessions, like so:
library(doParallel)
cl <- makeCluster(3, rscript_args = c("--no-init-file", "--no-site-file","--no-environ"))
#for linux .... cl <- makePSOCKcluster(2)
registerDoParallel(cl)
unif_w = UniFrac(d, weighted=T, parallel=T, normalized = T)
unif_uw = UniFrac(d, weighted=F, parallel=T)
stopCluster(cl)
I noticed in my own work that adding the rscript_args option greatly improved my speeds (sorry, no benchmarks for this; I'm hoping to get a quick answer out). If I remember the source where I got that suggestion from, I'll come back and share it.
This doesn't help with running adonis, but I think that initial time cost might explain why we don't see a time benefit from the parallel options built into adonis on the dune dataset. Here are my benchmarks:
> data("dune")
> data("dune.env")
> system.time(adonis(dune ~ Management * A1, dune.env, perm = 99999))
user system elapsed
3.90 0.00 3.93
> #Runs faster (4.49 s) than this (6.7 s):
> system.time(adonis(dune ~ Management * A1, dune.env, perm = 99999, parallel=3))
user system elapsed
0.71 0.04 6.53
Not a big difference on this set, but it IS slower in parallel. However, repeating this with a large set I'm working with at the moment (bc is a distance matrix calculated from a species matrix of 887 species by 3,734 sites):
> system.time(adonis(bc ~ fmet$Diagnosis, parallel = 1))
user system elapsed
109.95 21.27 131.22
> system.time(adonis(bc ~ fmet$Diagnosis, parallel = 4))
user system elapsed
3.44 1.41 82.36
Long story short, in this specific case you will probably only see a benefit from adonis's parallel option on a larger dataset.
I'm not sure how important computer specs are here, but I do have a lot of memory intended for exactly this kind of purpose. In my case the memory matters more because it lets me work with large matrices a little more easily.
R version: 4.0.2
Windows 10, 64bit
AMD Ryzen 3600
64gb DRAM
Anyways, I'm still looking for other work-arounds and tricks.
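One more way to amortize that startup cost (a sketch, assuming you have several vegan calls to run in one session): create the socket cluster once and pass the same cluster object to each call, instead of letting every call spin its own workers up and down.
library(vegan)
library(parallel)
data(dune)
data(dune.env)
# Pay the worker start-up cost once, then reuse the cluster for several fits
cl <- makeCluster(3)
ad1 <- adonis(dune ~ Management * A1, dune.env, permutations = 99999, parallel = cl)
ad2 <- adonis(dune ~ Management, dune.env, permutations = 99999, parallel = cl)
stopCluster(cl)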

Can I make this R foreach loop faster?

Thanks in advance for your help.
The short of this is that I have huge foreach loops that are running much slower than I'm used to, and I'm curious as to whether I can speed them up -- it's taking hours (maybe even days).
So, I've been given two large pieces of data (by a friend who needs help). The first is a very large matrix (728,396 rows by 276 columns) of genetic data for 276 participants (I'll call this M1). The second is a dataset (276 rows and 34 columns) of other miscellaneous data about the participants (I'll call this DF1). We're running a multilevel logistic regression model utilizing both sets of data.
I'm using a Windows PC with 8 virtual cores running at 4.7 GHz and 36 GB of RAM.
Here's a portion of the code I've written/modified:
library(pacman)
p_load(car, svMisc, doParallel, foreach, tcltk, lme4, lmerTest, nlme)
load("M1.RDATA")
load("DF1.RDATA")
clust = makeCluster(detectCores() - 3, outfile="")
#I have 4 physical cores, 8 virtual. I've been using 5 because my cpu sits at about 89% like this.
registerDoParallel(clust)
getDoParWorkers() #5 cores
n = 728396
res_function = function(i){
  x = as.vector(M1[i, ])
  # Taking one row of genetic data to be used in the regression
  fit1 = glmer(r ~ x + m + a + e + n + (1 | famid), data = DF1, family = binomial(link = "logit"))
  # Running the model
  c(coef(summary(fit1))[2, 1:4], coef(summary(fit1))[3:6, 1], coef(summary(fit1))[3:6, 4],
    length(fit1@optinfo[["conv"]][["lme4"]][["messages"]]))
  # Collecting data, including whether there are any convergence error messages
}
start_time = Sys.time()
model1 = foreach(i = 1:n, .packages = c("tcltk", "lme4"), .combine = rbind) %dopar% {
  if (!exists("pb")) pb <- tkProgressBar("Parallel task", min = 1, max = n)
  setTkProgressBar(pb, i)
  # This is some code I found here to keep track of my progress
  res_function(i)
}
end_time = Sys.time()
end_time - start_time
stopCluster(clust)
showConnections()
I've run nearly identical code in the past and it took me only about 13 minutes. However, I suspect that this model is taking up more memory than usual on each core (likely due to the second level) and slowing things down. I've read that BiocParallel, Future, or even Microsoft R Open might work better, but I haven't had much success using any of them (likely due to my own lack of know how). I've also read a bit about the package "bigmemory" to more efficiently use the large matrix across cores, but I ran into several errors when I tried to use it (failed workers and such). I'm also curious about the potential of using my GPU (a Titan X Pascal) for some additional umph if anyone knows more about this.
Any advice would be very appreciated!
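One pattern that might help (a sketch, assuming part of the slowdown is per-task overhead rather than glmer itself): hand each worker a block of row indices instead of a single row, so the 728,396 tiny tasks become a few hundred larger ones and the scheduling and result-combining overhead mostly disappears. The per-row tcltk progress bar is dropped here, since each worker would now only report per block.
library(foreach)
library(doParallel)
# Sketch: chunked foreach. M1, DF1, n, res_function and the registered cluster
# are assumed to exist exactly as in the question above.
n_chunks <- getDoParWorkers() * 20   # a few hundred tasks in total
chunks <- split(seq_len(n), cut(seq_len(n), n_chunks, labels = FALSE))
model1 <- foreach(idx = chunks, .packages = "lme4", .combine = rbind) %dopar% {
  # each worker fits every model in its block and returns one matrix of results
  do.call(rbind, lapply(idx, res_function))
}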

Reducing NbClust memory usage

I need some help with the massive memory usage of the NbClust function.
On my data, memory balloons to 56 GB, at which point R crashes with a fatal error. Using debug(), I was able to trace the error to these lines:
if (any(indice == 23) || (indice == 32)) {
    res[nc - min_nc + 1, 23] <- Index.sPlussMoins(cl1 = cl1,
                                                  md = md)$gamma
Debugging Index.sPlussMoins revealed that the crash happens during a for loop. The iteration at which it crashes varies, and during the loop memory usage varies between 41 and 57 GB (I have 64 GB total):
for (k in 1:nwithin1) {
    s.plus <- s.plus + (colSums(outer(between.dist1,
                                      within.dist1[k], ">")))
    s.moins <- s.moins + (colSums(outer(between.dist1,
                                        within.dist1[k], "<")))
    print(s.moins)
}
I'm guessing that the memory usage comes from the outer() function.
Can I modify NbClust to be more memory efficient (perhaps using the bigmemory package)?
At the very least, it would be nice to get R to exit the function with a "cannot allocate vector of size..." error instead of crashing. That way I would have an idea of just how much more memory I need to handle the matrix causing the crash.
Edit: I created a minimal example with a matrix the approximate size of the one I am using, although now it crashes at a different point, when the hclust function is called:
set.seed(123)
cluster_means = sample(1:25, 10)
mlist = list()
for (cm in cluster_means){
  name = as.character(cm)
  m = data.frame(matrix(rnorm(60000*60, mean = cm, sd = runif(1, 0.5, 3.5)), 60000, 60))
  mlist[[name]] = m
}
test_data = do.call(cbind, cbind(mlist))
library(NbClust)
debug(fun = "NbClust")
nbc = NbClust(data = test_data, diss = NULL, distance = "euclidean",
              min.nc = 2, max.nc = 30,
              method = "ward.D2", index = "alllong", alphaBeale = 0.1)
debug: hc <- hclust(md, method = "ward.D2")
It seems to crash before using up the available memory (according to my system monitor, 34 GB of the 64 GB total is in use when it crashes).
So is there any way I can do this without sub-sampling down to manageably sized matrices? And if I did, how would I know how much memory I need for a matrix of a given size? I would have thought my 64 GB would be enough.
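For a rough lower bound (a back-of-the-envelope estimate only, not an accounting of everything NbClust or hclust allocate internally): a Euclidean dist object for n observations stores n(n-1)/2 doubles, and the clustering and index calculations typically need at least one or two further vectors of that length.
# Rough size of one dist object for the 60,000-row example above
n <- 60000
n * (n - 1) / 2 * 8 / 2^30   # ~13.4 GiB before any copies are made
So a single distance object is already around 13 GiB, and a couple of working copies during clustering or index evaluation can plausibly push the process toward the 64 GB limit.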
Edit:
I tried altering NbClust to use fastcluster instead of the stats version. It didn't crash, but did exit with a memory error:
Browse[2]>
exiting from: fastcluster::hclust(md, method = "ward.D2")
Error: cannot allocate vector of size 9.3 Gb
If you check the source code of NbClust, you'll see that it is anything but optimized for speed or memory efficiency.
The crash you're reporting is not even during clustering - it's in the evaluation afterwards, specifically in the "Gamma, Gplus and Tau" index code. Disable these indexes and you may get further, but most likely you'll just have the same problem again in another index. Maybe you can pick only a few indices to run, specifically indices that do not need a lot of memory?
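As a side note (a sketch, not a patch to the package): the loop quoted in the question can be computed without allocating any outer() matrices at all, by sorting the between-cluster distances once and counting with findInterval(). Assuming between.dist1 and within.dist1 are plain numeric vectors and the accumulators start at zero, something like this gives the same totals in a fraction of the memory.
# Memory-lean equivalent of the Index.sPlussMoins loop quoted above
bt <- sort(between.dist1)
# for each within-cluster distance, count between-cluster distances above/below it
s.plus  <- sum(length(bt) - findInterval(within.dist1, bt))        # between > within
s.moins <- sum(findInterval(within.dist1, bt, left.open = TRUE))   # between < within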
I forked NbClust and made some changes that seem to let it go longer without crashing on bigger matrices. I changed some of the functions to use Rfast, propagate and fastcluster. However, there are still problems.
I haven't run all of my data yet, and have only run a few tests on dummy data with the "gap" index, so there is still time for it to fail. But any suggestions/criticisms would be welcome.
My (in progress) fork of NbClust:
https://github.com/jbhanks/NbClust
