I have a couple of questions about a stateful cuDNN LSTM model I'm trying to fit in R using the keras library. I have tensorflow-gpu installed and it seems to be running successfully.
The first thing I'm wondering about is training speed, which only improves by a factor of about 1.3 when using the cuDNN LSTM instead of the ordinary LSTM. I have read of other cases where people got models that train 10 or even 15 times faster with the cuDNN LSTM compared to the normal LSTM. I will post some code below.
I'm also wondering about the GPU memory usage. When the code runs, it only seems to use roughly 8% of GPU memory, which seems a bit low. Could this be connected to the lack of speed-up?
dim(x.train) = (208, 1, 4)
dim(y.train) = (208 , 1)
For the validation sets it's the same, except that 208 is replaced with 42.
batch_size = 1
model <- keras_model_sequential()
model %>%
  layer_cudnn_lstm(units = 1, batch_input_shape = c(1, 1, 4),
                   stateful = TRUE, return_sequences = FALSE) %>%
  layer_dropout(rate = dropout) %>%   # `dropout` is set earlier in my script
  layer_dense(units = 1)              # one output, matching dim(y.train) = (208, 1)

model %>% compile(
  loss = 'mean_squared_error',
  optimizer = optimizer_adam(lr = 0.01, decay = 1e-8),
  metrics = c('mean_squared_error')
)
Epochs <- 500
# train one epoch at a time, resetting the LSTM state between epochs
for (i in 1:Epochs) {
  hist_temp <- model %>% fit(x.train, y.train, epochs = 1, batch_size = batch_size,
                             verbose = 1, shuffle = FALSE,
                             validation_data = list(x.train_val, y.test))
  model %>% reset_states()
}
I'm expecting it to be much faster and more demanding on GPU memory. What have I missed here?
This could have multiple reasons, for example:
1. You have created a bottleneck while reading the data. You should check the CPU, memory and disk usage. You can also increase the batch size to raise GPU utilisation, but you have a rather small sample size, and a batch size of 1 isn't really common ;)
2. Your network is very small, so you don't profit from GPU acceleration as much. You can try increasing the size of the network to test whether GPU usage goes up - see the sketch below.
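For example, a quick experiment along the lines of point 2 - a sketch only, with arbitrary layer and batch sizes, reusing your x.train/y.train (208 samples divide evenly by a batch size of 8):
library(keras)
# sketch: same input shape as your data, but a much larger cuDNN LSTM and batch size
test_model <- keras_model_sequential() %>%
  layer_cudnn_lstm(units = 128, batch_input_shape = c(8, 1, 4),
                   stateful = TRUE, return_sequences = FALSE) %>%
  layer_dense(units = 1)
test_model %>% compile(loss = 'mean_squared_error',
                       optimizer = optimizer_adam(lr = 0.01))
# for a stateful model, batch_size must equal batch_input_shape[1]
test_model %>% fit(x.train, y.train, epochs = 10, batch_size = 8,
                   shuffle = FALSE, verbose = 1)
If GPU utilisation and the cuDNN speed-up grow noticeably with this configuration, the small network and batch size were the limiting factors rather than your GPU setup.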
I hope this helps.
The convolutional LSTM tutorial linked from the Torch for R front page is written to run on the CPU by default - no calls to a GPU device. I can set it up and run it just fine.
When I make a GPU modification, as in the code below,
model <- convlstm(input_dim = 1, hidden_dims = c(64, 1), kernel_sizes = c(3,3), n_layers = 2)
#---CUDA modification
device <- if (cuda_is_available()) torch_device("cuda:0") else "cpu"
model <- model$to(device = device)
#---
optimizer <- optim_adam(model$parameters)
num_epochs <- 100
for (epoch in 1:num_epochs) {
  model$train()
  batch_losses <- c()
  for (b in enumerate(dl)) {
    optimizer$zero_grad()
    # last time step output from the last layer
    preds <- model(b$x$to(device = device))[[2]][[2]][[1]]
    loss <- nnf_mse_loss(preds, b$y$to(dtype = torch_float(), device = device))
    batch_losses <- c(batch_losses, loss$item())
    loss$backward()
    optimizer$step()
  }
  if (epoch %% 10 == 0)
    cat(sprintf("\nEpoch %d, training loss: %3f\n", epoch, mean(batch_losses)))
}
I get 40-50 epochs in before hitting an obvious GPU out-of-memory error:
Error in (function (self, gradient, retain_graph, create_graph) :
CUDA out of memory. Tried to allocate 72.00 MiB (GPU 0; 6.00 GiB total capacity; 4.40 GiB already allocated; 10.19 MiB free; 4.74 GiB reserved in total by PyTorch)
Exception raised from malloc at ..\..\c10\cuda\CUDACachingAllocator.cpp:272 (most recent call first):
00007FFF51AFA7B200007FFF51AFA750 c10.dll!c10::Error::Error [<unknown file> # <unknown line number>]
My GPU is pretty dated, a 2014-2015 GeForce GTX 970M with 3 GB of native memory. But the tensors in this example are not particularly large: they are dim(100, 6, 1, 24, 24) synthetic tensors, although admittedly this structure preserves all of the hidden and cell states. I don't have the background to calculate what this 'should' be using, but my intuition is that something about this setup (or the current R torch implementation) isn't cleaning up after itself, particularly across training epochs.
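The only mitigation I have thought of (an assumption on my part, not a confirmed fix) is to force R's garbage collector to run between epochs, so that finalizers on the intermediate torch tensors fire and their CUDA memory can actually be reused:
# hedged sketch: same training loop as above, with an explicit gc() per epoch
for (epoch in 1:num_epochs) {
  model$train()
  batch_losses <- c()
  for (b in enumerate(dl)) {
    optimizer$zero_grad()
    preds <- model(b$x$to(device = device))[[2]][[2]][[1]]
    loss <- nnf_mse_loss(preds, b$y$to(dtype = torch_float(), device = device))
    batch_losses <- c(batch_losses, loss$item())
    loss$backward()
    optimizer$step()
  }
  gc()  # drop R-side references so the cached CUDA blocks can be reclaimed
  if (epoch %% 10 == 0)
    cat(sprintf("\nEpoch %d, training loss: %3f\n", epoch, mean(batch_losses)))
}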
Is anyone able to either reproduce or easily run a GPU modification of my example? And is there a straightforward solution here, or is there simply a fundamental limit to the capacity of my GPU in this case?
I'm trying to get into deep learning with R. Using various blogs online, I'm testing their code to see how it actually works. With keras, I'm not sure why, but every time I run a model-fitting function the session crashes.
I'm sorry if I haven't provided enough information. I'm running an AMD GPU and CPU.
Example code section
history <- model %>% fit_generator(
  train_generator,
  steps_per_epoch = 100,
  epochs = 100,
  validation_data = validation_generator,
  validation_steps = 50
)
# also tried with use_multiprocessing = FALSE (note: `False` is Python syntax)
and also
hist <- model %>% fit_generator(
  # training data
  train_image_array_gen,
  # epochs
  steps_per_epoch = as.integer(train_samples / batch_size),
  epochs = epochs,
  # validation data
  validation_data = valid_image_array_gen,
  validation_steps = as.integer(valid_samples / batch_size),
  # print progress
  verbose = 2,
  callbacks = list(
    # save best model after every epoch
    callback_model_checkpoint("C:/Users/My Account/Desktop/fruits_checkpoints.h5",
                              save_best_only = TRUE)
  )
)
It looks like the problem is Keras trying to use tensorflow-gpu. Try running the model after installing the CPU version of TensorFlow. Since you are using an AMD GPU, you cannot use the tensorflow-gpu build, which depends on NVIDIA's CUDA/cuDNN libraries.
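A possible sketch, assuming the standard R installers (as far as I know, install_keras() and install_tensorflow() install a CPU-only TensorFlow build by default):
library(keras)
install_keras()                      # default installation does not require CUDA
# or, via the tensorflow package:
# tensorflow::install_tensorflow()   # also CPU-only by default
After reinstalling, restart the R session and run the fit_generator() call again.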
I am trying to fit a random forest model with caret. My training data weighs 129 MB and I'm computing on Google Cloud with 8 cores and 52 GB of RAM. The code I'm using is below:
library(caret)
library(doParallel)
cl <- makeCluster(3, outfile = '')
registerDoParallel(cl)
model <- train(x = as.matrix(X_train),
               y = y_train,
               method = 'rf',
               verbose = TRUE,
               trControl = trainControl(method = 'oob',
                                        verboseIter = TRUE,
                                        allowParallel = TRUE),
               tuneGrid = expand.grid(mtry = c(2:10, 12, 14, 16, 20)),
               # note: randomForest expects `ntree`, not `num.tree`, so this is
               # most likely ignored and the default of 500 trees is grown
               num.tree = 100,
               metric = 'Accuracy',
               performance = 1)
stopCluster(cl)
Despite having 8 cores, any attempt to use more than 3 cores in makeCluster results in the following error:
Error in unserialize(socklist[[n]]) : error reading from connection
So I thought maybe there was a problem with memory allocation and tried with only 3 cores. After a few hours of training, when I was expecting to have a result, the only thing I got, to my amazement, was the following error:
Error: cannot allocate vector of size 1.9 Gb
Still, my Google Cloud instance has 52 GB of memory, so I decided to check how much of it is currently free.
as.numeric(system("awk '/MemFree/ {print $2}' /proc/meminfo", intern=TRUE))
[1] 5606656
That is about 5.6 GB free, i.e. roughly 47 GB of the 52 GB in use. So, assuming the final 2 GB could not be allocated at the end of training, it seems that over 45 GB was used for training the random forest. I know that my training dataset is bootstrapped 100 times to grow the random forest, so 100 copies of the training data weigh around 13 GB. At the same time my total RAM is divided across the 3 workers of the cluster, which gives 39 GB. That should leave me with around 6 GB, but apparently it doesn't. Still, this assumes that no memory is released after individual trees are built, and I doubt that is the case.
Therefore, my questions are:
Are my approximate calculations even ok?
What may cause my errors?
How can I estimate how much RAM I need to train a model with my training data?
You cannot estimate the size of the random forest model precisely, because the size of the individual decision trees varies with the specific resample of the data - i.e. the trees are built dynamically, with stopping criteria that depend on the data distribution.
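You can, however, get a rough empirical estimate. A hedged sketch (the subsample size, mtry value and the linear extrapolation are assumptions on my part, not a precise formula):
library(randomForest)
set.seed(1)
idx   <- sample(nrow(X_train), 50000)                 # hypothetical subsample size
small <- randomForest(x = as.matrix(X_train[idx, ]), y = y_train[idx],
                      ntree = 10, mtry = 8)
per_tree_gb <- as.numeric(object.size(small)) / 10 / 1024^3
# crude extrapolation to 500 trees (the randomForest default) on the full data;
# trees grown on more rows are deeper, so treat this as a lower bound
per_tree_gb * 500 * nrow(X_train) / 50000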
I'm trying to run a decision tree via the caret package. I start my script fresh by removing everything from memory with rm(list = ls()), then I load my training data, which has 3M rows and 522 features. RStudio doesn't show the size in GB, but judging by the error message it is 11.6 GB.
If I'm running R with 64 GB of RAM, is it expected that I see this error? Is there any way around it without resorting to training on smaller data?
rm(list = ls())
library(tidyverse)
library(caret)
library(xgboost)
# read in data
training_data <- readRDS("/home/myname/training_data.rds")
The RStudio environment pane currently shows one object, training_data, with the dims mentioned above.
### Modelling
# tuning & parameters
set.seed(123)
train_control <- trainControl(
  method = "cv",
  number = 5,
  classProbs = TRUE,  # IMPORTANT!
  verboseIter = TRUE,
  allowParallel = TRUE
)
# Fit a decision tree (minus the cad and id fields)
print("begin decision tree regular")
mod_decitiontree <- train(
  cluster ~ .,
  tuneLength = 5,
  data = select(training_data, -c(cad, id)),  # a data frame
  method = "rpart",
  trControl = train_control,
  na.action = na.pass
)
Loading required package: rpart
Error: cannot allocate vector of size 11.6 Gb
I could ask our admin to increase my RAM, but before doing that I want to make sure I'm not missing something. Don't I have lots of RAM available if I'm on 64 GB?
Do I have any options? I tried converting my data frame to a matrix and passing that to caret instead, but it threw an error. Is passing a matrix instead a worthwhile endeavour?
Here is your error message reproduced:
cannot allocate vector of size 11.6 Gb when trying a decision tree
This means that the specific failure happened when R requested another 11.6 GB of memory and could not get it. However, the fit itself may require many such allocations, and, most likely, the rest of the free RAM was already in use.
I don't know the details of your calculation, but I would say that even fitting tree-based models on a 1 GB data set is already very demanding. My advice would be to find a way to take a statistically representative subsample of your data set, so that you don't need such large amounts of RAM.
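A hedged sketch of that subsampling idea (the 10% fraction is arbitrary; it reuses your train_control and column names):
library(caret)
library(dplyr)

set.seed(123)
# stratified 10% sample on the outcome, so class proportions are preserved
sample_idx <- createDataPartition(training_data$cluster, p = 0.10, list = FALSE)
training_sample <- training_data[sample_idx, ]

mod_decitiontree_sample <- train(
  cluster ~ .,
  tuneLength = 5,
  data = select(training_sample, -c(cad, id)),
  method = "rpart",
  trControl = train_control,
  na.action = na.pass
)
If the subsampled fit succeeds, you can judge from its memory footprint how much RAM the full run would realistically need.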
I have a problem with the neuralnet function from the neuralnet package in R.
I designed a simple structure with 82 features as input, only 1 hidden layer with 10 neurons, and 20 output classes. I left the line below, which calls the neuralnet function, running for over 4 hours and it didn't finish!
This is the code:
nn = neuralnet(f, data = train, hidden = 10, err.fct = "sse", threshold = 1,
               learningrate = .05, rep = 1, linear.output = FALSE)
Training a neural network can take arbitrarily long. What affects this time?
Complexity of the network (not a problem here, as your network is quite small)
Size of the training data - even a few thousand samples can take quite a while, and a large number of features also significantly increases the computation time
Training algorithm and its hyperparameters - in particular for SGD-based solutions, a learning rate that is too small (or too big, which causes oscillation)
Type of stopping criterion - there are many ways of checking whether to stop training a NN, some more expensive (validation score) than others (magnitude of the gradient / number of epochs).
In your particular example the training takes at most 100,000 steps (the default stepmax) and uses rprop+ learning, so the most probable problem is the size of the training data. You can try setting stepmax to a much smaller value to see how much time it needs and how good the resulting model is - see the sketch below.
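Something like this (a sketch reusing your call; the cap of 5,000 steps is an arbitrary choice of mine):
library(neuralnet)
system.time(
  nn <- neuralnet(f, data = train, hidden = 10, err.fct = "sse", threshold = 1,
                  learningrate = .05, rep = 1, linear.output = FALSE,
                  stepmax = 5000)   # default stepmax is 1e5
)
# if the algorithm stops before converging within stepmax, neuralnet issues a warning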
In general, neural networks are hard and slow to train; you have to deal with that or switch to other models.
You can easily estimate the computation time and complexity of your code before running it on the full data with the GuessCompx package.
Create fake data with the same characteristics as yours, plus a 20-class Y vector, and a wrapper function:
train = data.frame(matrix(rnorm(300000 * 82, 3), ncol = 82))
train['Y'] = as.character(round(runif(300000, 1, 20)))
nn_test = function(data) {
  nn = neuralnet(formula = Y ~ ., data = data, hidden = 10, err.fct = "sse",
                 threshold = 1, learningrate = .05, rep = 1, linear.output = FALSE)
}
And then do the audit:
library(GuessCompx) # get it by running: install.packages("GuessCompx")
library(neuralnet)
CompEst(train, nn_test)
#### $`TIME COMPLEXITY RESULTS`$best.model
#### [1] "NLOGN"
#### $`TIME COMPLEXITY RESULTS`$computation.time.on.full.dataset
#### [1] "1M 4.86S"
#### $`MEMORY COMPLEXITY RESULTS`$best.model
#### [1] "LINEAR"
#### $`MEMORY COMPLEXITY RESULTS`$memory.usage.on.full.dataset
#### [1] "55535 Mb"
#### $`MEMORY COMPLEXITY RESULTS`$system.memory.limit
#### [1] "16282 Mb"
You can see that the computation time is not the problem, but the memory usage and the system limit might well be, which would explain the long delay. The nn output object alone takes more than 4 GB to store!