Problem changing batch size in my model - out of memory

When I train my model (a transformer whose inputs are features extracted from a T5 model and a ViT),
I cannot set batch_size to anything higher than 2.
Number of training images: 25,000.
GPU: RTX 3090 (24 GB of GPU RAM).
CPU: 24 cores with multithreading.
Total number of parameters: 363M
seq_len=512
max-step=100000/2
iter=100000
img: torch.Size([3, 384, 500])
tokens: torch.Size([512])
I want to increase batch_size from 2 to 3, 4, ... but I can't. For example, when I set batch_size=4 I get this error:
CUDA out of memory. Tried to allocat....
(I attached an image of the error.)
But when I decrease it back to 2, the error goes away.
What am I doing wrong?

The problem is exactly what the error says: you are running out of GPU memory. If you want to increase the batch size and you are using PyTorch Lightning, try half precision to reduce memory consumption: https://pytorch-lightning.readthedocs.io/en/latest/common/precision_basic.html
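If you are using PyTorch Lightning this is a single Trainer flag (precision=16 on 1.x, precision="16-mixed" on 2.x). In a plain PyTorch loop the same effect comes from torch.cuda.amp; below is a minimal sketch with a hypothetical stand-in model and dummy batch in place of the real T5/ViT features:

import torch
from torch import nn
from torch.cuda.amp import autocast, GradScaler

# Hypothetical stand-in for the real T5+ViT transformer and data loader.
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True),
    num_layers=6,
).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = GradScaler()

for step in range(10):
    x = torch.randn(4, 512, 768, device="cuda")  # (batch, seq_len, d_model) dummy batch
    optimizer.zero_grad(set_to_none=True)
    with autocast():                              # forward pass runs in float16 where safe
        loss = model(x).mean()
    scaler.scale(loss).backward()                 # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()

Gradient accumulation (accumulate_grad_batches in Lightning, or stepping the optimizer only every N micro-batches in plain PyTorch) is another way to get a larger effective batch size without using more memory.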

Related

Check the number of NVIDIA cores utilized

Is there a way to check the number of stream processors and cores utilized by an OpenCL kernel?
No. However you can make guesses based on your application: If the number of work items is much larger than the number of CUDA cores and the work group size is 32 or larger, all stream processors are used at the same time.
If the number of work items is about the same as or lower than the number of CUDA cores, you won't have full utilization.
If you set the work size to 16, only half of the CUDA cores will be used at any time, but the non-used half is blocked and cannot do other work. (So always set work group size to 32 or larger.)
Tools like nvidia-smi can tell you the time-averaged GPU usage. So if you run your kernel over and over without any delay in between, the usage indicates the average fraction of used CUDA cores at any time.
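There is no query that reports how many cores are in use, but you can at least compare your launch configuration against what the device exposes. A small sketch using pyopencl (assuming it is installed):

import pyopencl as cl

# List each device's compute units and maximum work-group size, then choose a
# work-group size that is a multiple of 32 so whole warps get scheduled.
for platform in cl.get_platforms():
    for device in platform.get_devices():
        print(device.name)
        print("  max compute units:  ", device.max_compute_units)
        print("  max work-group size:", device.max_work_group_size)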

Memory usage in Dual GPU(Multi GPU)

I am using two GPUs of the same configuration for my HPC GPGPU calculations with OpenCL. One of the cards is connected for display purposes, and about 200-300 MB of its memory is always used by two programs, compiz and the X server. My question is: when using these GPUs for computation, I can only use part of the total memory on the GPU that drives the display, whereas on the second GPU I am able to use the entire global memory. In my case I am using two NVIDIA Quadro 410 cards, each with 192 CUDA cores and 512 MB of memory (503 MB usable). On the display GPU I can use only 128 MB for computation, while on the other I can use the full 503 MB.
According to the OpenCL Specification, page 32:
Max size of memory object allocation in bytes. The minimum value is max(1/4th of CL_DEVICE_GLOBAL_MEM_SIZE, 128*1024*1024)
Also, shouldn't this hold for all the GPUs present in the system?
Just continue reading from that point and you will see:
Max size of memory object allocation in bytes. The minimum value is max(1/4th of CL_DEVICE_GLOBAL_MEM_SIZE, 128*1024*1024)
So whichever is greater, 128 MB or 1/4 of the total memory, will be the limit.
OpenCL will automatically swap data in and out of the GPU, so you are not actually limited to the GPU's global memory; you can use more memory in total, as long as you don't use it all at once. You obviously cannot create single objects so big that they don't fit in GPU memory, and that is where this limit kicks in.
The current maximum limit per object is, as pointed out by @huseyin:
CL_DEVICE_MAX_MEM_ALLOC_SIZE (cl_ulong): Max size of memory object allocation in bytes. The minimum value is max(1/4th of CL_DEVICE_GLOBAL_MEM_SIZE, 128*1024*1024)
minimum_max_alloc_size = max(1/4 * CL_DEVICE_GLOBAL_MEM_SIZE, 128 MB)
If you read it carefully, it is the minimum value for the maximum allocation size (tricky wording!).
NVIDIA probably sets it to 1/4 of the memory on a display GPU and to the whole memory size on a non-display GPU, but in both cases NVIDIA is following the spec.
It is something you should query, and operate within the limits the API reports. You cannot change it, and you should not guess it.
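For example, a quick pyopencl sketch (assuming pyopencl is installed) that prints both limits for every device, so you can see what the runtime actually reports on the display and non-display cards:

import pyopencl as cl

# Print the total global memory and the per-buffer allocation cap of each
# device; the latter is what CL_DEVICE_MAX_MEM_ALLOC_SIZE refers to.
for platform in cl.get_platforms():
    for device in platform.get_devices():
        print(f"{device.name}: "
              f"global {device.global_mem_size // 2**20} MB, "
              f"max single allocation {device.max_mem_alloc_size // 2**20} MB")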

ffdf object consumes extra RAM (in GB)

I decided to test the key advantage of the ff package - minimal RAM allocation (PC specs: i5, 8 GB RAM, Win7 64-bit, RStudio).
According to the package description, we can manipulate physical objects (files) like virtual ones, as if they were allocated in RAM. Thus, actual RAM usage is reduced greatly (from GB to KB). The code I used is as follows:
library(ff)
library(ffbase)
setwd("D:/My_package/Personal/R/reading")

# 100 million rows x 3 columns, written out as an ~4.5 GB csv file
x <- cbind(rnorm(100000000), rnorm(100000000), 1:100000000)
system.time(write.csv2(x, "test.csv", row.names = FALSE))

# read it back chunk by chunk as an ffdf object
# (note the very large next.rows value)
system.time(x <- read.csv2.ffdf(file = "test.csv", header = TRUE,
                                first.rows = 100000, next.rows = 100000000,
                                levels = NULL))
print(object.size(x) / 1024 / 1024)
print(class(x))
The actual file size is 4.5 GB, and the actual RAM usage (per Task Manager) varies like this: 2.92 GB -> upper limit (~8 GB) -> 5.25 GB.
The object size (per object.size()) is about 12 KB.
My concern is the extra RAM allocation (~2.3 GB). According to the package description, it should have increased by only about 12 KB. I don't use any character columns.
Maybe I have missed something about the ff package.
Well, I have found a solution to eliminate the use of extra RAM.
First of all, pay attention to the 'first.rows' and 'next.rows' arguments of 'read.table.ffdf' in the ff package.
The first argument ('first.rows') sets the size of the initial chunk in rows and thereby determines the initial memory allocation. I used the default value (1000 rows).
The extra memory allocation is governed by the second argument ('next.rows'). If you want an ffdf object without extra RAM allocations (in my case, gigabytes), you need to choose a number of rows for the subsequent chunks such that the size of a chunk does not exceed the value of 'getOption("ffbatchbytes")'.
In my case I used 'first.rows=1000' and 'next.rows=1000', and the total RAM allocation varied by up to 1 MB in Task Manager.
Increasing 'next.rows' to 10000 caused RAM growth of 8-9 MB.
So these arguments are worth experimenting with to find the best proportions.
Besides, you must keep in mind that increasing 'next.rows' affects the time needed to build the ffdf object (over several runs):
'first.rows=1000' and 'next.rows=1000' takes around 1500 sec (RAM ~1 MB)
'first.rows=1000' and 'next.rows=10000' takes around 230 sec (RAM ~9 MB)

Why is the GTX 630 faster than the GTX 650 Ti for my OpenCL code?

I am using two graphics cards for my OpenCL code.
According to profiling, my GTX 630 (Kepler) runs faster than the GTX 650 Ti for each method request.
From the profiles I found some differences between the two cards, but I cannot understand why occupancy, l1_global_load_hit, l1_global_load_miss, active_warps and active_cycles are lower for the GTX 650 Ti. Can anyone please help me understand these terms better?
Decrease the local work-group size from 1024 down to 512 or 256, maybe even 64, then try again. This leaves more local memory per wave of threads, so more of them can execute simultaneously and occupy more ALUs.
Don't forget to make the total number of threads a multiple of 768 (the number of cores on your faster card) so that the work fills all of its cores evenly, not just a multiple of 384 (like ~1k), which is not good for the faster card; see the sketch below.
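A small sizing sketch in Python (the 256/768/1000000 values are illustrative, not taken from the question): round the global size up to a multiple of both the local size and the core count, then guard the padded items inside the kernel.

import math  # Python 3.9+ for math.lcm

local_size = 256    # smaller than 1024, so more work-groups fit per SMX
cores = 768         # CUDA cores on the GTX 650 Ti
n_items = 1000000   # work items the kernel actually needs

# Round the global size up to a multiple of both values so every compute unit
# receives complete work-groups.
step = math.lcm(local_size, cores)
global_size = math.ceil(n_items / step) * step
print(global_size)  # 1000704; the extra 704 items must be guarded in the kernel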

Limit the number of compute units used by OpenCL

I need to limit the number of compute units used by my OpenCL application.
I'm running it on a CPU that has 8 compute units; I have checked this with CL_DEVICE_MAX_COMPUTE_UNITS.
The execution time I get with OpenCL is far below one eighth of the plain algorithm's time (it is about 600 times faster). I want to use just 1 compute unit so I can see the real improvement brought by the same code once optimized with OpenCL.
It's just for testing, the real application will continue to use all the compute units.
Thanks for your help
If you are using CPUs, why don't you try the OpenCL device fission extension?
Device fission allows you to split a compute device into sub-devices. You can then create a command queue on a sub-device and enqueue kernels only to that subset of your CPU cores.
For example, you can divide your 8-core device into 8 sub-devices of 1 core each.
Take a look at the Device Fission example in the AMD APP SDK.
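A sketch of the same idea through pyopencl, using the OpenCL 1.2 device-partitioning API (the standardized successor of the device fission extension); it assumes your CPU runtime supports partitioning:

import pyopencl as cl

# Pick the CPU device and split its compute units into 1-unit sub-devices.
platform = cl.get_platforms()[0]
cpu = platform.get_devices(device_type=cl.device_type.CPU)[0]
sub_devices = cpu.create_sub_devices([cl.device_partition_property.EQUALLY, 1])

# A context and queue built on a single sub-device restrict every kernel
# enqueued on that queue to one compute unit.
ctx = cl.Context([sub_devices[0]])
queue = cl.CommandQueue(ctx)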
