My GPU seems to allow 562% use of global memory and 133% use of local memory for a simple PyOpenCL matrix addition kernel. Here is what my script prints:
GPU: GeForce GTX 670
Global Memory - Total: 2 GB
Global Memory - One Buffer: 3.750000 GB
Number of Global Buffers: 3
Global Memory - All Buffers: 11.250000 GB
Global Memory - Usage: 562.585844 %
Local Memory - Total: 48 KB
Local Memory - One Array: 32.000000 KB
Number of Local Arrays: 2
Local Memory - All Arrays: 64.000000 KB
Local Memory - Usage: 133.333333 %
If I increase global memory use much above this point, I get the error: mem object allocation failure
If I increase local memory use above this point, I get the error: invalid work group size
Why doesn't my script fail immediately when local or global memory use exceeds 100%?
The global size is multiplied by 32; that's the error.
A float32 is clearly 4 bytes, which makes each element of the a and b arrays 4 bytes, not 32 (32 is the bit width, not the byte size).
So the correct results for you would be:
Global Memory - Total: 2 GB
Global Memory - One Buffer: 0.468750 GB
Number of Global Buffers: 3
Global Memory - All Buffers: 1.40625 GB
Global Memory - Usage: 70.3125 %
Local Memory - Total: 48 KB
Local Memory - One Array: 4.000000 KB
Number of Local Arrays: 2
Local Memory - All Arrays: 8.000000 KB
Local Memory - Usage: 16.666667 %
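To make the arithmetic concrete, here is a minimal sketch of the wrong and the corrected byte calculation (in Python/NumPy; the element count is a hypothetical stand-in for whatever the script actually uses):

import numpy as np

n = 125_829_120                                  # hypothetical element count per buffer
wrong_bytes = n * 32                             # bug: multiplies by the bit width
right_bytes = n * np.dtype(np.float32).itemsize  # itemsize is 4 bytes
total_global = 2 * 1024**3                       # 2 GB of global memory
print(100.0 * 3 * wrong_bytes / total_global)    # ~562 % for three buffers
print(100.0 * 3 * right_bytes / total_global)    # 70.3125 %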
I have been testing various compression algorithms with Parquet files and have settled on Zstd.
Now, as far as I understand, Zstd uses an adaptive dictionary unless one is explicitly specified, so it begins with an empty one. However, with the dictionary enabled, both the compressed size and the execution time are quite unsatisfactory.
The file size without a dictionary is considerably smaller than with the adaptive one (the number at the end of the name is the compression level):
Name: C:\ParquetFiles\Zstd1 Execution time: 279 ms Size: 13738134
Name: C:\ParquetFiles\Zstd2 Execution time: 140 ms Size: 13207017
Name: C:\ParquetFiles\Zstd9 Execution time: 511 ms Size: 12701030
And for comparison the log from using the adaptive dictionary:
Name: C:\ParquetFiles\ZstdDictZstd1 Execution time: 487 ms Size: 19462825
Name: C:\ParquetFiles\ZstdDictZstd2 Execution time: 402 ms Size: 19292513
Name: C:\ParquetFiles\ZstdDictZstd9 Execution time: 614 ms Size: 19072779
Can you help me understand the significance of this? Shouldn't compression that starts with an empty adaptive dictionary perform at least as well as Zstd with the dictionary disabled?
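For reference, here is a minimal sketch of how such a level sweep might be benchmarked with pyarrow (an assumption on my part; the post doesn't say which Parquet writer was used, and the table here is a stand-in dataset):

import time
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": list(range(1_000_000))})  # stand-in dataset
for level in (1, 2, 9):
    path = f"Zstd{level}.parquet"                # hypothetical output path
    start = time.perf_counter()
    pq.write_table(table, path, compression="zstd", compression_level=level)
    ms = (time.perf_counter() - start) * 1000
    print(f"Name: {path} Execution time: {ms:.0f} ms")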
> adj = as.dist(adj)
Error: cannot allocate vector of size 3.0 Gb
> system("free -h")
total used free shared buff/cache available
Mem: 14G 2.2G 12G 88M 420M 12G
Swap: 0B 0B 0B
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 7799796 416.6 16746142 894.4 16746142 894.4
Vcells 41146681 314.0 520945812 3974.6 1665927139 12710.1
Why can't I allocate a vector of 3.0 GB if I have 12 GB free? I'm working on Linux CentOS, so I don't have access to memory.limit() and memory.size().
This general question has been asked a lot, but I don't see an answer that addresses my specific case. Similar questions exist, but the only answers I see suggest reducing the size of the object, using memory.limit() and memory.size(), or using bsub -q server_name -R. I don't know how to use the latter inside a script, and it doesn't address my question of why I can't allocate a vector when it appears I have the memory to do so. I've also tried including Sys.setenv('R_MAX_VSIZE'=32000000000), but that did not fix it.
Edit: I added the gc() output above. Does it matter that the available 12 GB are virtual memory?
I am trying to stress an Ubuntu container's memory. Typing free in my command terminal provides the following result:
free -m
total used free shared buff/cache available
Mem: 7958 585 6246 401 1126 6743
Swap: 2048 0 2048
I want to stress exactly 10% of the total available memory. Per the stress-ng manual:
-m N, --vm N
start N workers continuously calling mmap(2)/munmap(2) and writing to the allocated
memory. Note that this can cause systems to trip the kernel OOM killer on Linux
systems if not enough physical memory and swap is not available.
--vm-bytes N
mmap N bytes per vm worker, the default is 256MB. One can specify the size as % of
total available memory or in units of Bytes, KBytes, MBytes and GBytes using the
suffix b, k, m or g.
Now, on my target container I run two memory stressors to occupy 10% of my memory:
stress-ng -vm 2 --vm-bytes 10% -t 10
However, the memory usage on the container never reaches 10%, no matter how many times I run it. I tried different timeout values, with no result. The closest it gets is 8.9%; it never reaches 10%. I inspect memory usage on my container this way:
docker stats --no-stream kind_sinoussi
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
c3fc7a103929 kind_sinoussi 199.01% 638.4MiB / 7.772GiB 8.02% 1.45kB / 0B 0B / 0B 7
In an attempt to understand this behaviour, I tried running the same command with an exact number of bytes. In my case, I'll opt for 800 megabytes, since 7958m * 0.1 = 795.8 ≈ 800m.
stress-ng -vm 2 --vm-bytes 800m -t 15
And, I get 10%!
docker stats --no-stream kind_sinoussi
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
c3fc7a103929 kind_sinoussi 198.51% 815.2MiB / 7.772GiB 10.24% 1.45kB / 0B 0B / 0B 7
Can someone explain why this is happening?
Another question: is it possible for stress-ng to stress memory usage to 100%?
stress-ng --vm-bytes 10% will use sysconf(_SC_AVPHYS_PAGES) to determine the available memory. This sysconf() call returns the number of pages that the application can use without hindering any other process, so it is approximately what the free command reports as free memory.
Note that stress-ng will allocate the memory with mmap, so it may be that during run time mmap'd pages may not necessarily be physically backed at the time you check how much real memory is being used.
It may be worth trying to also use the --vm-populate option; this will try and ensure the pages are physically populated on the mmap'd memory that stress-ng is exercising. Also try --vm-madvise willneed to use the madvise() system call to hint that the pages will be required fairly soon.
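A quick way to see the difference between the two baselines (a minimal sketch; these sysconf names are Linux-specific):

import os

page = os.sysconf("SC_PAGE_SIZE")
avail = os.sysconf("SC_AVPHYS_PAGES") * page  # what stress-ng's --vm-bytes 10% is a percentage of
total = os.sysconf("SC_PHYS_PAGES") * page    # total RAM, which docker stats' MEM % is based on
print(f"total: {total / 2**20:.0f} MiB, available: {avail / 2**20:.0f} MiB")
print(f"10% of available: {avail / 10 / 2**20:.0f} MiB")  # less than 10% of total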
If I use a barrier (no matter whether CLK_LOCAL_MEM_FENCE or CLK_GLOBAL_MEM_FENCE) in my kernel, it causes a CL_INVALID_WORK_GROUP_SIZE error. The global work size is 512, the local work size is 128, 65536 items have to be computed, the max work group size of my device is 1024, and I am using only one dimension. For the Java bindings I use JOCL.
The kernel is very simple:
kernel void sum(global float *input, global float *output, const int numElements, local float *localCopy)
{
// copy one element per work-item from global into local memory
localCopy[get_local_id(0)] = input[get_global_id(0)];
barrier(CLK_LOCAL_MEM_FENCE); // or barrier(CLK_GLOBAL_MEM_FENCE)
}
I run the kernel on an Intel(R) Xeon(R) CPU X5570 @ 2.93GHz and can use OpenCL 1.2. The calling method looks like:
kernel.putArg(aCLBuffer).putArg(bCLBuffer).putArg(elementCount).putNullArg(localWorkSize);
queue.put1DRangeKernel(kernel, 0, globalWorkSize, localWorkSize);
But the error is always the same:
[...]can not enqueue 1DRange CLKernel [...] with gwo: null gws: {512} lws: {128}
cond.: null events: null [error: CL_INVALID_WORK_GROUP_SIZE]
What am I doing wrong?
This is expected behaviour on some OpenCL platforms. For example, on my Apple system, the CPU device has a maximum work-group size of 1024. However, if a kernel has a barrier inside, then the maximum work-group size for that specific kernel is reduced to 1.
You can query the maximum work-group size for a specific kernel by using the clGetKernelWorkGroupInfo function with the CL_KERNEL_WORK_GROUP_SIZE parameter. The value returned will be no more than the value returned by clGetDeviceInfo for CL_DEVICE_MAX_WORK_GROUP_SIZE, but is allowed to be less (as it is in this case).
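For example, a minimal sketch of that query in PyOpenCL (the question uses JOCL, but it wraps the same underlying clGetKernelWorkGroupInfo call):

import pyopencl as cl

src = """
kernel void sum(global float *input, global float *output,
                const int numElements, local float *localCopy)
{
    localCopy[get_local_id(0)] = input[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);
}
"""
ctx = cl.create_some_context()
device = ctx.devices[0]
kernel = cl.Program(ctx, src).build().sum
# Per-kernel limit; may be far below CL_DEVICE_MAX_WORK_GROUP_SIZE.
max_wg = kernel.get_work_group_info(
    cl.kernel_work_group_info.WORK_GROUP_SIZE, device)
print(max_wg)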
I'm running R to handle a file which is about 1 GB in size, filtering it into several smaller files and then trying to write them out. I'm getting errors like this at different points throughout the process:
Error: cannot allocate vector of size 79.4 Mb
A vector of this size should be a non-issue given how much memory I should be working with. My machine has 24 GB of memory, and the overwhelming majority of that is still free, even when the R environment with these large objects is up and running and I'm seeing the error above.
free -m
total used free shared buffers cached
Mem: 24213 2134 22079 0 55 965
-/+ buffers/cache: 1113 23100
Swap: 32705 0 32705
Here is R's response to gc():
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 673097 18.0 1073225 28.7 956062 25.6
Vcells 182223974 1390.3 195242849 1489.6 182848399 1395.1
I'm working on Ubuntu 12.04.1 LTS.
Here are some specs from the machine I'm using:
i7-3930K 3.20 GHz hexa-core (6 core), 12MB cache
ASUS P9X79 s2011, DDR3-1333MHz/1600, up to 64GB
32GB DDR3 (8x4GB modules)
128GB SSD drive
Asus nVidia GeForce GTX 660 Ti 2GB (GTX660 TI-DC2O-2GD5)
This is the object I'm attempting to write to file:
dim(plant)
[1] 10409404 13
The 'plant' object is of class "data.frame". Here is one of the lines of code that prompts the error:
write.table(plant, "file.txt", quote=F, sep="\t", row.names=T, col.names=F)
Any help on solving this issue would be much appreciated.
Try with the memory.limit() function:
memory.limit(2000)