OpenCL: Can you have a `const __global` type or just `__global`?

I am looking at a piece of OpenCL code. Currently we build a #define into the kernel source:
"#if __OPENCL_VERSION__ <= 120\n"
"#define " + dataName + "_type __constant\n"
"#else \n"
"#define " + dataName + "_type const __global\n"
Does const __global work or should it just be __global?

If it's a constant, it probably shouldn't be just '__global'. Using 'const __global' is fine and gives you good portability, but the data will be stored in global memory. Graphics cards often have a separate address space and cache for constants, which is very small relative to global memory, and some (usually old) graphics cards have no cache on global memory at all. If the high latency of fetching your constant data from global memory hurts your application's performance, and the constant data is small, say a few KB, then you may get better performance using '__constant'. I don't know whether OpenCL is obliged to use the constant cache when you specify '__constant'; I suspect it can choose to use read-only global memory anyway, or you may get errors at program build time if you try to allocate too much '__constant' memory, or if it is already being used by another application. Other devices, such as CPUs, also support OpenCL, but I don't think they have special memory for constants.
Your code appears to suggest '__constant' is deprecated after OpenCL 1.2, but this is not the case.
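For illustration, here is a minimal kernel sketch (the names are hypothetical) showing both qualifiers side by side: 'const __global' is legal and marks read-only data kept in global memory, while '__constant' requests the dedicated constant space:
__kernel void scale(const __global float* in,   /* read-only, lives in global memory */
                    __global float* out,
                    __constant float* coeffs)   /* small read-only table, constant space */
{
    size_t i = get_global_id(0);
    out[i] = in[i] * coeffs[0];
}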

OpenCL vector data type usage

I'm using a GPU driver that is optimized to work with 16-element vector data type.
However, I'm not sure how to use it properly.
Should I declare it as, for example, cl_float16 on the host, with a length 16 times smaller than the original array?
What is the best way to access this type in the OpenCL kernel?
Thanks in advance.
In host code you can use the cl_float16 host type. Access it like an array (e.g., value.s[5]). Pass it as a kernel argument. In the kernel, access it like value.s5.
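A minimal sketch of that pairing (the buffer, kernel, and variable names are hypothetical, and context/err are assumed to exist):
/* Host side (C) */
cl_float16 value;
value.s[5] = 1.0f;   /* element access on the host */
cl_mem buf = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sizeof(value), &value, &err);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
/* Kernel side */
__kernel void use16(__global float16* data)
{
    float x = data[0].s5;   /* element access in the kernel */
}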
How you declare it on the host is pretty much irrelevant. What matters is how you allocate it, and even that only matters if you plan on creating the buffer with CL_MEM_USE_HOST_PTR and your GPU uses system memory. This is because your memory needs to be properly aligned for GPU zero-copy; otherwise the driver will create a background copy. If your GPU doesn't use system memory for buffers, or you don't use CL_MEM_USE_HOST_PTR, then it doesn't matter - the driver will allocate a proper buffer on the GPU.
Your bigger issue is that your GPU needs to work with 16-element vectors. You will have to vectorize every kernel you want to run on it. In other words, every part of your algorithm needs to work with float16 types. If you just use simple floats, or you declare the buffer as global float16* X but then use element access (X.s0, X.w and such) and work with those, the performance will be the same as if you declared the buffer global float* X - very likely crap.
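To make the contrast concrete, here is a sketch (hypothetical kernels) of a vectorized kernel versus one that declares float16 but works on single lanes:
/* Vectorized: one full-width float16 operation per work-item */
__kernel void scale_vec(__global const float16* in, __global float16* out, float k)
{
    size_t i = get_global_id(0);
    out[i] = in[i] * k;
}
/* Not vectorized: element access defeats the float16 declaration */
__kernel void scale_lane(__global const float16* in, __global float16* out, float k)
{
    size_t i = get_global_id(0);
    out[i].s0 = in[i].s0 * k;   /* only one lane of sixteen does any work */
}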

How to set memory limit to a process in Golang

I use the prlimit syscall to set resource limits on a process. It works for limiting CPU time, but when I test memory usage I run into a problem.
package sandbox

import (
    "syscall"
    "unsafe"
)

func prLimit(pid int, limit uintptr, rlimit *syscall.Rlimit) error {
    // prlimit(2): set the resource limit identified by 'limit' on process 'pid'.
    _, _, errno := syscall.RawSyscall6(syscall.SYS_PRLIMIT64, uintptr(pid), limit,
        uintptr(unsafe.Pointer(rlimit)), 0, 0, 0)
    if errno != 0 {
        return errno
    }
    return nil
}
And here is my test:
func TestMemoryLimit(t *testing.T) {
    proc, err := os.StartProcess("test/memo", []string{"memo"}, &os.ProcAttr{})
    if err != nil {
        panic(err)
    }
    defer proc.Kill()
    var rlimit syscall.Rlimit
    rlimit.Cur = 10
    rlimit.Max = 10 + 1024
    prLimit(proc.Pid, syscall.RLIMIT_DATA, &rlimit)
    status, err := proc.Wait()
    if status.Success() {
        t.Fatal("memory test failed")
    }
}
this is memo:
package main

func main() {
    var a [10000][]int
    for i := 0; i < 1000; i++ {
        a[i] = make([]int, 1024)
    }
}
I allocate a large amount of memory but set a limit of only 10 bytes, yet it never raises a segmentation fault.
RLIMIT_DATA describes the maximum size of a process's data segment. Traditionally, programs that allocate memory enlarge the data segment with calls to brk() to allocate memory from the operating system.
Go doesn't use this approach. Instead, it uses a variant of the mmap() system call to request regions of memory anywhere in the address space. This is much more flexible than a brk() based approach as you can deallocate arbitrary memory regions with munmap(), whereas a brk() based approach can only deallocate memory from the end of the data segment.
The result of this is that RLIMIT_DATA is ineffective in controlling the amount of memory a process uses. Try using RLIMIT_AS instead, but beware that this limit also incorporates the address space you use for file mappings, especially in the case of shared libraries.
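For instance, the test above could set RLIMIT_AS instead (a sketch reusing the prLimit helper; the 64 MiB figure is an arbitrary example value):
var rlimit syscall.Rlimit
rlimit.Cur = 64 << 20 // 64 MiB of address space (arbitrary example value)
rlimit.Max = 64 << 20
if err := prLimit(proc.Pid, syscall.RLIMIT_AS, &rlimit); err != nil {
    t.Fatal(err)
}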
There is a proposal, Soft memory limit; it may be released after Go 1.18.
This option comes in two flavors: a new runtime/debug function called SetMemoryLimit and a GOMEMLIMIT environment variable. In sum, the runtime will try to maintain this memory limit by limiting the size of the heap, and by returning memory to the underlying platform more aggressively. This includes a mechanism to help mitigate garbage collection death spirals. Finally, by setting GOGC=off, the Go runtime will always grow the heap to the full memory limit.
This new option gives applications better control over their resource economy. It empowers users to:
Better utilize the memory that they already have,
Confidently decrease their memory limits, knowing Go will respect them,
Avoid unsupported forms of garbage collection tuning.
Update
This feature will be released in Go 1.19
The runtime now includes support for a soft memory limit. This memory limit includes the Go heap and all other memory managed by the runtime, and excludes external memory sources such as mappings of the binary itself, memory managed in other languages, and memory held by the operating system on behalf of the Go program.
This limit may be managed via runtime/debug.SetMemoryLimit or the equivalent GOMEMLIMIT environment variable.
The limit works in conjunction with runtime/debug.SetGCPercent / GOGC, and will be respected even if GOGC=off, allowing Go programs to always make maximal use of their memory limit, improving resource efficiency in some cases.
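A minimal sketch of setting the limit from code (the 512 MiB value is an arbitrary example; the same effect is available without recompiling via GOMEMLIMIT=512MiB):
package main

import "runtime/debug"

func main() {
    // Ask the runtime to keep the heap plus runtime-managed memory under ~512 MiB.
    debug.SetMemoryLimit(512 << 20)
    // ... rest of the program ...
}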
Here are some suggestions about memory limit usage, per A Guide to the Go Garbage Collector:
While the memory limit is a powerful tool, and the Go runtime takes steps to mitigate the worst behaviors from misuse, it's still important to use it thoughtfully. Below is a collection of tidbits of advice about where the memory limit is most useful and applicable, and where it might cause more harm than good.
Do take advantage of the memory limit when the execution environment of your Go program is entirely within your control, and the Go program is the only program with access to some set of resources (i.e. some kind of memory reservation, like a container memory limit).
A good example is the deployment of a web service into containers with a fixed amount of available memory.
In this case, a good rule of thumb is to leave an additional 5-10% of headroom to account for memory sources the Go runtime is unaware of.
Do feel free to adjust the memory limit in real time to adapt to changing conditions.
A good example is a cgo program where C libraries temporarily need to use substantially more memory.
Don't set GOGC to off with a memory limit if the Go program might share some of its limited memory with other programs, and those programs are generally decoupled from the Go program. Instead, keep the memory limit since it may help to curb undesirable transient behavior, but set GOGC to some smaller, reasonable value for the average case.
While it may be tempting to try and "reserve" memory for co-tenant programs, unless the programs are fully synchronized (e.g. the Go program calls some subprocess and blocks while its callee executes), the result will be less reliable as inevitably both programs will need more memory. Letting the Go program use less memory when it doesn't need it will generate a more reliable result overall. This advice also applies to overcommit situations, where the sum of memory limits of containers running on one machine may exceed the actual physical memory available to the machine.
Don't use the memory limit when deploying to an execution environment you don't control, especially when your program's memory use is proportional to its inputs.
A good example is a CLI tool or a desktop application. Baking a memory limit into the program when it's unclear what kind of inputs it might be fed, or how much memory might be available on the system can lead to confusing crashes and poor performance. Plus, an advanced end-user can always set a memory limit if they wish.
Don't set a memory limit to avoid out-of-memory conditions when a program is already close to its environment's memory limits.
This effectively replaces an out-of-memory risk with a risk of severe application slowdown, which is often not a favorable trade, even with the efforts Go makes to mitigate thrashing. In such a case, it would be much more effective to either increase the environment's memory limits (and then potentially set a memory limit) or decrease GOGC (which provides a much cleaner trade-off than thrashing-mitigation does).

Which is the suitable Memory for this OpenCL Kernel?

I have been trying to do FFT in OpenCL. It worked for me with a kernel like this:
__kernel void butterfly(__global float2* twid, __global float2* X,
                        const int n)
{
    /* Butterfly structure */
}
I call this kernel thousands of times, so the READ/WRITE traffic to global memory takes too much time. The twid (float2) array is only ever read, never modified, while the array X is both read and written.
1. Which is the most suitable type of memory for this?
2. If I use local memory, will I be able to pass it to another kernel as an argument without copying it to global memory?
I am a beginner in OpenCL.
Local memory is only usable within the work-group; it can't be seen by other work-groups and can't be used by other kernels. Only global memory and images can do those things.
Think of local memory as user-managed cache used to accelerate multiple access to the same global memory within a work group.
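As a minimal sketch of that pattern (the kernel and argument names are hypothetical; the __local buffer is sized on the host by calling clSetKernelArg with a byte size and a NULL pointer):
__kernel void butterfly_stage(__global float2* X, __local float2* tile)
{
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);
    tile[lid] = X[gid];               /* stage global data in local memory */
    barrier(CLK_LOCAL_MEM_FENCE);     /* make it visible to the whole work-group */
    /* ... butterfly stages re-reading tile[] instead of X[] ... */
    X[gid] = tile[lid];               /* write the result back once */
}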
If you are doing FFT for small blocks, you may fit into private memory. Otherwise, as Dithermaster said, use local memory.
Also, I've implemented some FFT kernels and strongly advise you to avoid the butterfly scheme unless you're 100% sure of it. Simpler schemes (even matrix multiplication) may show better results because of vectorization and good memory access patterns. The butterfly scheme is optimized for sequential processing; on a GPU it may show poor performance.

Is the access performance of __constant memory the same as __global memory in OpenCL?

As far as I know, constant memory on CUDA is a specific memory space, and it is faster than global memory.
But in OpenCL's spec I find the following words:
The __constant or constant address space name is used to describe variables allocated in global memory and which are accessed inside a kernel(s) as read-only variables
So __constant memory is allocated from __global memory. Does that mean it has the same access performance as __global memory?
It depends on the hardware and software architecture of the OpenCL platform you are using. For example, one can envision an architecture with read-only caches that don't need to participate in cache coherency. These caches could be used for constant memory but not global memory. So you might see faster accesses to constant memory.
That being said, none of the architectures I'm familiar with operate this way. So that's just hypothetical.
The OpenCL standard does not specify how constant memory should be implemented, but in NVIDIA GPUs constant memory is cached. I don't know what AMD does.
To answer this in 2022 (ten years after), I have to!
"
But in OpenCL's Spec. I get the following words.
The __constant or constant address space name is used to describe variables allocated in global memory and which are accessed inside a kernel(s) as read-only variables
So the __constant memory is from the __global memory. Does that mean it have the same accessing performance with the __global memory?
"
CUDA and OpenCL
__constant memory is fast when all threads of a warp access the same memory address. Accessing different locations slows performance, since the accesses are then serialized. One OpenCL FFT implementation used to keep its twiddle factors in __constant memory, but the access pattern (the address) was thread-id dependent. Hacking and binary-editing the code to say just __global plus two spaces (to preserve the binary's length) instead of __constant made it 60% faster on NVIDIA.

Is there a limit to OpenCL local memory?

Today I added four more __local variables to my kernel to dump intermediate results into. But just adding the four more variables to the kernel's signature and adding the corresponding kernel arguments turns all of the kernel's output into zeros. None of the cl functions returns an error code.
I further tried only to add one of the two smaller variables. If I add only one of them, it works, but if I add both of them, it breaks down.
So could this behavior of OpenCL mean that I allocated too much __local memory? How do I find out how much __local memory is available to me?
The amount of local memory which a device offers on each of its compute units can be queried by using the CL_DEVICE_LOCAL_MEM_SIZE flag with the clGetDeviceInfo function:
cl_ulong size;
clGetDeviceInfo(deviceID, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(cl_ulong), &size, 0);
The size returned is in bytes. Each work-group can allocate this much memory strictly for itself. Note, however, that if a work-group allocates the maximum, this may prevent other work-groups from being scheduled concurrently on the same compute unit.
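Relatedly, you can query how much local memory a particular kernel already consumes, which helps diagnose the zeros-with-no-error situation (a sketch; kernel and deviceID are assumed to be valid handles):
cl_ulong used;
clGetKernelWorkGroupInfo(kernel, deviceID, CL_KERNEL_LOCAL_MEM_SIZE,
                         sizeof(cl_ulong), &used, NULL);
/* 'used' covers statically declared __local variables plus __local kernel arguments, in bytes */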
Of course there is, since local memory is physical rather than virtual.
From working with a virtual address space on CPUs, we are used to theoretically having as much memory as we want - potentially failing at very large sizes when the paging file or swap partition runs out, or maybe not even then, until we actually try to use so much memory that it can't be mapped to physical RAM and disk.
This is not the case for things like a computer's OS kernel (or lower-level parts of it) which need to access specific areas in the actual RAM.
It is also not the case for GPU global and local memory. There is no* memory paging (remapping of perceived thread addresses to physical memory addresses), and no swapping. Specifically regarding local memory, every compute unit (= every symmetric multiprocessor on a GPU) has a chunk of RAM used as local memory (shown as green slabs in a diagram not reproduced here). The size of each such slab is what you get with
clGetDeviceInfo( · , CL_DEVICE_LOCAL_MEM_SIZE, · , ·).
To illustrate, on nVIDIA Kepler GPUs, the local memory size is either 16 KBytes or 48 KBytes (and the complement to 64 KBytes is used for caching accesses to Global Memory). So, as of today, GPU local memory is very small relative to the global device memory.
* On nVIDIA GPUs beginning with the Pascal architecture, paging is supported; but that's not the common way of using device memory.
I'm not sure, but I felt this must be seen. Just go through the following links and read them.
A great read: OpenCL – Memory Spaces.
Some related reading:
How do I determine available device memory in OpenCL?
How do I use local memory in OpenCL?
Strange behaviour using local memory in OpenCL
