OpenCL constant memory or value arguments

In OpenCL you can pass a buffer to a kernel via clSetKernelArg and mark that buffer as __constant in the kernel. Alternatively, you can also use clSetKernelArg to pass a value type.
My question is, where does the value type live? Does the API create a constant buffer behind the scenes? Does the API generate a special shader with those values as constant literals?
I'm just curious because I come from a direct3d/opengl background, and constants always had to be passed through constant buffers. So I'm wondering how passing a type by value as an argument works under the hood.
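The two argument styles the question contrasts can be sketched as follows (a hedged fragment against the OpenCL 1.2 host API; `ctx`, `k`, `host_params`, and `err` are hypothetical objects created elsewhere). The spec only guarantees that clSetKernelArg copies the argument's bytes at call time; where a by-value argument physically lives on the device (a register, a parameter block, constant memory) is implementation-defined:

```c
/* Style 1: a buffer object, declared in the kernel as
 *   kernel void k(constant float *params, ...)                    */
cl_mem params = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               16 * sizeof(float), host_params, &err);
clSetKernelArg(k, 0, sizeof(cl_mem), &params);

/* Style 2: a plain value, declared in the kernel as
 *   kernel void k(..., float scale)
 * clSetKernelArg copies the bytes immediately, so the host variable
 * can be reused right away.                                        */
float scale = 2.0f;
clSetKernelArg(k, 1, sizeof(float), &scale);
```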

Related

Why is a buffer used in Win32 API syscall cast to [1<<20]<type> array?

I'm writing a golang application which interacts with Windows Services using the windows/svc package.
When I'm looking at the package source code how syscalls are being done I see interesting cast construct:
name := syscall.UTF16ToString((*[1 << 20]uint16)(unsafe.Pointer(s.ServiceName))[:])
Extracted from mgr.go
This is a common pattern when dealing with the Win32 API, where one needs to pass a pre-allocated buffer to receive a value from a Win32 API function, usually an array or a structure.
I understand that the Win32 API returns a UTF-16 string via a pointer, which in this case is passed to the syscall.UTF16ToString(s []uint16) function to convert it to a Go string.
I'm confused by the part where the unsafe pointer is cast to a pointer to a 1M array, *[1<<20]uint16.
Why the size of 1M (1<<20)?
The buffer for the value is allocated dynamically, not with a fixed size of 1M.
You need to choose a static size for the array type, so 1<<20 is chosen to be large enough to allow for any reasonable buffer returned by the call.
There is nothing special about this size; sometimes you'll see 1<<31 - 1, since it's the largest array size on 32-bit platforms, or 1<<30, since it looks nicer. It really doesn't matter, as long as the type can contain the returned data.
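The trick can be demonstrated without Windows at all. In this sketch, a NUL-terminated []uint16 stands in for the buffer a Win32 call would fill, and a local helper stands in for syscall.UTF16ToString (so the example runs on any platform):

```go
package main

import (
	"fmt"
	"unsafe"
)

// utf16ToString mirrors syscall.UTF16ToString for this demo:
// it reads uint16 code units up to the first NUL. (BMP-only; the
// real function also decodes surrogate pairs.)
func utf16ToString(s []uint16) string {
	for i, v := range s {
		if v == 0 {
			s = s[:i]
			break
		}
	}
	r := make([]rune, len(s))
	for i, v := range s {
		r[i] = rune(v)
	}
	return string(r)
}

func main() {
	// Pretend this NUL-terminated buffer came back from a Win32 call;
	// all the Go side has is a pointer to its first element.
	data := []uint16{'S', 'p', 'o', 'o', 'l', 'e', 'r', 0}
	p := unsafe.Pointer(&data[0])

	// The pattern from mgr.go: cast to a pointer to a huge (never fully
	// accessed) array type, then slice it. 1<<20 is arbitrary; it only
	// needs to be at least as large as the real buffer, and only the
	// elements up to the NUL are ever dereferenced.
	name := utf16ToString((*[1 << 20]uint16)(p)[:])
	fmt.Println(name) // Spooler
}
```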

OpenCL: clSetKernelArg vs. clSetKernelArg + clEnqueueWriteBuffer

A question regarding buffer transfer in OpenCL:
I want to pass a buffer (cl_mem) from the host to the kernel (i.e. to the device).
There are two host-functions:
clEnqueueWriteBuffer
clSetKernelArg
I use clSetKernelArg to pass my buffer as one of the kernel arguments. But does this mean that the buffer is automatically transferred to the device?
Further, there is the function clEnqueueWriteBuffer which writes a buffer to a device.
My question: is there any difference in using (a.) only clSetKernelArg or (b.) clSetKernelArg and clEnqueueWriteBuffer in combination for my use-case (pass buffers to kernel)?
You have to call both functions before enqueuing a kernel for execution.
clSetKernelArg
Used to set the argument value for a specific argument of a kernel.
This one only sets the argument value, e.g. some pointer, for the called kernel. There are no implicit data transfers.
Think of the following examples:
the same memory object is used as argument for different kernels
=> only one write to device needed; but multiple arguments to be set for different kernels
a changing input memory object could be used multiple times with the same kernel
=> one write per call; but kernel argument only set once
a read and a write buffer might be switched using clSetKernelArg() between two calls of the same kernel (double buffering)
=> maybe no transfer, or only every n iterations; but two arguments set before every call
In general: data transfers between the host and the compute device are very expensive and should be minimized, which is easiest to achieve by triggering them explicitly.
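The division of labor described above can be sketched against the OpenCL host API (a hedged, non-runnable fragment; `queue`, `buf`, `kernelA`, `kernelB`, `nbytes`, `host_data`, and `gws` are hypothetical objects created elsewhere):

```c
/* One explicit transfer... */
clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, nbytes, host_data,
                     0, NULL, NULL);

/* ...then the same cl_mem handle can be bound to several kernels
 * without any further copies: clSetKernelArg only records the
 * handle, it moves no data. */
clSetKernelArg(kernelA, 0, sizeof(cl_mem), &buf);
clSetKernelArg(kernelB, 2, sizeof(cl_mem), &buf);
clEnqueueNDRangeKernel(queue, kernelA, 1, NULL, &gws, NULL, 0, NULL, NULL);
clEnqueueNDRangeKernel(queue, kernelB, 1, NULL, &gws, NULL, 0, NULL, NULL);
```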
https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/clSetKernelArg.html
https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/clEnqueueWriteBuffer.html

Inferring type information from memory read size

I am using PIN to instrument my application binary and generating a list of addresses (more specifically, memory reads) made by the application. I have an instrumentation routine which passes IARG_MEMORYREAD_SIZE and IARG_MEMORYREAD_EA as arguments. However, I want to infer the type information of the application's variables based on the size of the memory read.
For example,
If PIN observes a memory read of 4 bytes, how can I conclude what type of data is being accessed? Is it an int or a float? Similarly, for 8-byte data, how would I know if the data is a double or a pointer?
There is no way you can infer the type of the operand just from its size. I even doubt you can do it reliably from the instruction itself.

OpenCL: cannot pass a integer to constant memory

I wrote a kernel like this
kernel void computeLayerOutput_Rolled(global Layer* restrict layers, constant int* restrict netSpec, __constant const int layer1)
But a cl::Error occurred when creating the kernel, and the error information is
:29:123: error: invalid address space for argument to __kernel function
kernel void computeLayerOutput_Rolled(global Layer* restrict layers, constant int* restrict netSpec, __constant const int layer1)
^
:29:123: error: parameter may not be qualified with an address space
terminate called after throwing an instance of 'cl::Error'
what(): clCreateKernel
When I remove the __constant qualifier from layer1, everything is OK, but I don't want to put it into private memory, since it might occupy a register in every work item. Passing an array with only one element doesn't seem like a very elegant solution either.
I'm just wondering: is there any other way to solve this?
The spec says (labels mine):
(1) The generic address space name for arguments to a function in a
program, or local variables of a function is __private. All arguments
to a __kernel function shall be in the __private address space.
(2) __kernel function arguments declared to be a pointer of a type can point to one of the following address spaces only: __global, __local
or __constant.
In other words, only pointers can be qualified as __constant in kernel arguments. layer1 is not a pointer, so it can't be __constant.
I don't want to put it into the private memory since it might occupy a register in every work item.
layer1 may already be using a register, because it is already in private memory: as quote (1) indicates, all arguments to a kernel function are in the __private address space, which may map to registers.
To clarify, when writing constant int* restrict netSpec, do not confuse:
Address space of the pointer (the argument netSpec is in __private address space)
Address space of the pointee (what netSpec points to is in __constant address space)
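Putting quotes (1) and (2) together, a signature that compiles simply drops the address space qualifier from the scalar (a minimal sketch of the fix; `Layer` is the asker's own type):

```c
// Only pointer arguments may name an address space; the scalar
// layer1 is in __private by definition, so it takes no qualifier.
kernel void computeLayerOutput_Rolled(global Layer* restrict layers,
                                      constant int* restrict netSpec,
                                      const int layer1)
```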
I encountered the same 'problem' a while ago. I was used to the way CUDA C/C++ handles constant memory and tried to find a way to get the same convenience in OpenCL. Long story short, I didn't find one. If you want constant memory, you can do two things:
pass in a pointer to the constant address space as you suggested, or
use #defines and recompile
Neither solution is very pretty, but they do what they are supposed to.
However, if the only problem with passing an integer is the fear of higher register usage, then you can go ahead and just pass the integer by value. A single integer value will occupy as many registers as a pointer to constant memory would, if the pointer size is 32 bits.

Proper way to inform OpenCL kernels of many memory objects?

In my OpenCL program, I am going to end up with 60+ global memory buffers that each kernel needs to be able to access. What's the recommended way of letting each kernel know the location of each of these buffers?
The buffers themselves are stable throughout the life of the application -- that is, we will allocate the buffers at application start, call multiple kernels, then only deallocate the buffers at application end. Their contents, however, may change as the kernels read/write from them.
In CUDA, the way I did this was to create 60+ program scope global variables in my CUDA code. I would then, on the host, write the address of the device buffers I allocated into these global variables. Then kernels would simply use these global variables to find the buffer it needed to work with.
What would be the best way to do this in OpenCL? It seems that CL's global variables are a bit different from CUDA's, but I can't find a clear answer on whether my CUDA method will work, and if so, how to go about transferring the buffer pointers into global variables. If that won't work, what's the best way otherwise?
60 global variables sure is a lot! Are you sure there isn't a way to refactor your algorithm a bit to use smaller data chunks? Remember, each kernel should be a minimum work unit, not something colossal!
However, there is one possible solution. Assuming your 60 arrays are of known size, you could store them all into one big buffer, and then use offsets to access various parts of that large array. Here's a very simple example with three arrays:
A is 100 elements
B is 200 elements
C is 100 elements
big_array = A[0:100] B[0:200] C[0:100]
offsets = [0, 100, 300]
Then, you only need to pass big_array and offsets to your kernel, and you can access each array. For example:
A[50] = big_array[offsets[0] + 50]
B[20] = big_array[offsets[1] + 20]
C[0] = big_array[offsets[2] + 0]
I'm not sure how this would affect caching on your particular device, but my initial guess is "not well." This kind of array access is a little nasty, as well. I'm not sure if it would be valid, but you could start each of your kernels with some code that extracts each offset and adds it to a copy of the original pointer.
On the host side, in order to keep your arrays more accessible, you can use clCreateSubBuffer: http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clCreateSubBuffer.html which would also allow you to pass references to specific arrays without the offsets array.
I don't think this solution will be any better than passing 60 kernel arguments, but depending on your OpenCL implementation's clSetKernelArg overhead, it might be faster. It will certainly reduce the length of your argument list.
You need to do two things. First, each kernel that uses each global memory buffer should declare an argument for each one, something like this:
kernel void awesome_parallel_stuff(global float* buf1, ..., global float* buf60)
so that each utilized buffer for that kernel is listed. And then, on the host side, you need to create each buffer and use clSetKernelArg to attach a given memory buffer to a given kernel argument before calling clEnqueueNDRangeKernel to get the party started.
Note that if the kernels keep using the same buffer across kernel executions, you only need to set up the kernel arguments once. A common mistake I see, which can hurt host-side performance, is repeatedly calling clSetKernelArg in situations where it is completely unnecessary.