A question regarding buffer transfer in OpenCL:
I want to pass a buffer (cl_mem) from the host to the kernel (i.e. to the device).
There are two host-functions:
clEnqueueWriteBuffer
clSetKernelArg
I use clSetKernelArg to pass my buffer to one of the kernel arguments. But does this mean that the buffer is automatically transferred to the device?
Further, there is the function clEnqueueWriteBuffer which writes a buffer to a device.
My question: is there any difference between using (a) only clSetKernelArg or (b) clSetKernelArg and clEnqueueWriteBuffer in combination for my use case (passing buffers to the kernel)?
You have to call both functions before enqueuing a kernel for execution.
clSetKernelArg
Used to set the argument value for a specific argument of a kernel.
This one only sets the argument value, e.g. some pointer, for the called kernel. There are no implicit data transfers.
Think of the following examples:
the same memory object is used as argument for different kernels
=> only one write to device needed; but multiple arguments to be set for different kernels
a changing input memory object could be used multiple times with the same kernel
=> one write per call; but kernel argument only set once
a read and a write buffer might be switched using clSetKernelArg() between two calls of the same kernel (double buffering)
=> maybe no transfer, or only every n iterations; but two arguments set before every call
In general: data transfers between the host and the compute device are very expensive and should hence be avoided where possible, which is easiest when you trigger them explicitly.
https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/clSetKernelArg.html
https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/clEnqueueWriteBuffer.html
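To make the division of labour concrete, here is a minimal host-side sketch; context, queue, kernel, host_data and N are assumed to already exist, and error handling is omitted:

cl_int err;
cl_mem buf = clCreateBuffer(context, CL_MEM_READ_ONLY,
                            N * sizeof(float), NULL, &err);

/* explicit data transfer: copies host_data into the device buffer */
err = clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0,
                           N * sizeof(float), host_data, 0, NULL, NULL);

/* no data transfer here: merely binds buf to argument 0 of the kernel */
err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);

err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                             0, NULL, NULL);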
I need to write an OpenCL program for reducing a large buffer (several million floats) into a single float. For the simplicity of the question I will suppose here that I need to compute the sum of all floats.
So I have written a kernel which takes a float buffer as input and sums it in packets of 64, writing the result to a buffer that is 64 times smaller. I then iterate the call of this kernel until the data is small enough to be copied back to the host and summed by the CPU.
I'm new to OpenCL, do I need to have a barrier between each kernel so that they are run sequentially, or is OpenCL smart enough to detect that the nth kernel pass is writing to an output buffer used as the input buffer of the n+1th kernel?
Or is there a smarter approach?
If you are using a single, in-order command queue for all of your kernel launches (i.e. you do not use the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property), then each kernel invocation will run to completion before the next begins - you do not need any explicit barriers to enforce this behaviour.
If you are using an out-of-order command queue or multiple queues, you can enforce data dependencies via the use of OpenCL events. Each call to clEnqueueNDRangeKernel can optionally return an event object, which can be passed to subsequent commands as dependencies.
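For illustration, here is a sketch of how the reduction loop could chain passes with events on an out-of-order queue; queue, reduce_kernel, in_buf and out_buf are placeholder names, not part of the question:

cl_event prev = NULL;
size_t n = 1 << 20;   /* hypothetical input size */
while (n > 64) {
    size_t global = n / 64;
    clSetKernelArg(reduce_kernel, 0, sizeof(cl_mem), &in_buf);
    clSetKernelArg(reduce_kernel, 1, sizeof(cl_mem), &out_buf);
    cl_event done;
    /* each pass waits for the previous one before it may start */
    clEnqueueNDRangeKernel(queue, reduce_kernel, 1, NULL, &global, NULL,
                           prev ? 1 : 0, prev ? &prev : NULL, &done);
    if (prev) clReleaseEvent(prev);
    prev = done;
    /* the output of this pass becomes the input of the next */
    cl_mem tmp = in_buf; in_buf = out_buf; out_buf = tmp;
    n = global;
}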
I am trying to reduce register pressure in my kernel. There are certain fixed values that I am currently calculating, such as the dimensions of the image I am processing; does it make sense to pass these dimensions in as kernel arguments? They are fixed for all work groups. I read somewhere that kernel arguments get special treatment and are not assigned to registers.
The OpenCL spec mandates that kernel arguments be in the __private address space, so in theory kernel arguments may be stored in registers, constant memory, a dedicated register file or anything else. In practice, implementations will often put kernel arguments in constant memory (the hardware's constant memory, not the __constant address space). Constant memory is a small read-only memory that GPUs use for broadcasting general data (like camera matrices). It is very fast, much faster than global memory, and of similar speed to local memory.
If you pass a value to the kernel as an argument, it will reside in that constant memory, and there will be no fetch from global memory.
However, that data will eventually be loaded into registers (like any other data) in order to be operated on, so you will not save any registers. But at least it will make your kernel run faster.
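For example, a hypothetical kernel that receives the image dimensions as arguments instead of recomputing them per work-item (the scaling in the body is purely illustrative):

kernel void scale_image(global const float* src, global float* dst,
                        int width, int height)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    if (x >= width || y >= height)
        return;
    /* width and height arrive through the kernel argument area, which
       many implementations back with constant memory; no global fetch
       is needed to read them */
    dst[y * width + x] = 0.5f * src[y * width + x];
}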
Consider a pair of OpenCL kernels which read and write to the same memory locations. As a simple example, consider the following OpenCL program:
__kernel void k1(__global int * a)
{
a[0] = 2*a[1];
}
__kernel void k2(__global int * a)
{
a[1] = a[0]-1;
}
If many threads are launched, running many instances of each of these kernels, the resulting state of global memory is non-deterministic.
This still potentially allows one to write asynchronous algorithms which accept any of the possible orderings of the operations within the kernels.
However, this requires that reads and writes to global GPU memory are atomic.
My questions are
Is this guaranteed to be true on any current GPGPU hardware?
Is this considered undefined behavior by the OpenCL standard? If so, what do common implementations (specifically the one included with the CUDA toolkit) do?
How can one test this concern?
If you enqueue your kernel commands to a single command queue that is created as an in-order queue (i.e. you didn't specify CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE when you created it), then only one kernel command will execute at a time. This means that you won't have any such issues between different kernel instances (although you could still have race conditions between work-items in a single kernel instance, if they are accessing the same memory locations).
If you are using out-of-order queues, or multiple command-queues, then you may indeed have a race condition. There is no guarantee that your load-modify-store sequence will be an atomic operation, and this will cause undefined behaviour.
Depending on what you actually want to do with your kernels, you may be able to make use of OpenCL's built-in atomic functions, which do allow you to perform a particular set of read-modify-write operations in an atomic manner.
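As a sketch of that last option, here is a hypothetical kernel that counts positive elements with atomic_add (a built-in for 32-bit integers in global memory since OpenCL 1.1) rather than a racy load-modify-store:

kernel void count_positive(global const int* data, global int* counter)
{
    size_t i = get_global_id(0);
    if (data[i] > 0)
        atomic_add(counter, 1);   /* atomic read-modify-write on global memory */
}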
I am new to UNIX, and I am studying some of the UNIX system calls, such as brk(), sbrk(), and so on....
The other day I read about the malloc() function, and I was a little confused!
Can anybody tell me why malloc reduces the number of sbrk() system calls that the program must perform?
And another question, do brk(0), sbrk(0) and malloc(0) return the same value?
Syscalls are expensive because of the overhead they incur: you have to switch to kernel mode. A system call enters the kernel by issuing a "trap" or interrupt; it is a request to the kernel for a service, and because it executes in the kernel address space, each call carries the high overhead of switching into the kernel (and then switching back).
This is why malloc reduces the number of calls to sbrk() and brk(). It does so by requesting more memory than you asked for, so that it doesn't have to issue a syscall every time you need more memory.
brk() and sbrk() are different.
brk is used to set the end of the data segment to the value you specify. It says "set the end of my data segment to this address". Of course, the address you specify must be reasonable, the operating system must have enough memory, and you can't make it point somewhere that would exceed the process's maximum data size. Thus, brk(0) is invalid, since you'd be trying to set the end of the data segment to address 0, which is nonsense.
On the other hand, sbrk increments the data segment size by the amount you specify, and returns a pointer to the previous break value. Calling sbrk with 0 is valid; it is a way to get a pointer to the current data segment break address.
malloc is not a system call, it's a C library function that manages memory using sbrk. According to the manpage, malloc(0) is valid, but not of much use:
If size is 0, then malloc() returns either NULL, or a unique pointer
value that can later be successfully passed to free().
So, no, brk(0), sbrk(0) and malloc(0) are not equivalent: the first is invalid, the second is used to obtain the address of the program's break, and the last is of little use.
Keep in mind that you should never mix malloc with brk or sbrk throughout your program. malloc assumes it has full control of brk and sbrk; if you interleave calls to malloc and brk, very weird things can happen.
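A tiny sketch of the sbrk(0) idiom (on glibc, _DEFAULT_SOURCE is assumed to be needed for sbrk to be declared):

#define _DEFAULT_SOURCE
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    void *cur = sbrk(0);   /* an increment of 0 returns the break unchanged */
    printf("current program break: %p\n", cur);
    return 0;
}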
why malloc reduces the number of sbrk() system calls that the program must perform?
Say you call malloc() to request 10 bytes of memory. The implementation may use sbrk (or another system call such as mmap) to request 4K bytes from the OS. When you next call malloc() to request another 10 bytes, it doesn't have to issue a system call; it can simply hand out part of the 4K block it obtained from the earlier system call.
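A hypothetical way to observe this batching on a glibc system (the exact behaviour depends on the allocator; sbrk(0) only reads the break, so it is safe to mix with malloc):

#define _DEFAULT_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    void *b0 = sbrk(0);
    void *p = malloc(10);   /* first small request: break may jump by pages */
    void *b1 = sbrk(0);
    void *q = malloc(10);   /* second request: served from the same chunk */
    void *b2 = sbrk(0);
    printf("%p -> %p -> %p\n", b0, b1, b2);   /* b1 == b2 on typical glibc */
    free(p);
    free(q);
    return 0;
}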
The malloc() function ends up calling the sbrk system call to obtain memory dynamically while the process runs.
malloc() is declared in the stdlib.h header file, so the required allocation is carried out for you by the library function; with sbrk you would have to invoke the system call explicitly yourself.
malloc() returns a pointer to a block of at least the requested size, which you store in a variable.
The sbrk() function increases the program's data segment allocation by the specified number of bytes.

void *p = malloc(4096); // sbrk += 4096 bytes
free(p);                // freeing memory will not bring the break back down by 4096 bytes
p = malloc(4096);       // malloc'ing again will not move the break; it reuses
                        // the existing space, so no sbrk() call results
In my OpenCL program, I am going to end up with 60+ global memory buffers that each kernel is going to need to be able to access. What's the recommended way to for letting each kernel know the location of each of these buffers?
The buffers themselves are stable throughout the life of the application -- that is, we will allocate the buffers at application start, call multiple kernels, then only deallocate the buffers at application end. Their contents, however, may change as the kernels read/write from them.
In CUDA, the way I did this was to create 60+ program scope global variables in my CUDA code. I would then, on the host, write the address of the device buffers I allocated into these global variables. Then kernels would simply use these global variables to find the buffer it needed to work with.
What would be the best way to do this in OpenCL? It seems that CL's global variables are a bit different from CUDA's, but I can't find a clear answer on whether my CUDA method will work, and if so, how to go about transferring the buffer pointers into global variables. If that won't work, what's the best way otherwise?
60 global variables sure is a lot! Are you sure there isn't a way to refactor your algorithm a bit to use smaller data chunks? Remember, each kernel should be a minimum work unit, not something colossal!
However, there is one possible solution. Assuming your 60 arrays are of known size, you could store them all into one big buffer, and then use offsets to access various parts of that large array. Here's a very simple example with three arrays:
A is 100 elements
B is 200 elements
C is 100 elements
big_array = A[0:100] B[0:200] C[0:100]
offsets = [0, 100, 300]
Then, you only need to pass big_array and offsets to your kernel, and you can access each array. For example:
A[50] = big_array[offsets[0] + 50]
B[20] = big_array[offsets[1] + 20]
C[0] = big_array[offsets[2] + 0]
I'm not sure how this would affect caching on your particular device, but my initial guess is "not well." This kind of array access is a little nasty, as well. I'm not sure if it would be valid, but you could start each of your kernels with some code that extracts each offset and adds it to a copy of the original pointer.
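As a sketch of that pointer-extraction idea (plain pointer arithmetic on a global pointer is valid OpenCL C; the arithmetic in the body is purely illustrative):

kernel void combine(global float* big_array, global const int* offsets)
{
    /* derive per-array pointers from the packed buffer, as described above */
    global float* A = big_array + offsets[0];
    global float* B = big_array + offsets[1];
    global float* C = big_array + offsets[2];
    size_t i = get_global_id(0);
    C[i] = A[i] + B[i];   /* purely illustrative */
}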
On the host side, in order to keep your arrays more accessible, you can use clCreateSubBuffer: http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clCreateSubBuffer.html which would also allow you to pass references to specific arrays without the offsets array.
I don't think this solution will be any better than passing the 60 kernel arguments, but depending on how your OpenCL implementation handles clSetKernelArg, it might be faster. It will certainly reduce the length of your argument list.
You need to do two things. First, each kernel that uses each global memory buffer should declare an argument for each one, something like this:
kernel void awesome_parallel_stuff(global float* buf1, ..., global float* buf60)
so that each utilized buffer for that kernel is listed. And then, on the host side, you need to create each buffer and use clSetKernelArg to attach a given memory buffer to a given kernel argument before calling clEnqueueNDRangeKernel to get the party started.
Note that if the kernels will keep using the same buffer with each kernel execution, you only need to setup the kernel arguments one time. A common mistake I see, which can bleed host-side performance, is to repeatedly call clSetKernelArg in situations where it is completely unnecessary.
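To make that concrete, here is a sketch assuming the kernel, queue, buffers and launch geometry already exist (all names here are placeholders):

cl_mem bufs[60];   /* hypothetical: filled earlier via clCreateBuffer */
for (cl_uint i = 0; i < 60; ++i)
    clSetKernelArg(kernel, i, sizeof(cl_mem), &bufs[i]);

/* arguments persist across launches: set them once, enqueue many times */
for (int iter = 0; iter < num_iters; ++iter)
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                           0, NULL, NULL);
clFinish(queue);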