Let's say we have a list of size 500, we have 4 kernels, and we want to give 1/4 of the list to each kernel. How can we do that in OpenCL or PyOpenCL?
I know there should probably be a barrier somewhere, but how do we send the data to the other kernels? Is there an example of scatter in OpenCL?
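There is no scatter primitive in OpenCL itself; one common pattern is to enqueue the same kernel once per slice, shifting each launch with global_offset (or by passing a start index as a kernel argument). Here is a minimal PyOpenCL sketch of that idea; the per-element kernel "process" is a hypothetical placeholder:

    import numpy as np
    import pyopencl as cl

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)
    mf = cl.mem_flags

    a = np.arange(500, dtype=np.float32)
    buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=a)

    src = """
    __kernel void process(__global float *data) {
        int gid = get_global_id(0);   // absolute index; includes the launch offset
        data[gid] *= 2.0f;            // placeholder per-element work
    }
    """
    prg = cl.Program(ctx, src).build()

    chunk = 500 // 4
    for i in range(4):
        # Each enqueue covers one quarter of the list; global_offset
        # shifts get_global_id(0) so the kernel needs no index arithmetic.
        prg.process(queue, (chunk,), None, buf, global_offset=(i * chunk,))

    cl.enqueue_copy(queue, a, buf)

Note that a single launch with global size 500 already runs all work-items in parallel on one device, so no barrier or explicit "send" is needed there; the four-way split really only pays off when scattering across multiple devices, each with its own queue and buffer.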
What is the actual difference of putting multiple kernels in a single program, or compiling a different program for each kernel, excluding source code organization? Specifically, is the register pressure dictated by the size of the program or by the actual kernel that is chosen within the program? Is the sum of all __local storage of all kernels allocated for the run of any of the kernels? Is there any other performance-related observation to make (e.g. code upload size to device, etc.)?
This could be device specific, and I speak from Intel GPU experience. Program-scope resources will only be visible to kernels in that program. Beyond that, register allocation is per-kernel; hence, 1 kernel in K programs vs. K kernels in 1 program has no effect on register pressure. You do build and link per-program. Hence, compiling K kernels in one program is less efficient in terms of startup time if you don't use all of the K kernels.
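For illustration, a minimal PyOpenCL sketch of the two layouts under discussion (the kernels ka and kb are hypothetical placeholders):

    import pyopencl as cl

    ctx = cl.create_some_context()

    src_a = "__kernel void ka(__global float *x) { x[get_global_id(0)] += 1.0f; }"
    src_b = "__kernel void kb(__global float *x) { x[get_global_id(0)] *= 2.0f; }"

    # K kernels in 1 program: one build covers both kernels.
    prg = cl.Program(ctx, src_a + "\n" + src_b).build()
    ka, kb = prg.ka, prg.kb

    # 1 kernel in K programs: each program is built separately, so the
    # startup cost scales with the number of programs you actually build.
    prg_a = cl.Program(ctx, src_a).build()
    prg_b = cl.Program(ctx, src_b).build()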
My pyopencl kernel program is started with a global size of (512, 512); I assume it will run 512x512 = 262,144 times. I want to find the minimum value of a function over my 512x512 image, but I don't want to return 262,144 floats to my CPU to compute the min. I want to run another kernel (possibly waiting in the queue) to find the min value of all 262,144 pixels and then send just that one float to the CPU. I think this would be faster. Should my waiting kernel's global size be (1, 1)? I hope the large 262,144-float Buffer that I created using mf.COPY_HOST_PTR will not cross the GPU/CPU bus before I call the next kernel.
Thanks
Tim
Andreas is right: reduction is the solution. Here is a nice article from AMD explaining how to implement a simple reduction. It discusses different approaches and the performance gains they bring. The example in the article sums all the elements rather than finding the minimum, but it's fairly trivial to modify the given code.
BTW, maybe I am misreading your first sentence, but a kernel launched with a global size of (512, 512) will not run 262,144 times; it runs once, with 262,144 work-items scheduled.
Use a reduction kernel to find the minimum.
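For completeness, here is a minimal two-stage min-reduction sketch in PyOpenCL, along the lines of the summation example in the AMD article (the kernel name and the work-group size of 256 are illustrative choices):

    import numpy as np
    import pyopencl as cl

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)
    mf = cl.mem_flags

    src = """
    __kernel void min_reduce(__global const float *in,
                             __global float *partial,
                             __local float *scratch,
                             const unsigned int n) {
        unsigned int gid = get_global_id(0);
        unsigned int lid = get_local_id(0);

        // Load one element per work-item (pad with +inf past the end).
        scratch[lid] = (gid < n) ? in[gid] : INFINITY;
        barrier(CLK_LOCAL_MEM_FENCE);

        // Tree reduction within the work-group.
        for (unsigned int s = get_local_size(0) / 2; s > 0; s >>= 1) {
            if (lid < s)
                scratch[lid] = fmin(scratch[lid], scratch[lid + s]);
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        if (lid == 0)
            partial[get_group_id(0)] = scratch[0];   // one min per group
    }
    """
    prg = cl.Program(ctx, src).build()

    n = 512 * 512
    data = np.random.rand(n).astype(np.float32)
    group = 256
    ngroups = n // group

    d_in = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=data)
    d_out = cl.Buffer(ctx, mf.WRITE_ONLY, ngroups * 4)

    prg.min_reduce(queue, (n,), (group,),
                   d_in, d_out, cl.LocalMemory(group * 4), np.uint32(n))

    # Second stage: 1024 partial results are cheap to finish on the CPU,
    # or you can run the same kernel again over the partial buffer.
    partial = np.empty(ngroups, dtype=np.float32)
    cl.enqueue_copy(queue, partial, d_out)
    print(partial.min(), data.min())

PyOpenCL also ships pyopencl.reduction.ReductionKernel, which generates this kind of two-stage reduction for you.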
I am using a Quadro FX 880 card. In my image segmentation code, I have divided the image into 4 parts (i.e., if there are 4000 pixels, each part has 1000 pixels). I have 8 kernels in my code: the first 4 are to be executed in parallel, and the next four again in parallel, but only after the first four have finished. Is this possible if I use the same command queue for all 8 kernels, issue a clEnqueueNDRangeKernel command for each of the first four kernels, and specify the out-of-order property (CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE) when creating the command queue? And if this is possible, how do I execute the next four kernels in parallel, after the first four? Can I issue a clWaitForEvents command after the first four kernels and then enqueue the next four? Will this guarantee that the first four kernels are executed in parallel, and the next four after them, but also in parallel?
I think clEnqueueTask would make my code slow, since I have about 1000 pixels per kernel and clEnqueueTask forces both the global work size and the local work size to be just 1.
I am not sure whether all of this can be done, or what is right or wrong, so I just need confirmation. If it cannot be done this way, please suggest an alternative.
Newer devices should support running multiple kernels on multiple command queues in parallel.
The FX 880M is a compute capability 1.2 device. It doesn't support simultaneous/concurrent kernel execution, only execution/copy overlap. You can create a command queue with out-of-order execution, but it won't have the effect you are asking about.
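For reference, the event-dependency pattern being described looks like this in PyOpenCL; the stage1/stage2 kernels and the sizes are hypothetical placeholders, and as noted above the four launches will still serialize on hardware without concurrent kernel execution:

    import numpy as np
    import pyopencl as cl

    ctx = cl.create_some_context()
    # Out-of-order queue: enqueue order no longer implies execution order.
    queue = cl.CommandQueue(
        ctx, properties=cl.command_queue_properties.OUT_OF_ORDER_EXEC_MODE_ENABLE)
    mf = cl.mem_flags

    src = """
    __kernel void stage1(__global float *x) { x[get_global_id(0)] += 1.0f; }
    __kernel void stage2(__global float *x) { x[get_global_id(0)] *= 2.0f; }
    """
    prg = cl.Program(ctx, src).build()

    parts = [cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR,
                       hostbuf=np.zeros(1000, dtype=np.float32))
             for _ in range(4)]

    # Enqueue the first four kernels; on an out-of-order queue they *may*
    # run concurrently if the hardware supports it.
    first = [prg.stage1(queue, (1000,), None, p) for p in parts]

    # Each of the next four kernels declares a dependency on all four
    # stage-1 events, so it starts only after the first four are done.
    second = [prg.stage2(queue, (1000,), None, p, wait_for=first) for p in parts]

    cl.wait_for_events(second)

Using wait_for= (an event wait list) expresses the dependency on the device side, so the host doesn't have to block between the two groups the way clWaitForEvents would.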
I need to limit the number of compute units used by my OpenCL application.
I'm running it on a CPU that has 8 compute units; I've seen that with CL_DEVICE_MAX_COMPUTE_UNITS.
The execution time I get with OpenCL is far less than 1/8 of the plain algorithm without OpenCL (it is something like 600 times faster). I want to use just 1 compute unit because I need to see the real improvement of the same code as optimized by OpenCL.
It's just for testing, the real application will continue to use all the compute units.
Thanks for your help
If you are using CPUs, why don't you try the OpenCL device fission extension?
Device fission allows you to split a compute device into sub-devices. You can then create a command queue on a sub-device and enqueue kernels only to that subset of your CPU cores.
For example, you can divide your 8-core device into 8 sub-devices of 1 core each.
Take a look at the Device Fission example in the AMD APP SDK.
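A minimal sketch of the partitioning step as exposed by PyOpenCL; this uses the OpenCL 1.2 sub-device API, which standardized the cl_ext_device_fission extension, and assumes your CPU runtime supports partitioning:

    import pyopencl as cl

    platform = cl.get_platforms()[0]
    cpu = platform.get_devices(device_type=cl.device_type.CPU)[0]

    # Split the CPU equally into single-core sub-devices.
    subs = cpu.create_sub_devices([cl.device_partition_property.EQUALLY, 1])

    # A context and queue on just one sub-device: kernels enqueued here
    # can only use that single compute unit.
    ctx = cl.Context([subs[0]])
    queue = cl.CommandQueue(ctx, subs[0])
    print(subs[0].max_compute_units)   # -> 1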
I cannot understand what work_dim is for in clEnqueueNDRangeKernel().
So, what is the difference between work_dim=1 and work_dim=2?
And why are work-items grouped into work-groups?
Is a work-item or a work-group a thread running on the device (or neither)?
Thanks ahead!
work_dim is the number of dimensions for the clEnqueueNDRangeKernel() execution.
If you specify work_dim = 1, then the global and local work sizes are unidimensional. Thus, inside the kernels you can only access info in the first dimension, e.g. get_global_id(0), etc.
If you specify work_dim = 2 or 3, then you must also specify 2 or 3 dimensional global and local worksizes; in such case, you can access info inside the kernels in 2 or 3 dimensions, e.g. get_global_id(1), or get_group_id(2).
In practice you can do everything in 1D, but for dealing with 2D or 3D data, it may be simpler to use 2- or 3-dimensional kernels directly. For example, in the case of 2D data such as an image, if each thread/work-item is to deal with a single pixel, it can handle the pixel at coordinates (x, y), with x = get_global_id(0) and y = get_global_id(1).
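As a concrete illustration of the 2D case, a minimal PyOpenCL sketch with a hypothetical per-pixel kernel; work_dim = 2 is implied by passing a 2-tuple as the global size:

    import numpy as np
    import pyopencl as cl

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)
    mf = cl.mem_flags

    w, h = 512, 512
    img = np.random.rand(h, w).astype(np.float32)
    buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=img)

    src = """
    __kernel void invert(__global float *img, const int width) {
        int x = get_global_id(0);   // column index, dimension 0
        int y = get_global_id(1);   // row index, dimension 1
        img[y * width + x] = 1.0f - img[y * width + x];
    }
    """
    prg = cl.Program(ctx, src).build()

    prg.invert(queue, (w, h), None, buf, np.int32(w))   # 2D global size
    cl.enqueue_copy(queue, img, buf)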
A work-item is a thread, while work-groups are groups of work-items/threads.
I believe the division into work-groups / work-items is related to the hardware architecture of GPUs and other accelerators (e.g. Cell/BE); you can map the execution of work-groups to GPU Streaming Multiprocessors (in NVIDIA terms) or SPUs (in IBM/Cell terms), while the corresponding work-items run inside the execution units of the Streaming Multiprocessors and/or SPUs. It's not uncommon to have work-group size = 1 if you are executing kernels on a CPU (e.g. for a quad-core, you would have 4 work-groups, each with one work-item - though in my experience it's usually better to have more work-groups than CPU cores).
Check the OpenCL reference manual, as well as the OpenCL manual for whichever device you are programming. The quick reference card is also very helpful.