OpenCL: know local work group size in advance?

I'm working on optimizing a separable image downscaler. My next step is to avoid sampling the same texel multiple times (nearest sampling) by first reading all necessary texels into local memory. Here begins the fun...
The downscaler is versatile: it can downscale anything larger into anything smaller, and can even take a section of an image and downscale it into a destination image. Thus the final resolution divider is never a whole number; most of the time it will be something around 3.97. This means I do not know the required size of that local array at compile time.
To me that means: before enqueuing a task, I'll have to create a local mem object of the required size.
How do I know what workgroup sizes OpenCL will select?
If there is no way, is there a "best practice" to overcome this problem?
P.S.: I'm writing for OpenCL 1.1 compatibility.

Since you are using images, the texture cache can be relied upon instead of using shared local memory.
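If local memory is still wanted, one possible approach (a sketch, not something stated in the answer above) is to not let OpenCL pick the work-group size at all: pass an explicit local_work_size to clEnqueueNDRangeKernel and size the __local buffer on the host from that value and the scale factor. The helper names below are hypothetical, and the bound assumes nearest sampling along one axis with a source/destination ratio of at least 1.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Upper bound on the number of consecutive source texels that one work-group
// of `localSize` output pixels can touch along one axis with nearest sampling.
// `scale` is the source/destination ratio (for example 3.97).
// Hypothetical helper, not part of any OpenCL API.
std::size_t maxSourceTexels(std::size_t localSize, double scale) {
    assert(localSize > 0 && scale >= 1.0);
    // First and last output pixel are (localSize - 1) * scale source texels
    // apart; rounding to whole texels adds at most one more.
    return static_cast<std::size_t>(std::ceil((localSize - 1) * scale)) + 1;
}

// Number of bytes to request for the __local kernel argument.
std::size_t localBytes(std::size_t localSize, double scale,
                       std::size_t bytesPerTexel) {
    return maxSourceTexels(localSize, scale) * bytesPerTexel;
}
```

On the host, the result would be passed as clSetKernelArg(kernel, argIndex, localBytes(...), NULL), where a NULL value with a non-zero size is how OpenCL 1.1 allocates a __local argument (argIndex is a placeholder here). The chosen local size should be checked against CL_KERNEL_WORK_GROUP_SIZE from clGetKernelWorkGroupInfo before enqueuing.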


random memory access and bank conflict

These days I'm trying to program on a mobile GPU (Adreno).
The algorithm I use for image processing has 'randomness' in its memory access.
It refers to some pixels within a 'fixed' range for filtering,
BUT I can't know exactly which pixels will be referenced (it depends on the image).
As far as I understand, if multiple threads access the same local memory bank,
it causes a bank conflict. So in my case bank conflicts should occur.
MY question: can I eliminate bank conflicts with random memory access?
Or can I at least reduce them?
Assuming that the distances between your randomly accessed pixels are roughly normally distributed, you could think of tiling your image into subimages.
What I mean: instead of working with a (let's say) 1024x1024 image, you might have 4x4 subimages of size 256x256. Each of them is kept together in memory, so "near" pixel accesses stay within the same image object. Only the far-distance operations need to access different subimages.
A second option: instead of using CLImage objects, try storing your data in an array. The data in the array can be laid out in Z-order curve order. This also reduces the spatial spread of accesses in memory (compared to row-major order).
But of course, this depends strongly on your image size.
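The Z-order layout mentioned above can be sketched as follows. This is a minimal illustration assuming 16-bit coordinates, with x in the even bits and y in the odd bits; the function names are made up for the example.

```cpp
#include <cassert>
#include <cstdint>

// Spreads the lower 16 bits of v so that bit i moves to bit 2*i.
std::uint32_t part1By1(std::uint32_t v) {
    v &= 0x0000ffffu;
    v = (v | (v << 8)) & 0x00ff00ffu;
    v = (v | (v << 4)) & 0x0f0f0f0fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

// Z-order (Morton) index of pixel (x, y): x occupies the even bits,
// y the odd bits. Pixels that are close in 2D tend to be close in memory.
std::uint32_t mortonIndex(std::uint32_t x, std::uint32_t y) {
    assert(x < 65536u && y < 65536u);
    return part1By1(x) | (part1By1(y) << 1);
}
```

With this layout the 2x2 block at the origin occupies array slots 0 to 3 and the 4x4 block occupies slots 0 to 15, so a small neighbourhood stays within a short address range regardless of image width.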
There are a variety of ways to deal with bank conflicts: changing the size of the elements you are working with, changing the strides between lines, and shifting the coordinates around to different memory addresses. It's never going to be as good as non-random, conflict-free access though, so you will see significantly different compute times depending on the image.
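To illustrate the "strides between lines" point, here is a small host-side model of how padding a row stride changes which banks a column read touches. The bank count of 32 and the 4-byte word size are assumptions for the example; the real figures are hardware specific, and this only helps with regular strided patterns, not truly random ones.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Counts how many distinct banks are hit when `threads` work-items each read
// one word from column 0 of consecutive rows of a row-major local tile.
// `rowStride` is in words; `banks` is an assumed bank count.
std::size_t banksHitByColumnRead(std::size_t threads, std::size_t rowStride,
                                 std::size_t banks) {
    assert(banks > 0);
    std::set<std::size_t> hit;
    for (std::size_t row = 0; row < threads; ++row) {
        std::size_t wordIndex = row * rowStride;  // column 0 of this row
        hit.insert(wordIndex % banks);            // bank that serves this word
    }
    return hit.size();
}
```

Under these assumptions, a stride of 32 words sends all 32 reads to a single bank, while padding the stride to 33 words spreads them over all 32 banks.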

What is the advantage of using a 1d image over a 1d buffer?

I understand that in 2d, images are cached in x and y directions.
But in 1d, why would you want to use an image? Is the memory used
for images faster than memory used for buffers?
A 1D image is still an image, so it has all the advantages an image has over a buffer. That is:
Image IO operations are usually well-cached.
Samplers can be used, which gives benefits like computationally cheap interpolation, hardware-resolved out-of-bounds access, etc.
Though, you should remember that an image has some constraints in comparison to a regular buffer:
A single image can be used either for reading or for writing within one kernel.
You can't use vloadN / vstoreN operations, which can handle up to 16 values per call. Your best option is the read_imageX & write_imageX functions, which can load / store up to 4 values per call. That can be a serious issue on GPUs with a vector architecture.
If you are not using a 4-component format, you usually lose part of the performance, as many functions process samples from all color planes simultaneously. So the useful payload decreases.
If we talk about GPUs, different parts of the hardware are involved in processing images & buffers, so it's difficult to say in general how one is better than the other. Careful benchmarking & algorithm optimizations are needed.

OpenCL : Id of the physical core being used

I'm trying to get something to work but I run out of ideas so I figured I would ask here.
I have a kernel that has a large global size (usually 5 million).
Each of the threads can require up to 1 MB of global memory (exact size not known in advance).
So I figured... OK, on my typical target GPU I have 6 GB and I can run 2880 threads in parallel, more than enough, right?
My idea is to create a big buffer (well, actually 2 because of the max buffer size limitation...)
with each thread pointing to a specific global memory area (with coalescing and stuff, but you get the idea...)
My problem is: how do I know which thread is currently being run (in the kernel code) so I can point to the right memory area?
I did find the cl_arm_get_core_id extension, but this only gives me the workgroup, not the actual thread being used, plus this does not seem to be available on all GPUs, since it's an extension.
I have the option to have work_group_size = nb_compute_units / nb_cores and have the offset to be arm_get_core_id() * work_group_size + global_id() % work_group_size
But maybe this group size is not optimal, and the portability issue still exists.
I can also enqueue a lot of kernels calls with global size 2880, and there I obviously know where to point to with the global Id.
But won't this lead to a lot of overhead because of the 5 million / 2880 kernel calls? Plus, any workgroup that finishes before the others will be idle until all workgroups for this call have finished their job.
Any ideas to do this properly are very welcome !
Well, you are storing 1 MB per WI for temporary computations (because you are not saving them, otherwise you wouldn't have enough memory).
Then why not simply let it spill to global memory? Does the compiler complain? If it does complain, then you need other approaches:
One possibility is to create a queue (just a boolean array) marking which memory zones are free for use by the workgroups. Every time a new workgroup is launched, it takes a free slot and sets its boolean to the "used" state. You can do this with the atomic_cmpxchg() atomic operation.
It may introduce a small overhead when launching each WG, but it would probably be negligible if each WI needs 1 MB of global memory.
Here you have a small example of how to do atomic_cmpxchg() LINK
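The slot-claiming idea above can be sketched on the host with std::atomic, as an analogy to what atomic_cmpxchg() would do on a __global int array inside the kernel. This is not the linked example; the class and method names are invented for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Pool of scratch-memory slots guarded by one flag each (0 = free, 1 = used).
class SlotPool {
public:
    explicit SlotPool(std::size_t count) : flags_(count) {
        assert(count > 0);
        for (auto& f : flags_) f.store(0);  // mark every slot as free
    }

    // Tries each slot in turn; the compare-exchange succeeds only for the
    // caller that flips a flag from 0 to 1. Returns the slot index, or -1
    // if every slot is currently in use.
    int acquire() {
        for (std::size_t i = 0; i < flags_.size(); ++i) {
            int expected = 0;
            if (flags_[i].compare_exchange_strong(expected, 1)) {
                return static_cast<int>(i);
            }
        }
        return -1;
    }

    // Marks a previously acquired slot as free again.
    void release(int slot) {
        flags_[static_cast<std::size_t>(slot)].store(0);
    }

private:
    std::vector<std::atomic<int>> flags_;
};
```

In a kernel, one work-item per workgroup would do the equivalent of acquire() at start-up, share the index with the rest of the group through local memory and a barrier, and release the slot at the end. If the pool has at least as many slots as workgroups that can be resident at once, acquire() should always find a free slot.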

OpenCL - Multiple GPU Buffer Synchronization

I have an OpenCL kernel that calculates total force on a particle exerted by other particles in the system, and then another one that integrates the particle position/velocity. I would like to parallelize these kernels across multiple GPUs, basically assigning some amount of particles to each GPU. However, I have to run this kernel multiple times, and the result from each GPU is used on every other. Let me explain that a little further:
Say you have particle 0 on GPU 0, and particle 1 on GPU 1. The force on particle 0 is changed, as is the force on particle 1, and then their positions and velocities are changed accordingly by the integrator. Then, these new positions need to be placed on each GPU (both GPUs need to know where both particle 0 and particle 1 are) and these new positions are used to calculate the forces on each particle in the next step, which is used by the integrator, whose results are used to calculate forces, etc, etc. Essentially, all the buffers need to contain the same information by the time the force calculations roll around.
So, the question is: What is the best way to synchronize buffers across GPUs, given that each GPU has a different buffer? They cannot have a single shared buffer if I want to keep parallelism, as per my last question (though, if there is a way to create a shared buffer and still keep multiple GPUs, I'm all for that). I suspect that copying the results each step will cause more slowdown than it's worth to parallelize the algorithm across GPUs.
I did find this thread, but the answer was not very definitive and applied only to a single buffer across all GPUs. I would like to know, specifically, for Nvidia GPUs (more specifically, the Tesla M2090).
EDIT: Actually, as per this thread on the Khronos forums, a representative from the OpenCL working group says that a single buffer on a shared context does indeed get spread across multiple GPUs, with each one making sure that it has the latest info in memory. However, I'm not seeing that behavior on Nvidia GPUs; when I use watch -n .5 nvidia-smi while my program is running in the background, I see one GPU's memory usage go up for a while, and then go down while another GPU's memory usage goes up. Is there anyone out there that can point me in the right direction with this? Maybe it's just their implementation?
It sounds like you are having implementation trouble.
There's a great presentation from SIGGRAPH that shows a few different ways to utilize multiple GPUs with shared memory. The slides are here.
I imagine that, in your current setup, you have a single context containing multiple devices with multiple command queues. This is probably the right way to go, for what you're doing.
Appendix A of the OpenCL 1.2 specification says that:
OpenCL memory objects, [...] are created using a context and can be shared across multiple command-queues created using the same context.
The application needs to implement appropriate synchronization across threads on the host processor to ensure that the changes to the state of a shared object [...] happen in the correct order [...] when multiple command-queues in multiple threads are making changes to the state of a shared object.
So it would seem to me that your kernel that calculates particle position and velocity needs to depend on your kernel that calculates the inter-particle forces. It sounds like you already know that.
To put things more in terms of your question:
What is the best way to synchronize buffers across GPUs, given that each GPU has a different buffer?
... I think the answer is "don't have the buffers be separate." Use the same cl_mem object between two devices by having that cl_mem object come from the same context.
As for where the data actually lives... as you pointed out, that's implementation-defined (at least as far as I can tell from the spec). You probably shouldn't worry about where the data is living, and just access the data from both command queues.
I realize this could create some serious performance concerns. Implementations will likely evolve and get better, so if you write your code according to the spec now, it'll probably run better in the future.
Another thing you could try in order to get better (or at least different) buffer-sharing behavior would be to make the particle data a map.
If it's any help, our setup (a bunch of nodes with dual C2070s) seems to share buffers fairly optimally. Sometimes the data is kept on only one device; other times the data exists in both places.
All in all, I think the answer here is to do it in the best way the spec provides and hope for the best in terms of implementation.

How to create a large Compatible Memory DC in GDI programming?

I want to create a large compatible DC, draw a large image on it, then BitBlt part of the image to another DC, in order to achieve high performance. I am using the following code to create the compatible memory DC. But when the rect becomes very large, e.g. 5000*5000, the compatible DC that gets created becomes unstable: sometimes it is OK, sometimes it fails. Is there anything wrong with my code?
// input: pInputDC
pOutputMemDC = new CDC();
CRect rect(0, 0, nDCWidth, nDCHeight);
CBitmap bitmap;
if (bitmap.CreateCompatibleBitmap(pInputDC, rect.Width(), rect.Height()))
    m_pOldBitmap = pOutputMemDC->SelectObject(&bitmap);
CBrush brush;
VERIFY(brush.CreateSolidBrush(RGB(255, 0, 0)));
pOutputMemDC->FillRect(rect, &brush);
Instead of creating a large DC and then blitting a portion of it to another, smaller DC, create a DC the same size as the destination DC, or at least the same size as the blit destination. Then offset all your drawing commands by the (-x,-y) of the subsection you want to copy. If the region you want is (100,200)-(400,400) on the source, then create a 300x200 DC and offset everything by (-100,-200).
This has two big advantages: firstly, the memory required is much smaller. Secondly, GDI will clip your drawing operations to the size of the DC (it always clips anyway). Although the act of clipping takes CPU time, the time saved by not drawing pixels that aren't seen more than makes up for it.
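The offset arithmetic from the example above, written out as a small sketch. The struct and function names are hypothetical and stand in for the corresponding GDI types.

```cpp
#include <cassert>

struct Rect  { int left, top, right, bottom; };
struct Point { int x, y; };
struct Size  { int width, height; };

// Size of the small memory DC needed to hold `region` of the large drawing.
Size memoryDcSize(const Rect& region) {
    assert(region.right >= region.left && region.bottom >= region.top);
    return { region.right - region.left, region.bottom - region.top };
}

// Maps a point given in large-drawing coordinates into the small DC by
// applying the (-left, -top) offset.
Point toMemoryDc(const Rect& region, Point p) {
    return { p.x - region.left, p.y - region.top };
}
```

For the region (100,200)-(400,400) this gives a 300x200 DC, and the source point (100,200) lands at (0,0). In MFC the same shift can usually be applied once with CDC::SetViewportOrg(-100, -200) rather than per drawing call, assuming the default MM_TEXT mapping mode.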
Now, if this large DC is something like an image (JPEG for example) then you need to look into other methods. One technique used by many image editing programs is to split the image into tiles and page the tiles to/from memory/hard disk. Each tile is its own DC and you only have enough source DCs to fill the target DC. As the view window moves across the large image, unload tiles that have moved out of the target rectangle and load tiles that have become visible.
Each 5000x5000 pixel image needs ca. 100MB of RAM. Depending on how much RAM your PC has, this might already be the problem.
If you have 1GB of RAM or more, then that's probably not the issue. In this case, you must have a memory leak. Where do you free the allocated bitmap? I see that you unrealize the brush but how about the bitmap?
Note that increasing your swap won't help since that will kill your performance.
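The "ca. 100MB" figure above can be checked with a little arithmetic. The sketch below assumes rows padded to a multiple of 4 bytes, which is the rule for DIBs; a device-dependent bitmap from CreateCompatibleBitmap is managed by the driver, so its real footprint may differ. The function name is made up for the example.

```cpp
#include <cassert>
#include <cstdint>

// Approximate pixel storage for a bitmap, assuming each row is padded to a
// multiple of 4 bytes (DWORD alignment).
std::uint64_t bitmapBytes(std::uint32_t width, std::uint32_t height,
                          std::uint32_t bitsPerPixel) {
    assert(bitsPerPixel > 0);
    // Round the row length in bits up to whole 32-bit words, then to bytes.
    std::uint64_t stride =
        ((static_cast<std::uint64_t>(width) * bitsPerPixel + 31) / 32) * 4;
    return stride * height;
}
```

At 32 bits per pixel a 5000x5000 bitmap comes to 100,000,000 bytes (about 95 MiB), and at 24 bits per pixel to 75,000,000 bytes, which is consistent with the estimate in the answer.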
Make sure you are selecting all original GDI objects back into the DCs.
The problem may be that your bitmap is still selected into pOutputMemDC when it is being destroyed, so one or both of them can't be deleted properly. That is where the memory problems might begin.
