I'm running a mandelbrot generator (2D image from static params) on OpenCL.
The program is straightforward:
__kernel
void mandelbrot(__global uchar *output,
                const float xstep,
                const float xoffset,
                const float ystep,
                const float yoffset,
                const int maxiter)
{
    int gid_y = get_global_id(1);
    int gid_x = get_global_id(0);
    // calculate x and y on the fly for every pixel;
    // this is just as fast as reading precalculated rulers from global memory
    float x = gid_x * xstep + xoffset;
    float y = gid_y * ystep + yoffset;
    float real = 0;
    float imag = 0;
    int out = 0;
    for (int curiter = 0; curiter < maxiter; curiter++) {
        float nreal = real*real - imag*imag + x;
        imag = 2*real*imag + y;
        real = nreal;
        if (real*real + imag*imag > 4.0f) {
            out = curiter;
            break;
        }
    }
    // normalize output
    out *= 256.0 / (float)maxiter;
    output[gid_y * get_global_size(0) + gid_x] = out;
}
[EDIT] I posted the full kernel, and swapped rows and columns as suggested. This gained me 18% performance on AMD, but 0% on NVIDIA. The original code was:
output[get_global_id(0) * get_global_size(1) + get_global_id(1)] = out;
[/EDIT]
I'm running it on my Nvidia Quadro 1000M, which has 2 compute units and 96 CUDA cores (48 cores per compute unit).
I'm playing around by changing the local group size when enqueuing the kernel. These are the performance results I get with different sizes when generating a 400Mpixel image.
All numbers are from the OpenCL profiler and exclude the final memory copy back to the host.
The image is 40992x10272 - both height and width are divisible by 48.
rows x columns
 8x8:  397 MPixel/s
 8x12: 505 MPixel/s
 8x16: 523 MPixel/s
 8x24: 521 MPixel/s
 8x32: 520 MPixel/s
 8x48: 520 MPixel/s
 1x48: 321 MPixel/s
 2x32: 424 MPixel/s
 2x48: 523 MPixel/s
 4x24: 519 MPixel/s
 3x32: 525 MPixel/s
 4x32: 525 MPixel/s
 4x48: 525 MPixel/s
12x8:  490 MPixel/s
12x12: 464 MPixel/s
12x24: 505 MPixel/s
12x32: 508 MPixel/s
12x48: 433 MPixel/s
16x8:  499 MPixel/s
16x12: 499 MPixel/s
16x16: 472 MPixel/s
16x24: 450 MPixel/s
16x32: 440 MPixel/s
16x48: 418 MPixel/s
Some of these numbers leave me baffled.
While it is clear why I get best results with 48 columns (thanks to how SIMD operations work), I don't understand:
why does performance degrade dramatically when I use 16 rows per group?
why do I get poor performance with 1x48?
why in heaven's name do I get top performance with 3x32, 4x32, and 8x32?! I would have expected 33% of the SIMD processors to sit idle; instead, it looks as if a workgroup is split across the two compute units?!
why does PREFERRED_WORK_GROUP_SIZE_MULTIPLE return 32 instead of 48?
is there a non-empirical way to figure out the geometry for top performance on any GPU (ATI/NVIDIA/Intel HD), given only what I can query from the OpenCL info structures?
Thanks in advance
I answered a similar question here that you might find interesting to read before the following.
why does performance degrade dramatically when I use 16 rows per group?
Actually, performance already degrades when you use 12 rows. Memory access works by transactions: a transaction fetches a certain number of bytes in one shot. Now, if several work items try to access contiguous elements of an array, one transaction might be enough to serve them all.
Because you access memory in this way:
output[get_global_id(0) * get_global_size(1) + get_global_id(1)] = out;
the bigger the local size is in dimension 0, the more transactions are needed, since you access non-contiguous elements (separated by get_global_size(1) elements). And global memory access is expensive.
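To make the stride concrete, here is a small, purely illustrative host-side calculation of the byte offsets that four consecutive work items write under each indexing (the 40992 width is taken from the image above):

#include <stdio.h>

int main(void) {
    const int width = 40992;   /* get_global_size(1) in the original layout */

    /* Original layout: consecutive work items step by 'width' bytes,
       so each write lands in a different memory transaction. */
    for (int wi = 0; wi < 4; wi++)
        printf("strided:    byte %d\n", wi * width);  /* 0, 40992, 81984, 122976 */

    /* Swapped layout: consecutive work items write consecutive bytes,
       so one 32/64/128-byte transaction can serve many of them. */
    for (int wi = 0; wi < 4; wi++)
        printf("contiguous: byte %d\n", wi);          /* 0, 1, 2, 3 */
    return 0;
}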
So in the case of 12/16 rows, at least 12/16 transactions are needed. This leads to your second question:
why do I get poor performance with 1x48?
Based on what I just said, it seems that performance should be great, since the number of transactions would be minimal.
But here comes the problem of idling threads. The information you got regarding 48 cores per SM is wrong, as others have already pointed out. On NVIDIA hardware, threads execute in groups of 32 called warps. (On AMD these groups are called wavefronts and can be up to 64 threads.) Since in this case your workgroup is composed of 48 threads (1 by 48), 64 threads are scheduled: the number of threads scheduled is always a multiple of 32, because you can't execute a fraction of a warp.
Therefore, in this case a quarter of the threads do nothing. And indeed, comparing with the result you obtained for 2x32 (still 64 threads, 2 warps, but fully utilized), 321 MPixel/s is pretty much 3/4 of 424 MPixel/s.
This result is also worth noting: 2x48: 523 MPixel/s. Here the workgroup size is 96, a multiple of 32, so there are no idling threads.
why in heaven's name do I get top performance with 3x32, 4x32, and 8x32?!
Well, the answer follows from the two previous answers: you use multiples of 32, and you keep the number of threads in dimension 0 relatively small. But let's take a closer look at your results:
2x32: 424 MPixel/s
3x32: 525 MPixel/s
4x32: 525 MPixel/s
8x32: 520 MPixel/s
16x32: 440 MPixel/s
The decrease in performance on the last two lines is easily explained by what was said above. However, the increase in performance between the first and second lines is not.
That increase comes from somewhere else: in the second case, enough warps run on the same SM to hide the memory access latency. You see, the PREFERRED_WORK_GROUP_SIZE_MULTIPLE value only says that you should try to use a multiple of this value for best performance; several warps can be scheduled on the same SM at the same time.
So how does it work? Take the 3x32 case. You have a workgroup composed of 3 warps. Since they belong to the same workgroup, they are scheduled on the same SM, as required by the OpenCL standard (if that weren't the case, synchronization between threads within a workgroup wouldn't be possible).
The first warp runs until it stalls because a memory access is needed. While warp 1 waits for its memory transactions to complete, warp 2 can start to run. Since there are a lot of registers on the SM, it can easily and quickly switch context to run other warps; all the variables of warp 1 stay in the registers allocated to warp 1. Then warp 2 hits the line where a memory access is required and stalls. At that moment the next ready-to-run warp can start: it could be warp 3, but also warp 1 if its memory access has completed. In your case it seems to be warp 3, since there is a difference between 2x32 and 3x32. In the first case there are not enough warps scheduled to hide the memory accesses, while in the second case there are.
As a matter of fact, this also explains the poor performance of the 1x48 size from question 2.
why does PREFERRED_WORK_GROUP_SIZE_MULTIPLE return 32 instead of 48?
Already answered.
is there a non-empirical way to figure out the geometry for top performance on any GPU (ATI/Nvidia/Intel HD), given only what I can query from the OpenCL info structures?
It's like any other language: once you know how it works under the hood, it helps you produce good code on the first try. But you'll still have to benchmark it and go through a process of trial and error to tweak it. Keep in mind that what I've just written is only a small part of what matters for performance; querying some info from OpenCL combined with a good understanding of CPU/GPU architecture will obviously help... but that's it.
Because a lot of the parameters influencing performance are antagonistic, what you gain on one side will be lost on the other.
Therefore, keep benchmarking ;).
It all depends on the code you are not showing. And that is the key.
If your code were very simple, i.e. out = 8;, then your supposition would probably be correct.
However, as you said, PREFERRED_WORK_GROUP_SIZE_MULTIPLE returns 32. This means that 32 is the maximum number of concurrent threads the compute unit can launch in parallel without affecting performance.
For example, there is no sense in launching more than 32 if, with 32, you already exhaust the local memory storage and have to fall back to global memory (which is terribly slow).
If you try to go over the recommended limit, you get exactly that: a performance decrease. It is not that 32 is better; it's the opposite: 48 is bad.
I recommend that you:
Use the automatic size if possible (pass NULL as the local size when enqueuing the kernel). This leads to maximum performance if you are not worried about the shape of the local work size.
Use PREFERRED_WORK_GROUP_SIZE_MULTIPLE as a reference if you need to set the local size manually.
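For example (a minimal host-side sketch of both recommendations; queue, kernel, and device are assumed to already exist, and error handling is omitted):

size_t global[2] = {40992, 10272};  /* image size from the question */

/* Recommendation 1: pass NULL as the local size and let the driver choose. */
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, NULL, 0, NULL, NULL);

/* Recommendation 2: query the preferred multiple as a starting point
   for manual tuning. */
size_t preferred = 0;
clGetKernelWorkGroupInfo(kernel, device,
                         CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                         sizeof(preferred), &preferred, NULL);
/* 'preferred' is 32 on this Quadro; keep the local size a multiple of it. */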
The way your kernel accesses global memory is critical, and determined by the work group and global dimensions:
What addresses will be written by consecutive work items in the same work group? Here the stride is get_global_size(1); you may want to swap X and Y. It is generally faster to address consecutive elements from consecutive work items. This is the most important factor.
What addresses will be written by consecutive work groups? Consecutive work groups are frequently scheduled at the same time on different compute units; they may end up competing for the same channel/bank, resulting in a loss of performance.
It is generally preferable to write 32-bit integers instead of bytes.
To maximize performance, I suggest you introduce more knobs to turn: write kernels computing a block of several pixels (4x2, for example) inside a single work item, then benchmark all combinations of (block size) x (work-group size) x (XY swap) x (image size) and pick the best for your GPU.
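As an example of the block-of-pixels idea, here is a sketch of a blocked variant of the kernel from the question, assuming a hypothetical 4x1 block per work item (the global size in x must then shrink to width/4):

#define BLOCK 4   /* pixels per work item in x; one knob to benchmark */

__kernel void mandelbrot_block(__global uchar *output,
                               const float xstep, const float xoffset,
                               const float ystep, const float yoffset,
                               const int maxiter)
{
    int gid_x = get_global_id(0) * BLOCK;
    int gid_y = get_global_id(1);
    int width = get_global_size(0) * BLOCK;
    float y = gid_y * ystep + yoffset;

    for (int b = 0; b < BLOCK; b++) {
        float x = (gid_x + b) * xstep + xoffset;
        float real = 0.0f, imag = 0.0f;
        int out = 0;
        for (int curiter = 0; curiter < maxiter; curiter++) {
            float nreal = real*real - imag*imag + x;
            imag = 2.0f*real*imag + y;
            real = nreal;
            if (real*real + imag*imag > 4.0f) { out = curiter; break; }
        }
        /* consecutive b values write consecutive bytes: stays coalesced */
        output[gid_y * width + gid_x + b] = (uchar)(out * 256.0f / maxiter);
    }
}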
Related
I measured the execution time of a vector adder with different group sizes; I use only one group in this experiment.
groupsize        execution time
    1                3.6
   50                4.22
  100                4.3
  200                4.28
  300                4.3
  400                4.31
  500                4.38
  600                4.38
  700                4.78
  800                5.18
  900                5.78
 1000                6.4
Can I conclude that one SM can run about 600 work-items together?
I also have some questions; could anybody help me?
Why does the execution time increase sharply when the group size goes from 1 to 50, and again from 600 to 1000?
Thank you very much.
It would be helpful to see some code, both of the kernel and the host enqueueing parameters. The conclusions also depend on what sort of hardware you're running this on - GPU, CPU, accelerator, FPGA, …?
A few ideas:
GPUs can typically run a power-of-two number of threads in parallel in an execution unit. You will likely get better results if you try e.g. 16, 32, 64, 128, etc. CPUs and other accelerators typically have SIMD widths that are powers of 2 too; for example, x86-64 SSE registers hold 4 floats, AVX 8, and AVX-512 16, so it will most likely help there too.
As you can vary group size so freely, I'm going to assume your work-items don't need to coordinate among each other via local memory or barriers. (The problem is embarrassingly parallel.) A group size of 1 in theory allows your compiler, driver, and hardware maximum flexibility for distributing work-items to threads and parallel execution units optimally. So it should not be a surprise that this is the fastest. (Depending on register pressure and memory access patterns it can still sometimes be helpful to manually increase group size for specific types of hardware in the embarrassingly parallel case.)
On GPUs, all items in a work group must run on the same execution unit, in order to be able to coordinate and share local memory. So by increasing the group size, you're limiting the number of execution units the workload can be spread across, and the execution units need to run your work-items serially - you're reducing parallelism. Above 600 you're probably submitting fewer workgroups than your hardware has execution units.
I'm confused by CL_DEVICE_MAX_COMPUTE_UNITS. For instance, on my Intel GPU on a Mac this number is 48. Does this mean the maximum number of parallel tasks running at the same time is 48, or some multiple of 48 (maybe 96, 144, ...)? I know each compute unit is composed of 1 or more processing elements, and each processing element is actually in charge of a "thread"; what if each of the 48 compute units is composed of more than one processing element? In other words, for my Mac, is the "ideal" speedup, although impossible in reality, 48 times faster than a CPU core (assuming the single-"core" computation speeds of the CPU and GPU are the same), or some multiple of 48 (maybe 96, 144, ...)?
Summary: your speedup calculation is a little complicated, but your machine's (an Intel GPU, probably GEN8 or GEN9) fp32 throughput is 768 FLOPs per (GPU) clock, and 1536 for fp16. Let's assume fp32, so the answer is something less than 768x (maybe a third of that, depending on CPU speed). See below for the reasoning and some very important caveats.
A Quick Aside on CL_DEVICE_MAX_COMPUTE_UNITS:
Intel does something wonky with CL_DEVICE_MAX_COMPUTE_UNITS in its GPU driver.
From the clGetDeviceInfo documentation (OpenCL 2.0), CL_DEVICE_MAX_COMPUTE_UNITS says:
The number of parallel compute units on the OpenCL device. A
work-group executes on a single compute unit. The minimum value is 1.
However, the Intel graphics driver does not actually follow this definition and instead returns the number of EUs (Execution Units). An EU is a grouping of the SIMD ALUs plus slots for 7 different SIMD threads (registers and whatnot). Each SIMD thread represents 8, 16, or 32 work items depending on what the compiler picks (we want higher, but register pressure can force us lower).
A workgroup is actually limited to a "slice" (see the figure in section 5.5, "Slice Architecture", of the GEN8 or GEN9 documents), which happens to be 24 EUs in recent hardware. Each slice has its own SLM, barriers, and L3. Given that your Apple machine is reporting 48 EUs, I'd say you have two slices.
Maximum Speedup:
Let's ignore this major annoyance and work with the EU number (and with those arch docs above). For "speedup" I'm comparing against a single-threaded fp32 calculation on the CPU; with good parallelization etc. on the CPU, the speedup would of course be less.
Each of the 48 EUs can issue two SIMD4 operations per clock in ideal circumstances. Assuming those are fused multiply-adds (so really two ops each), that gives us:

48 EUs * 2 SIMD4 ops per EU * 2 (if the op is a fused multiply-add)
  = 192 SIMD4 ops per clock
  = 192 * 4 lanes = 768 FLOPs per clock for single-precision floating point

So your ideal speedup is actually ~768x. But there are a bunch of things that chip away at this ideal number:
Setup and teardown time. Let's ignore this (assume the workload time dominates the runtime).
The GPU clock maxes out around a gigahertz while the CPU runs faster, so factor that ratio in (crudely 1/3, maybe: 3 GHz on the CPU vs. 1 GHz on the GPU).
If the computation is not heavily multiply-add ("mad") based, divide by 2, since I doubled above. Many important workloads are "mad"-dominated, though.
The execution must be mostly non-divergent. If a SIMD thread branches into an if-then-else, the entire SIMD thread (8, 16, or 32 work items) has to execute that code.
Register bank collision delays can reduce EU ALU throughput. Typically the compiler does a great job avoiding this, but it can theoretically chew into your performance a bit (usually a few percent, depending on register pressure).
Buffer address calculation can chew off a few percent too (the EU must spend time doing integer computation for read and write addresses).
If you use too much SLM or too many barriers, the GPU must leave some of the EUs idle so that there's enough SLM for each workgroup on the machine. (You can tweak your algorithm to fix this.)
We must keep the workload compute-bound. If we blow out any cache in the data access hierarchy, we run into scenarios where no thread is ready to run on an EU and must stall. Assume we avoid this.
...and I'm probably forgetting other things that can go wrong.
We call "efficiency" the percentage of the theoretical peak we achieve. So if our workload runs at ~530 FLOPs per clock, we are about 69% efficient against the theoretical 768. I've seen very carefully tuned workloads exceed 90% efficiency, but it definitely can take some work.
The ideal speedup you can get is the total number of processing elements, which in your case corresponds to 48 * (number of processing elements per compute unit). I do not know of a way to get the number of processing elements through OpenCL (that does not mean it is not possible), but you can just google it for your GPU.
To my knowledge, a compute unit consists of one or multiple processing elements (for GPUs, usually a lot), a register file, and some local memory. The threads of a compute unit are executed in SIMD (single instruction, multiple data) fashion, which means they all execute the same operation, but on different data.
Also, the speedup you get depends on how you execute the kernel. Since a single work-group cannot run on multiple compute units, you need a sufficient number of work-groups to fully utilize all of the compute units. In addition, the work-group size should be a multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE.
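For completeness, the compute-unit count itself can be queried at runtime (a minimal sketch; it grabs the first GPU of the first platform and omits error handling):

#include <stdio.h>
#include <CL/cl.h>   /* <OpenCL/opencl.h> on macOS */

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    cl_uint units = 0;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(units), &units, NULL);
    printf("compute units: %u\n", units);  /* 48 on the Mac in question */
    return 0;
}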
I was trying to find the best work-group size for a problem, and I came across something I couldn't justify to myself.
These are my results:
GlobalWorkSize {6400 6400 1}, WorkGroupSize {64 4 1}, Time(Milliseconds) = 44.18
GlobalWorkSize {6400 6400 1}, WorkGroupSize {4 64 1}, Time(Milliseconds) = 24.39
Swapping the axes made execution twice as fast. Why?!
By the way, I was using an AMD GPU.
Thanks :-)
EDIT:
This is the kernel (a simple matrix transposition):
__kernel void transpose(__global float *input, __global float *output,
                        const int size)
{
    int i = get_global_id(0);
    int j = get_global_id(1);
    output[i*size + j] = input[j*size + i];
}
I agree with @Thomas: it most probably depends on your kernel. Most probably, in the second case you access memory in a coalesced way and/or make full use of each memory transaction.
Coalescence: when threads need to access elements in memory, the hardware tries to serve them in as few transactions as possible, i.e. if thread 0 and thread 1 have to access contiguous elements, there will be only one transaction.
Full use of a memory transaction: say you have a GPU that fetches 32 bytes in one transaction. If 4 threads each need to fetch one int, you are using only half of the data fetched by the transaction; you waste the rest (assuming an int is 4 bytes).
To illustrate this, say you have an n-by-n matrix to access. The matrix is row-major, and you use n threads organized in one dimension. You have two possibilities:
Each work item takes care of one column, looping through the column's elements one at a time.
Each work item takes care of one row, looping through the row's elements one at a time.
It might be counter-intuitive, but the first solution allows coalesced accesses while the second does not. The reason is that when the first work item needs the first element of the first column, the second work item needs the first element of the second column, and so on. These elements are contiguous in memory. This is not the case with the second solution.
Now if you take the same example and apply solution 1, but this time with 4 work items instead of n (on the same GPU I just described), you'll most probably double the running time, since you'll waste half of every memory transaction.
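To make the two solutions concrete, here is a minimal sketch (hypothetical kernels; the sum is just a stand-in computation over a row-major n-by-n matrix m, one work item per column or per row):

__kernel void per_column(__global const float *m, __global float *r, const int n)
{
    int col = get_global_id(0);
    float sum = 0.0f;
    for (int row = 0; row < n; row++)
        sum += m[row * n + col];  // neighbouring work items read contiguous addresses: coalesced
    r[col] = sum;
}

__kernel void per_row(__global const float *m, __global float *r, const int n)
{
    int row = get_global_id(0);
    float sum = 0.0f;
    for (int col = 0; col < n; col++)
        sum += m[row * n + col];  // neighbouring work items are n floats apart: not coalesced
    r[row] = sum;
}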
EDIT: Now that you've posted your kernel, I see I forgot to mention something else.
With your kernel, choosing a local size of (1, 256) or (256, 1) seems to always be a bad choice. In the first case, 256 transactions are needed to read a column from input (each fetching 32 bytes, of which only 4 are used, keeping the same GPU as in my previous examples), while 32 transactions are needed to write to output: you can write 8 floats in one transaction, hence 32 transactions to write the 256 elements.
The problem is the same with a workgroup size of (256, 1), but this time 32 transactions are used to read and 256 to write.
So why does the first size work better? Because there is a cache system that can mitigate the bad accesses on the read side. Therefore the (1, 256) size is good for the write part, and the cache system handles the not-so-good read part, decreasing the number of read transactions actually needed.
Note that the number of transactions decreases overall (taking into consideration all the workgroups in the NDRange). For example, the first workgroup issues 256 transactions to read the first 256 elements of the first column. The second workgroup may simply go to the cache to retrieve the elements of the second column, because they were already fetched by the (32-byte) transactions issued by the first workgroup.
Now, I'm almost sure you can do better than (1, 256): try (8, 32).
I am wondering how to choose optimal local and global work sizes for different devices in OpenCL.
Is there any universal rule for AMD, NVIDIA, and Intel GPUs?
Should I analyze the physical build of the devices (number of multiprocessors, number of streaming processors per multiprocessor, etc.)?
Does it depend on the algorithm/implementation? Because I saw that some libraries (like ViennaCL), to assess the correct values, just test many combinations of local/global work sizes and choose the best one.
NVIDIA recommends that your (local) workgroup size be a multiple of 32 (equal to one warp, their atomic unit of execution, meaning that 32 threads/work items are scheduled atomically together). AMD, on the other hand, recommends a multiple of 64 (equal to one wavefront). I'm unsure about Intel, but you can find this type of information in their documentation.
So if you are doing some computation and, say, you have 2300 work items (the global size): 2300 is divisible by neither 64 nor 32. If you don't specify the local size, OpenCL may choose a bad local size for you. When your local size is not a multiple of the atomic unit of execution, you get idle threads, which leads to bad device utilization. Thus it can be beneficial to add some "dummy" threads so that the global size becomes a multiple of 32/64, and then use a local size of 32/64 (the global size has to be divisible by the local size). For 2300 you can add 4 dummy threads/work items, because 2304 is divisible by 32. In the actual kernel, you can write something like:
int globalID = get_global_id(0);
if (globalID >= realNumberOfThreads)
    globalID = 0;
This will make the four extra threads do the same as thread 0 (it is often faster to do some extra work than to have many idle threads).
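On the host side, the padding can be computed like this (a sketch; realNumberOfThreads mirrors the name used in the kernel):

size_t localSize = 32;                 /* warp size on NVIDIA */
size_t realNumberOfThreads = 2300;
/* round up to the next multiple of localSize: 2300 -> 2304 */
size_t globalSize = ((realNumberOfThreads + localSize - 1) / localSize) * localSize;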
Hope that answered your question. GL HF!
If your processing essentially uses little memory (e.g. for kernel-private state), you can choose the most intuitive global size for your problem and let OpenCL choose the local size for you.
See my answer here : https://stackoverflow.com/a/13762847/145757
If memory management is a central part of your algorithm and will have a great impact on performance, you should indeed go a little further: first check the maximum local size (which depends on the local/private memory usage of your kernel) using clGetKernelWorkGroupInfo, and let that in turn decide your global size.
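That query looks like this (a sketch; kernel and device are assumed to exist, and error handling is omitted):

size_t maxLocal = 0;
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(maxLocal), &maxLocal, NULL);
/* any local size whose dimensions multiply to <= maxLocal is legal
   for this kernel on this device */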
I'm taking my first steps in OpenCL (and CUDA) for my internship. All nice and well: I now have working OpenCL code, but the computation times are way too high, I think. My guess is that I'm doing too much I/O, but I don't know where that could be.
The code is here for the main: http://pastebin.com/i4A6kPfn, and for the kernel: http://pastebin.com/Wefrqifh I start measuring time after segmentPunten(segmentArray, begin, eind); has returned, and I stop measuring time after the last clEnqueueReadBuffer.
Computation time on an NVIDIA GT440 is 38.6 seconds, on a GT555M 35.5, on an Athlon II X4 5.6 seconds, and on an Intel P8600 6 seconds.
Can someone explain this to me? Why are the computation times so high, and what solutions are there?
What it is supposed to do (short version): calculate how much noise load is made by an airplane passing by.
Long version: there are several Observation Points (OPs), which are the points where sound from a passing airplane is measured. The flight path is segmented into 10,000 segments; this is done in the function segmentPunten. The double for loop in the main gives the OPs their coordinates. There are two kernels. The first one calculates the distance from a single OP to a single segment; this is saved in the array "afstanden". The second kernel calculates the sound load in an OP from all the segments.
Just eyeballing your kernel, I see this:
kernel void SEL(global const float *afstanden, global double *totaalSEL,
                const int aantalSegmenten)
{
    // ...
    for (i = 0; i < aantalSegmenten; i++) {
        double distance = afstanden[threadID * aantalSegmenten + i];
        // ...
    }
    // ...
}
It looks like aantalSegmenten is being set to 1000. You have a loop in each kernel that accesses global memory 1000 times. Without crawling through the code, I'm guessing that many of these accesses overlap when considering your computation as a whole. Is this the case? Will two work items access the same global memory? If so, you will see a potentially huge win on the GPU from rewriting your algorithm to partition the work such that each piece of global memory is read only once and saved in local memory. After that, each work item in the work group that needs that location can read it quickly.
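A sketch of that pattern, under the assumption that the work items of a group really do read the same stretch of afstanden (names are hypothetical, and the summation is just a stand-in for the real SEL computation):

#define TILE 256

kernel void SEL_tiled(global const float *afstanden, global double *totaalSEL,
                      const int aantalSegmenten)
{
    local float tile[TILE];
    int lid = get_local_id(0);
    int lsz = get_local_size(0);
    double sum = 0.0;

    for (int base = 0; base < aantalSegmenten; base += TILE) {
        /* the group cooperatively loads one tile: each global element
           is read exactly once per group */
        for (int k = lid; k < TILE && base + k < aantalSegmenten; k += lsz)
            tile[k] = afstanden[base + k];
        barrier(CLK_LOCAL_MEM_FENCE);

        for (int k = 0; k < TILE && base + k < aantalSegmenten; k++)
            sum += tile[k];   /* fast local-memory reads */
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    totaalSEL[get_global_id(0)] = sum;
}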
As an aside, the CL specification allows you to omit the leading __ from CL keywords like global and kernel. I don't think many newcomers to CL realize that.
Before optimizing further, you should first get an understanding of what is taking all that time. Is it the kernel compiles, the data transfer, or the actual kernel execution?
As mentioned above, you can get rid of the kernel compiles by caching the results. I believe some OpenCL implementations (the Apple one, at least) already do this automatically. With others, you may need to do the caching manually. Here are instructions for the caching.
If the performance bottleneck is the kernel itself, you can probably get a major speed-up by organizing the 'afstanden' array lookups differently. Currently, when a block of threads performs a read from memory, the addresses are spread out through memory, which is a real killer for GPU performance. Ideally you'd want to index the array with something like afstanden[ndx*NUM_THREADS + threadID], which makes a work group load a contiguous block of memory. This is much faster than the current, essentially random, memory lookups.
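In code, the inner loop would then become something like this (a sketch; NUM_THREADS stands for the total number of work items, and the first kernel must of course write afstanden in this transposed layout too):

for (int ndx = 0; ndx < aantalSegmenten; ndx++) {
    /* work items of a group now read contiguous addresses each iteration */
    double distance = afstanden[ndx * NUM_THREADS + threadID];
    /* ... */
}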
First of all, you are measuring not the computation time but the whole kernel read-in/compile/execute mumbo-jumbo. To make a fair comparison, measure the computation time from the first "non-static" part of your program (for example, from the first clSetKernelArg to the last clEnqueueReadBuffer).
If the execution time is still too high, you can use a profiler (such as Visual Profiler from NVIDIA) and read the OpenCL Best Practices guide included in the CUDA Toolkit documentation.
As to the raw kernel execution time: consider (and measure) whether you really need double precision for your calculation, because double-precision calculations are artificially slowed down on consumer-grade NVIDIA cards.