I have a question about work-group scheduling on multiple CUs in Intel FPGA. As work-groups are assigned to available CUs, when is a CU considered available? Is it when the last work-item of the previous work-group has exited the pipeline, or when it has reached the second stage (so every stage of the pipeline except the first one is still occupied by the previous work-group)? This decision is made by the hardware scheduler, but I haven't found any public documentation explaining any of this.
Edit: I'm talking about the automatic process that takes place when using the Intel FPGA SDK for OpenCL, not a custom design written in HDL.
Apart from work-items, work-groups are also pipelined in each CU. Hence, you have multiple work-groups in flight in the same CU at the same time, to maximize pipeline efficiency. If your design uses local-memory-based buffers, when you check the HTML report you will see that the compiler is further replicating those buffers to support a specific number of "Simultaneous Work-groups". The number of such work-groups seems to be equal to the pipeline depth per CU divided by the work-group size. In the end, all work-items from all work-groups running simultaneously in the same CU are pipelined one after another, with the work-items of each work-group using their own copy of the local buffers and barrier synchronization being enforced per work-group. A new work-group will be scheduled on each CU, probably after one of the simultaneously running work-groups has completely left the CU.
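To make this concrete, here is a minimal sketch (kernel name, buffer names and the work-group size of 64 are my own assumptions) of the kind of NDRange kernel this applies to; the __local buffer below is what the compiler replicates per "Simultaneous Work-group", and the barrier is what gets enforced per work-group:

    // Sketch only: the __local buffer is private to a work-group, so the
    // Intel FPGA compiler replicates it once per simultaneous work-group
    // reported in the HTML report.
    #define WG_SIZE 64

    __attribute__((reqd_work_group_size(WG_SIZE, 1, 1)))
    __kernel void scale_block(__global const float *in,
                              __global float *out,
                              const float factor)
    {
        __local float tile[WG_SIZE];       // one copy per in-flight work-group

        size_t gid = get_global_id(0);
        size_t lid = get_local_id(0);

        tile[lid] = in[gid];
        barrier(CLK_LOCAL_MEM_FENCE);      // synchronization per work-group

        // Read a different work-item's element to force real sharing.
        out[gid] = tile[(WG_SIZE - 1) - lid] * factor;
    }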
P.S. It is probably easier to get help related to the Intel FPGA SDK for OpenCL on their own forum.
P.P.S. The details of Intel's OpenCL work-item scheduler are not documented anywhere.
Related
I am using the ROCm software stack to compile and run OpenCL programs on a Polaris20 (4th-generation GCN) AMD GPU, and I am wondering if there is a way to find out which compute unit (by ID) on the GPU is currently in use by a given work-item or wavefront.
In other words, can I associate a computation in a kernel with a specific compute unit or specific piece of hardware on the GPU, so I can keep track of which part of the hardware is being utilized while a kernel runs?
Thank you!
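There is no portable OpenCL query for this, but on GCN hardware one approach that is sometimes used is to read the HW_ID hardware register from inside the kernel with inline assembly. The bit layout (CU_ID in bits 11:8, SE_ID in bits 14:13 on GCN3/GCN4) is taken from AMD's ISA documentation; the snippet below is an unverified sketch of that idea under ROCm's clang-based compiler, not a supported API:

    // Sketch only: read the GCN HW_ID register via inline assembly (ROCm/clang).
    // Bit fields per the GCN3/GCN4 ISA manuals: CU_ID = bits 11:8, SE_ID = bits 14:13.
    // Not portable, and not guaranteed to survive compiler or driver changes.
    __kernel void where_am_i(__global uint *ids)
    {
        uint hw_id;
        __asm__ __volatile__("s_getreg_b32 %0, hwreg(HW_REG_HW_ID)"
                             : "=s"(hw_id));

        uint cu_id = (hw_id >> 8) & 0xF;    // compute unit within its shader engine
        uint se_id = (hw_id >> 13) & 0x3;   // shader engine

        ids[get_global_id(0)] = (se_id << 8) | cu_id;
    }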
I am running my OpenCL C code on our institution's GPU cluster, which has 8 nodes; each node has an Intel Xeon 8-core processor with 3 NVIDIA Tesla M2070 GPUs (24 GPUs in total). I need to find a way, from my host code, to identify which of the GPUs are already occupied and which are free, and to submit my jobs to the available GPUs. The closest answers that I could find were in:
How to programmatically discover specific GPU on platform with multiple GPUs (OpenCL 1.1)?
How to match OpenCL devices with a specific GPU given PCI vendor, device and bus IDs in a multi-GPU system?
Can anyone help me out with how to choose a node and a GPU that is free for computation? I am writing in OpenCL C.
Gerald
Unfortunately, there is no standard way to do such a thing.
If you want to squeeze the full power of the GPUs for computation and your problem is not a memory hog, I can suggest using two contexts per device: while the kernels in the first context finish their computation, the kernels in the second context are still working, and you have time to fill the buffers with data and start the next task in the first context, and vice versa. In my case (AMD GPU, OpenCL 1.2) it saves from 0 to 20% of computation time. Three contexts sometimes give slower execution and sometimes faster, so I do not recommend this as a standard technique, but you can try it. Four or more contexts are useless, in my experience.
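A rough host-side sketch of that ping-pong scheme (error handling omitted; device, kernel[], bytes, global_size, num_jobs, host_in/host_out and the fill_input/consume_output helpers are all assumed to exist):

    /* Two contexts on the same device, used alternately so that one context's
       kernel runs on the device while the other context's buffers are refilled. */
    cl_context       ctx[2];
    cl_command_queue queue[2];
    cl_mem           buf[2];

    for (int i = 0; i < 2; ++i) {
        ctx[i]   = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
        queue[i] = clCreateCommandQueue(ctx[i], device, 0, NULL);
        buf[i]   = clCreateBuffer(ctx[i], CL_MEM_READ_WRITE, bytes, NULL, NULL);
        /* build the program and create kernel[i] against ctx[i] here */
    }

    for (int job = 0; job < num_jobs; ++job) {
        int cur = job % 2;                 /* context used for this job */

        /* While the other context is still computing on the device, prepare
           and launch the next job in the current context. */
        fill_input(host_in[cur], job);
        clEnqueueWriteBuffer(queue[cur], buf[cur], CL_FALSE, 0, bytes,
                             host_in[cur], 0, NULL, NULL);
        clSetKernelArg(kernel[cur], 0, sizeof(cl_mem), &buf[cur]);
        clEnqueueNDRangeKernel(queue[cur], kernel[cur], 1, NULL,
                               &global_size, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(queue[cur], buf[cur], CL_FALSE, 0, bytes,
                            host_out[cur], 0, NULL, NULL);
        clFlush(queue[cur]);               /* push the work to the device */

        if (job > 0) {
            int prev = 1 - cur;            /* context that ran the previous job */
            clFinish(queue[prev]);
            consume_output(host_out[prev], job - 1);
        }
    }
    clFinish(queue[0]);
    clFinish(queue[1]);
    consume_output(host_out[(num_jobs - 1) % 2], num_jobs - 1);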
Have a command queue for each device, then use OpenCL events with each kernel submission, and check their state before submitting a new kernel for execution. Whichever command queue has the fewest unfinished kernels is the one you should enqueue to.
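A hedged sketch of that bookkeeping (the events[d] arrays, holding one event per kernel submitted to device d, are assumed to be maintained by the caller):

    #include <limits.h>
    #include <CL/cl.h>

    /* Return the index of the device whose queue has the fewest kernels that
       have not yet reached CL_COMPLETE; enqueue the next kernel there. */
    int pick_least_loaded(int num_devices, cl_event **events, int *event_count)
    {
        int best = 0, best_pending = INT_MAX;

        for (int d = 0; d < num_devices; ++d) {
            int pending = 0;
            for (int e = 0; e < event_count[d]; ++e) {
                cl_int status;
                clGetEventInfo(events[d][e], CL_EVENT_COMMAND_EXECUTION_STATUS,
                               sizeof(status), &status, NULL);
                if (status != CL_COMPLETE)   /* still queued, submitted or running */
                    ++pending;
            }
            if (pending < best_pending) {
                best_pending = pending;
                best = d;
            }
        }
        return best;
    }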
My understanding of the difference between CPUs and GPUs is that GPUs are not general-purpose processors: if a video card contains 10 GPUs, each GPU actually shares the same program pointer, and to optimize parallelism on the GPU I need to ensure each GPU is actually running the same code.
Synchronisation is not a problem on the same card since each GPU is physically running in parallel so they should all complete at the same time.
My question is, how does this work with multiple cards? At the speed at which they operate, doesn't the hardware make a slight difference in execution times, such that a calculation on one GPU on one card may finish quicker or slower than the same calculation on another GPU on another card?
thanks
Synchronisation is not a problem on the same card since each GPU is physically running in parallel so they should all complete at the same time.
This is not true. Different threads on a GPU may complete at different times due to differences in memory access latency, for example. That is why there are synchronization primitives in OpenCL such as the barrier command. You can never assume that your threads are running precisely in parallel.
The same is true for multiple GPUs. There is no guarantee that they are in sync, so you will need to rely on API calls such as clFinish to explicitly synchronize their work.
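For example (a sketch, assuming one in-order command queue per GPU has already been created in queues):

    /* Launch the same kernel on every GPU, then block until each one is done. */
    for (cl_uint i = 0; i < num_gpus; ++i) {
        clEnqueueNDRangeKernel(queues[i], kernel, 1, NULL,
                               &global_size, NULL, 0, NULL, NULL);
    }
    for (cl_uint i = 0; i < num_gpus; ++i) {
        clFinish(queues[i]);   /* returns only when that GPU has finished */
    }
    /* Only now is it safe to assume all GPUs are in sync with the host. */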
I think you may be confused about how threads work on a GPU. First, to address the issue of multiple GPUs: multiple GPUs NEVER share the program pointer, so they will almost never complete a kernel at the same time.
On a single GPU, only threads that are executing ON THE SAME COMPUTE UNIT (or SM in NVIDIA parlance) AND are part of the same warp/wavefront are guaranteed to execute in sync.
You can never really count on this, but for some devices the compiler can determine that this will be the case (I am specifically thinking about some AMD devices, as long as the work-group size is hardcoded to 64).
In any case, as @vocaro pointed out, that's why you need to use a barrier for local memory.
To emphasize, even on the same GPU, threads are not executing in parallel across the whole device - only within each compute unit.
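For the AMD case mentioned above, hard-coding the work-group size to one 64-wide wavefront looks roughly like this (kernel and buffer names are made up); the barrier is still required by the OpenCL model, even if the compiler can prove the whole group runs in lockstep and turns it into a no-op:

    // Sketch: work-group size fixed to one 64-wide AMD wavefront.
    __attribute__((reqd_work_group_size(64, 1, 1)))
    __kernel void neighbour_sum(__global const float *in, __global float *out)
    {
        __local float lds[64];
        size_t lid = get_local_id(0);
        size_t gid = get_global_id(0);

        lds[lid] = in[gid];
        barrier(CLK_LOCAL_MEM_FENCE);   // required for correctness regardless

        out[gid] = lds[lid] + lds[(lid + 1) % 64];
    }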
I am new to OpenCL and I am writing an RSA factoring application. Ideally the application should work across both NVIDIA and AMD GPU targets, but I am not finding an easy way to determine the total number of cores/stream processors on each GPU.
Is there an easy way to determine how many total cores/stream processors there are on any hardware platform, and then spawn a factoring thread on each available core? The target RSA modulus would be in shared memory, with each factoring thread running a Rho factoring attack against the modulus.
Also, any idea if OpenCL supports multi-precision math libraries similar to GNU MP, to store large semiprime numbers?
Thanks in advance
On the GPU, you don't spawn one thread for each core like you would on a CPU. Instead, you want to start many more threads than there are cores. I wouldn't worry about the exact number of cores available on a given target platform. Instead, focus on what fits best for your problem.
To add to Roger's answer, the reason why you would want to have many more threads than cores is because GPUs implement very efficient context switching to hide memory latency. Generally, each memory access is a very expensive operation in terms of the amount of time it takes for a processor to receive the requested data. But if a thread is waiting on a memory transaction, it can be "paused" and another thread can be activated to do computations (or other memory accesses) in the meantime. So if you have enough threads, you can essentially hide the memory access latency, and your software can run at the full computational capacity of the hardware (which would rarely happen otherwise).
I would have put this in a comment to Roger's post, but its size is beyond the limit.
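As a small (hypothetical) host-side illustration: the NDRange size comes from the problem, not from the core count, so the runtime always has plenty of threads to swap in while others wait on memory:

    /* Informational only: don't size the launch from this number. */
    cl_uint cu_count;
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(cu_count), &cu_count, NULL);

    /* e.g. one work-item per candidate; typically millions, far more than cores */
    size_t global_size = num_candidates;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &global_size, NULL,   /* runtime picks local size */
                           0, NULL, NULL);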
Is it possible to achieve the same level of parallelism with a multi-core CPU device as with multiple heterogeneous devices (like a GPU and a CPU) in OpenCL?
I have an Intel i5 and am looking to optimise my code. When I query the platform for devices, I get only one device returned: the CPU. I was wondering how I could optimise my code using this.
Also, if I used a single command queue for this device, would the application automatically assign the kernels to different compute devices or does it have to be done manually by the programmer?
Can a cpu device achieve the same level of parallelism as a gpu? Pretty much always no.
The number of compute units in a gpu is almost always more than in a cpu. For example, $50 can get you a video card with 10 compute units (Radeon 6450). The cheapest 8-core cpus on newegg are going for $189 (desktop cpu) and $269 (server).
The compute units of a cpu will run faster due to clock speed, and execute branching code much better than a gpu. You want a cpu if your workload has a lot of conditional statements.
A gpu will execute the same instructions on many pieces of data. The 6450 gpu has 16 'stream processors' per compute unit to make this happen. Gpus are great when you have to do the same (small/medium) tasks many times. Matrix multiplication, n-body computations, reduction operations, and some sorting algorithms run much better on gpu/accelerator hardware than on a cpu.
I answered a similar question with more detail a few weeks ago. (This one)
Getting back to your question about the "same level of parallelism" -- cpus don't have the same level of parallelism as gpus, except in cases where the gpu underperforms on the execution of the actual kernel.
On your i5 system, there would be only one cpu device. This represents the entire cpu. When you query for the number of compute units, opencl will return the number of cores you have. If you want to use all cores, you just run the kernel on your device, and opencl will use all of the compute units (cores) for you.
Short answer: yes, it will run in parallel and no, no need to do it manually.
Long answer:
Also, if I used a single command queue for this device, would the application automatically assign the kernels to different compute devices [...]
Either you need to revise your OpenCL vocabulary or I didn't understand your question. You only have one device and core != device!
One CPU, regardless of how many cores it has, is one device. The same goes for a GPU: one GPU, which has hundreds of cores, is only one device. You send jobs to the device through the queue and the device's driver. Your jobs can (and will) be split up into work-items. Then, some work-items (how many depends on the device/driver) are executed in parallel. On the GPU as well as on the CPU, each work-item executes one instance of the kernel. (This might not be completely true, but it is a very helpful abstraction.)
If you enqueue several kernels in one queue (without connecting them through a wait event!), the driver may or may not run them in parallel.
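A sketch of what that looks like: two independent kernels enqueued with no wait events between them; with an out-of-order queue (where the device supports it) the driver is allowed, but not required, to overlap them. Variable names here are assumptions.

    cl_int err;
    cl_command_queue q = clCreateCommandQueue(
        ctx, device, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);

    /* No wait events connect these two kernels, so the driver may overlap them. */
    clEnqueueNDRangeKernel(q, kernel_a, 1, NULL, &global_a, NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(q, kernel_b, 1, NULL, &global_b, NULL, 0, NULL, NULL);

    clFinish(q);   /* wait for both, however the driver chose to schedule them */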
It is the very goal of OpenCL to allow you to compute work-items in parallel, regardless of whether it is using several devices' cores in parallel or only a single device's cores.
If this confuses you, watch these really good (and long) videos: http://macresearch.org/opencl
How are you determining the OpenCL device count? I have an Intel i3 laptop that gives me 2 OpenCL compute units; it has 2 cores.
According to Intel's spec, an i5-2300 has 4 cores and supports 4 threads; it isn't hyper-threaded. I would expect an OpenCL query for the number of compute units to give you a count of 4.
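To make the device-vs-compute-unit distinction concrete, here is a sketch of the two queries that tend to get conflated (a platform handle is assumed): clGetDeviceIDs counts devices, one per CPU or GPU, while CL_DEVICE_MAX_COMPUTE_UNITS counts the cores inside one device.

    /* One i5-2300 should show up as ONE CPU device reporting 4 compute units. */
    cl_uint num_devices;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 0, NULL, &num_devices);

    cl_device_id dev;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &dev, NULL);

    cl_uint compute_units;
    clGetDeviceInfo(dev, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(compute_units), &compute_units, NULL);

    printf("CPU devices: %u, compute units on the first: %u\n",
           num_devices, compute_units);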