AMD OpenCL asynchronous execution efficiency

For example, I have three tasks A, B, and C. B and C both depend on A, and there are sufficient CUs to run B and C at the same time. I enqueue A and C on queue0, and B on queue1. There is a huge delay after A finishes and before B starts, which makes the whole job take longer than using only one queue.
Is this normal? Or could I have done something wrong?
I will write sample code if required; the original code is heavily encapsulated. But essentially I just create an event when enqueuing A and pass it to the enqueue of B, and both queues are normal in-order queues. Nothing about it seems special.
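For reference, a minimal host-side sketch of the setup described above, assuming an OpenCL 1.2 context; the kernel handles, work size, and helper name are hypothetical and error checking is omitted:

#include <CL/cl.h>

/* Sketch of the A/B/C dependency chain described above (hypothetical kernel
 * handles, error checking omitted). Two in-order queues on one device:
 * C follows A implicitly on queue0, B waits on A's event from queue1. */
static void run_abc(cl_context ctx, cl_device_id dev,
                    cl_kernel kernelA, cl_kernel kernelB, cl_kernel kernelC,
                    size_t global)
{
    cl_int err;
    cl_command_queue queue0 = clCreateCommandQueue(ctx, dev, 0, &err);
    cl_command_queue queue1 = clCreateCommandQueue(ctx, dev, 0, &err);

    cl_event evA;
    clEnqueueNDRangeKernel(queue0, kernelA, 1, NULL, &global, NULL, 0, NULL, &evA);
    clEnqueueNDRangeKernel(queue0, kernelC, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(queue1, kernelB, 1, NULL, &global, NULL, 1, &evA, NULL);

    clFlush(queue0);              /* make sure A is submitted before queue1 blocks on its event */
    clFinish(queue1);
    clFinish(queue0);

    clReleaseEvent(evA);
    clReleaseCommandQueue(queue0);
    clReleaseCommandQueue(queue1);
}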

I couldn't find published info about these latencies, but to call something "normal" we need a statistically derived latency baseline for all platforms. Here is mine:
HD7870 and R7-240 show the same behaviour. Windows 10. Dual-channel RAM. OpenCL 1.2 (64-bit build). CodeXL profiling. All in-order queues. Some old drivers from before Crimson.
eventless single queue with non-blocking commands: fluctuates between a few microseconds and 200 microseconds, but the average should be low, around 50 microseconds, depending on drivers; for some kernels it reaches 500 microseconds, maybe because of too many parameters and similar preparation work.
event source = single queue A, event target = queue B: 100-150 microseconds up to half a millisecond (seemed constant)
event source = a list built from N-1 queues, event target = queue N: not the sum of all the queues' latencies; there is some kind of latency hiding, so it stays under 2 milliseconds (rarely peaking at 3-5 milliseconds)
event source = queue, waiting with clWaitForEvents from the host: about a millisecond
event source = queue, waiting with clGetEventInfo from the host in a while loop: nearly half a millisecond, sometimes even less
clFinish on a single queue: this has the highest latency per queue, at least 1 ms.
user events: these generated errors in CodeXL so I couldn't measure them, but that was with an older driver and an older CodeXL version.
There were background processes (Avira, Google Chrome, ...) that are advanced enough to use the GPU for their own purposes and may interfere with kernel execution.
My solution was pipelining through many independent queues to hide their event latencies, and it worked like a charm; a rough sketch is below. The R7-240 ran fine with 16 queues. It has only 2 ACE units, so newer cards with 4-8 of them should handle even more queues.
What I didn't try, and still wonder about, is the performance of queue N waiting on M other queues through an event list. Maybe a tree-like waiting structure would be better for many queues if they lag too much.
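A rough sketch of that multi-queue pipelining, assuming an OpenCL 1.2 host; the kernel handle, sizes, and helper name are placeholders and error checking is omitted. Independent jobs are spread round-robin over several in-order queues so per-event and per-queue latencies overlap instead of serializing:

#include <CL/cl.h>

#define NUM_QUEUES 16

/* Hypothetical pipelining sketch: each job is independent, so no cross-queue
 * events are needed; clFlush keeps every queue feeding the device. */
static void pipeline_jobs(cl_context ctx, cl_device_id dev,
                          cl_kernel kernel, size_t global, int num_jobs)
{
    cl_int err;
    cl_command_queue queues[NUM_QUEUES];
    for (int q = 0; q < NUM_QUEUES; ++q)
        queues[q] = clCreateCommandQueue(ctx, dev, 0, &err);

    for (int job = 0; job < num_jobs; ++job) {
        cl_command_queue q = queues[job % NUM_QUEUES];
        /* set per-job kernel arguments here with clSetKernelArg(...) */
        clEnqueueNDRangeKernel(q, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
        clFlush(q);               /* submit early so the device never starves */
    }

    for (int q = 0; q < NUM_QUEUES; ++q) {
        clFinish(queues[q]);
        clReleaseCommandQueue(queues[q]);
    }
}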

Related

Completely simultaneous execution of two instructions (RISCV)

The question comes from a RISCV implementation, but I think it may also apply to many other architectures.
From a code with two completely independent instructions in sequence (generic ISA notation):
REG1 = REG2 + REG3
REG4 = REG5 + REG6
In a pipelined implementation, assuming there are no other hazards (simultaneous r/w access to the registers is possible and there are two independent adders), is it a violation of the ISA if the two instructions are executed completely in parallel?
In other words, at the same clock edge, can the 3 registers (REG1, REG4 and PC) be updated at once (PC+8 for the RISCV-32 example)?
No, clearly there's no problem, since real CPUs do this all the time. (e.g. Intel since Haswell can run 4 independent add instructions per clock: https://www.realworldtech.com/haswell-cpu/4/ https://uops.info/ https://agner.org/optimize/).
It only has to maintain the illusion of having run instructions one at a time, following the ISA's sequential execution model. The same concept as the C "as-if" rule applies.
If the ISA doesn't guarantee anything about timing, like that you can delay N clock cycles with N nop or other instructions, nothing stops a specific implementation from doing as much work as possible in a clock cycle. (Some microcontrollers do have specific timing guarantees or specifications, so code can delay for N cycles with delay loops. Or at least specific implementations of some ISAs have such guarantees.)
It's 100% normal for modern CPUs to average more than 1 instruction per clock, despite stalling sometimes on cache misses and branch mispredicts, so that clearly means fetching, decoding, and executing multiple instructions per clock cycle in other cycles. See also Modern Microprocessors: A 90-Minute Guide! for some basics of superscalar in-order and out-of-order pipelines.
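To make that concrete, here is a small C illustration (not from the original answer): four independent add chains with no data dependence on each other, exactly the kind of code a superscalar core can execute several instructions per cycle while still appearing, architecturally, to run one instruction at a time.

/* Illustration only: the four additions in each iteration depend only on
 * their own partial sum, not on each other, so an out-of-order superscalar
 * core can execute them in the same cycle. Tail elements (n not a multiple
 * of 4) are ignored for brevity. */
long sum4(const long *a, long n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (long i = 0; i + 4 <= n; i += 4) {
        s0 += a[i + 0];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return s0 + s1 + s2 + s3;
}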

Why is this jump instruction so expensive when performing pointer chasing?

I have a program that performs pointer chasing and I'm trying to optimize the pointer chasing loop as much as possible.
I noticed that perf record detects that ~20% of execution time in function myFunction() is spent executing the jump instruction (used to exit out of the loop after a specific value has been read).
Some things to take note:
the pointer chasing path can comfortably fit in the L1 data cache
using __builtin_expect to avoid the cost of branch misprediction had no noticeable effect
perf record has the following output:
Samples: 153K of event 'cycles', 10000 Hz, Event count (approx.): 35559166926
myFunction /tmp/foobar [Percent: local hits]
Percent│ endbr64
...
80.09 │20: mov (%rdx,%rbx,1),%ebx
0.07 │ add $0x1,%rax
│ cmp $0xffffffff,%ebx
19.84 │ ↑ jne 20
...
I would expect that most of the cycles spent in this loop are used for reading the value from memory, which is confirmed by perf.
I would also expect the remaining cycles to be somewhat evenly spent executing the remaining instructions in the loop. Instead, perf is reporting that a large chunk of the remaining cycles are spent executing the jump.
I suspect that I can better understand these costs by understanding the micro-ops used to execute these instructions, but I'm a bit lost on where to start.
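To make the shape of the loop concrete, here is a hypothetical C reconstruction consistent with the disassembly above; the names and types are guesses, not the actual source:

#include <stdint.h>
#include <stddef.h>

/* Hypothetical reconstruction: each 32-bit element holds the byte offset of
 * the next element, and the chase stops at the sentinel 0xFFFFFFFF. */
static size_t chase(const uint8_t *base, uint32_t start)
{
    size_t steps = 0;
    uint32_t idx = start;
    while (idx != 0xFFFFFFFFu) {
        /* load the next index: this is the mov (%rdx,%rbx,1),%ebx */
        idx = *(const uint32_t *)(base + idx);
        steps++;                  /* add $0x1,%rax */
    }
    return steps;
}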
Remember that the cycles event has to pick an instruction to blame, even if both the mov-load and the macro-fused cmp-and-branch uops are waiting for the result. It's not a matter of one or the other "costing cycles" while it's running; they're both waiting in parallel. (Modern Microprocessors: A 90-Minute Guide! and https://agner.org/optimize/)
But when the "cycles" event counter overflows, it has to pick one specific instruction to "blame", since you're using statistical-sampling. This is where an inaccurate picture of reality has to be invented by a CPU that has hundreds of uops in flight. Often it's the one waiting for a slow input that gets blamed, I think because it's often the oldest in the ROB or RS and blocking allocation of new uops by the front-end.
The details of exactly which instruction gets picked might tell us something about the internals of the CPU, but only very indirectly. Like perhaps something to do with how it retires groups of 4(?) uops, and this loop has 3, so which uop is oldest when the perf event exception is taken.
The 4:1 split is probably significant for some reason, perhaps because 4+1 = 5 cycle latency of a load with a non-simple addressing mode. (I assume this is an Intel Sandybridge-family CPU, perhaps Skylake-derived?) Like maybe if data arrives from cache on the same cycle as the perf event overflows (and chooses to sample), the mov doesn't get the blame because it can actually execute and get out of the way?
IIRC, BeeOnRope or someone else found experimentally that Skylake CPUs would tend to let the oldest un-retired instruction retire after an exception arrives, at least if it's not a cache miss. In your case, that would be the cmp/jne at the bottom of the loop, which in program order appears before the load at the top of the next iteration.

OpenCL Parallel Dispatch

I am using beta support for OpenCL 2.0 on NVIDIA and targeting high-end GPUs like the 1080 Ti. In my compute pipeline, I sometimes need to dispatch work to independently process relatively small images. In theory, these images should be able to be processed in parallel on a single GPU, because the work groups for a single image won't saturate all the compute units of the GPU.
Is this possible in OpenCL? Does this have a name in OpenCL?
If it is possible, is using multiple queues for a single device the only way to do this? Or will the driver look at the "waitEventList" and decide which kernels can be processed in parallel?
Do I need CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE?
1 - Yes, this is one way to achieve high occupancy of the compute units. The general name for it is "pipelining" (with the help of asynchronous enqueueing and/or dynamic parallelism). There are different ways to do it: one is doing reads on one queue, writes on another queue, and compute on a third queue, with the three queues coordinated by wait events; another is having M queues, each doing a different image's read-compute-write work without any events.
2 - You can even use a single queue, but of the out-of-order type, so kernels are dispatched independently. At least on some AMD cards, even an in-order queue can execute independent kernels concurrently (according to AMD's CodeXL), although this may be outside the OpenCL spec. Wait events can be a constraint that stops this kind of driver-side optimization (again, at least on AMD).
From 2.x onwards there is device-side queueing, so you can enqueue one kernel from the host and that kernel can enqueue N kernels without host intervention (if all data is already uploaded to the card). This may not hide latency as well as multiple host-side queues (if data is needed from host to device).
3 - Out-of-order execution is not mandated for vendors, so this may not work everywhere.
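As an illustration of point 2, a host-side sketch of an out-of-order queue on an OpenCL 2.0 platform; the kernel handles, sizes, and helper name are hypothetical and error checking is omitted:

#include <CL/cl.h>

/* One out-of-order queue; kernels with no events between them are free to
 * run concurrently if the device has idle compute units. Each image kernel
 * uses a 2D global work size stored as a pair in globals[2*i], globals[2*i+1]. */
static void dispatch_images(cl_context ctx, cl_device_id dev,
                            cl_kernel *imageKernels, size_t *globals, int numImages)
{
    cl_int err;
    cl_queue_properties props[] = {
        CL_QUEUE_PROPERTIES, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, 0
    };
    cl_command_queue q = clCreateCommandQueueWithProperties(ctx, dev, props, &err);

    for (int i = 0; i < numImages; ++i) {
        /* no wait list and no shared event: the driver may overlap these */
        clEnqueueNDRangeKernel(q, imageKernels[i], 2, NULL, &globals[2 * i], NULL,
                               0, NULL, NULL);
    }
    clFinish(q);                  /* single sync point once all images are done */
    clReleaseCommandQueue(q);
}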

OpenCL enqueued kernels using lots of host memory

I am executing Monte Carlo sweeps on a population of replicas of my system using OpenCL kernels. After the initial debugging phase I increased some of the arguments to more realistic values and noticed that the program suddenly eats up large amounts of host memory. I am executing 1000 sweeps on about 4000 replicas, and each sweep consists of 2 kernel invocations. That results in about 8 million kernel invocations.
The source of the memory usage was easy to find (see screenshot).
While the kernel executions are being enqueued, the memory usage goes up.
While the kernels are executing the memory usage stays constant.
Once the kernels finish up the usage goes down to its original state.
I did not allocate any memory, as can be seen in the memory snapshots.
That means the OpenCL driver is using the memory. I understand that it must keep a copy of all the arguments to the kernel invocations and also the global and local workgroup size, but that does not add up.
The peak memory usage was 4.5GB. Before enqueuing the kernels about 250MB were used. That means OpenCL used about 4.25GB for 8 million invocations, i.e. about half a kilobyte per invocation.
So my questions are:
Is that kind of memory usage normal and to be expected?
Are there good/known techniques to reduce memory usage?
Maybe I should not enqueue so many kernels simultaneously, but how would I do that without causing synchronization, e.g. with clFinish()?
Enqueueing a large number of kernel invocations needs to be done in a somewhat controlled manner so that the command queue does not eat too much memory. First, clFlush may help to some degree; then clWaitForEvents is needed to create a synchronization point in the middle, such that, for example, 2000 kernel invocations are enqueued and clWaitForEvents waits for the 1000th one. The device is not going to pause, because another 1000 invocations of work are already pre-batched. Then a similar thing needs to be repeated again and again. It could be illustrated this way:
enqueue 999 kernel commands
while (invocations < 8000000)
{
    enqueue 1 kernel command with an event
    enqueue 999 kernel commands
    wait for the event
    invocations += 1000
}
The optimal number of kernel invocations to enqueue before waiting may be different from what is shown here, so it needs to be tuned for the given scenario.
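For illustration, one possible concrete form of this pattern in OpenCL host code; the kernel, work size, and helper name are hypothetical and error checking is omitted:

#include <CL/cl.h>

/* Keep roughly one extra batch of invocations queued ahead of the event
 * we wait on, so the device never runs dry while host memory for the
 * already-completed batch can be reclaimed by the driver. */
static void enqueue_batched(cl_command_queue q, cl_kernel kernel,
                            size_t global, long total)
{
    const long BATCH = 1000;
    long enqueued = 0;

    /* prime the pipeline: a first batch without any event */
    for (long i = 0; i < BATCH && enqueued < total; ++i, ++enqueued)
        clEnqueueNDRangeKernel(q, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
    clFlush(q);

    while (enqueued < total) {
        cl_event marker;
        /* one command carries an event marking the point we will wait for */
        clEnqueueNDRangeKernel(q, kernel, 1, NULL, &global, NULL, 0, NULL, &marker);
        ++enqueued;

        /* pre-batch the next chunk before waiting */
        for (long i = 0; i < BATCH - 1 && enqueued < total; ++i, ++enqueued)
            clEnqueueNDRangeKernel(q, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
        clFlush(q);

        clWaitForEvents(1, &marker);
        clReleaseEvent(marker);
    }
    clFinish(q);                  /* drain whatever remains after the last wait */
}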

Hyperthreading - really 2x cores?

According to Intel (if I'm not wrong), Hyper-Threading (HT) can:
- improve performance by up to 30%.
- make better use of the CPU when one task uses the ALU and the other is doing I/O (for example: one task runs a zip algorithm and the other writes data to disk) - in that case HT can be used.
So, if I have one core with HT and I run two simultaneous tasks that both run zip algorithms, HT will not be efficient here, because one task will have to wait for the other to finish its work on the ALU? (And in that case I need 2 cores instead of one core with HT.)
Did I understand what Intel means by HT? Is that right?
You didn't understand it right. When they talk about I/O, they mean writing to memory and reading from memory, not file I/O. When that zip algorithm reads the next input byte from RAM, that's I/O. And when it writes a decoded byte to RAM, that's I/O.
A hyperthreaded CPU usually has one unit reading instructions from memory, two units decoding and dispatching instructions, two sets of architected registers (that's the processor registers your program sees), one set of rename registers, one set of schedulers, and one set of ALUs, where a non-hyperthreaded core would have one of each, and two non-hyperthreaded cores would have two of each.
