Why is this jump instruction so expensive when performing pointer chasing?

I have a program that performs pointer chasing and I'm trying to optimize the pointer chasing loop as much as possible.
I noticed that perf record detects that ~20% of execution time in function myFunction() is spent executing the jump instruction (used to exit out of the loop after a specific value has been read).
Some things to note:
the pointer chasing path can comfortably fit in the L1 data cache
using __builtin_expect to avoid the cost of branch misprediction had no noticeable effect
perf record has the following output:
Samples: 153K of event 'cycles', 10000 Hz, Event count (approx.): 35559166926
myFunction /tmp/foobar [Percent: local hits]
Percent│ endbr64
...
80.09 │20: mov (%rdx,%rbx,1),%ebx
0.07 │ add $0x1,%rax
│ cmp $0xffffffff,%ebx
19.84 │ ↑ jne 20
...
I would expect that most of the cycles spent in this loop are used for reading the value from memory, which is confirmed by perf.
I would also expect the remaining cycles to be somewhat evenly spent executing the remaining instructions in the loop. Instead, perf is reporting that a large chunk of the remaining cycles are spent executing the jump.
I suspect that I can better understand these costs by understanding the micro-ops used to execute these instructions, but I'm a bit lost on where to start.
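For context, the loop compiles down to something equivalent to the following sketch (the function signature, names, and types here are illustrative, reconstructed from the perf output above rather than copied verbatim from the real source):

#include <stdint.h>
#include <stddef.h>

/* Illustrative reconstruction: 'table' holds 32-bit entries, each storing the
   byte offset of the next entry (matching the scale-1 addressing above); the
   loop follows the chain and counts steps until the sentinel 0xffffffff. */
size_t myFunction(const unsigned char *table, uint32_t start)
{
    uint32_t idx = start;
    size_t steps = 0;
    /* __builtin_expect hints that the loop usually keeps going; as noted
       above, it made no measurable difference here. */
    while (__builtin_expect(idx != 0xffffffffu, 1)) {
        idx = *(const uint32_t *)(table + idx);  /* the dependent load that dominates the profile */
        steps++;
    }
    return steps;
}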

Remember that the cycles event has to pick an instruction to blame, even if both the mov load and the macro-fused cmp-and-jne uop are waiting for the result of the previous load. It's not a matter of one or the other "costing cycles" while it's running; they're both waiting in parallel. (See Modern Microprocessors: A 90-Minute Guide! and https://agner.org/optimize/.)
But when the "cycles" event counter overflows, it has to pick one specific instruction to "blame", since you're using statistical sampling. This is where an inaccurate picture of reality has to be invented by a CPU that has hundreds of uops in flight. Often it's the one waiting for a slow input that gets blamed, I think because it's often the oldest in the ROB or RS and the one blocking allocation of new uops by the front-end.
The details of exactly which instruction gets picked might tell us something about the internals of the CPU, but only very indirectly. Like perhaps something to do with how it retires groups of 4(?) uops, and this loop has 3, so which uop is oldest when the perf event exception is taken.
The 4:1 split is probably significant for some reason, perhaps because 4+1 = 5 cycle latency of a load with a non-simple addressing mode. (I assume this is an Intel Sandybridge-family CPU, perhaps Skylake-derived?) Like maybe if data arrives from cache on the same cycle as the perf event overflows (and chooses to sample), the mov doesn't get the blame because it can actually execute and get out of the way?
IIRC, BeeOnRope or someone else found experimentally that Skylake CPUs would tend to let the oldest un-retired instruction retire after an exception arrives, at least if it's not a cache miss. In your case, that would be the cmp/jne at the bottom of the loop, which in program order appears before the load at the top of the next iteration.

Related

Understanding CYCLE_ACTIVITY.* Haswell Performance-Monitoring Events

I'm trying to analyse an execution on an Intel Haswell CPU (Intel® Core™ i7-4900MQ) with the Top-down Microarchitecture Analysis Method (TMAM), described in Chapters B.1 and B.4 of the Intel® 64 and IA-32 Architectures
Optimization Reference Manual. (I adjust the Sandy Bridge formulas described in B.4 to the Haswell Microarchitecture if needed.)
Therefore I measure performance counter events with perf. There are some results I don't understand:
CPU_CLK_UNHALTED.THREAD_P < CYCLE_ACTIVITY.CYCLES_LDM_PENDING
This holds only for a few measurements, but still is weird. Does the PMU count halted cycles for CYCLE_ACTIVITY.CYCLES_LDM_PENDING?
CYCLE_ACTIVITY.CYCLES_L2_PENDING > CYCLE_ACTIVITY.CYCLES_L1D_PENDING
and CYCLE_ACTIVITY.STALLS_L2_PENDING > CYCLE_ACTIVITY.STALLS_L1D_PENDING
This applies to all measurements. When there is an L1D cache miss, the load gets passed on to the L2 cache, right? So a load that missed L2 must also have missed L1 earlier. The L1 instruction cache is not counted here, but with *_L2_PENDING being 100x or even 1000x greater than *_L1D_PENDING, it is probably not that. Are the stalls/cycles being measured separately somehow? But then there is this formula:
%L2_Bound =
(CYCLE_ACTIVITY.STALLS_L1D_PENDING - CYCLE_ACTIVITY.STALLS_L2_PENDING) / CLOCKS
Hence CYCLE_ACTIVITY.STALLS_L2_PENDING < CYCLE_ACTIVITY.STALLS_L1D_PENDING is assumed (the result of the formula must be positive). (The other issue with this formula is that it should probably use CYCLES instead of STALLS; however, that wouldn't solve the problem described above.) So how can this be explained?
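For concreteness, here is how that formula would be evaluated (with made-up counter values, not my real measurements); it only gives a sensible, non-negative fraction when STALLS_L2_PENDING <= STALLS_L1D_PENDING:

#include <stdio.h>

int main(void)
{
    /* made-up example values, not measured ones */
    unsigned long long stalls_l1d_pending = 4000000ULL;
    unsigned long long stalls_l2_pending  = 1000000ULL;
    unsigned long long clocks             = 20000000ULL;

    /* %L2_Bound = (STALLS_L1D_PENDING - STALLS_L2_PENDING) / CLOCKS */
    double l2_bound = (double)(stalls_l1d_pending - stalls_l2_pending)
                      / (double)clocks;
    printf("%%L2_Bound = %.1f%%\n", l2_bound * 100.0);   /* prints 15.0% */
    return 0;
}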
edit: My OS: Ubuntu 14.04.3 LTS, kernel: 3.13.0-65-generic x86_64, perf version: 3.13.11-ckt26
I'll start with the second part of the question, i.e., how CYCLE_ACTIVITY.CYCLES_L2_PENDING and CYCLE_ACTIVITY.STALLS_L2_PENDING can be larger than CYCLE_ACTIVITY.CYCLES_L1D_PENDING and CYCLE_ACTIVITY.STALLS_L1D_PENDING, respectively.
First, note that the formula for %L2_Bound is from Section B.5 of the Intel Optimization Manual. The first paragraph of that section says:
This section covers various performance tuning techniques using
performance monitoring events. Some techniques can be adapted in
general to other microarchitectures, most of the performance events
are specific to Intel microarchitecture code name Sandy Bridge.
My first hunch was that prefetching has something to do with it (see my comment). This paragraph pushed me further in the right direction; these events may represent different things in Sandy Bridge and in Haswell. Here is what they mean on Haswell:
CYCLE_ACTIVITY.CYCLES_L1D_PENDING: Cycles with pending L1 data cache miss loads.
CYCLE_ACTIVITY.CYCLES_L2_PENDING: Cycles with pending L2 miss loads.
CYCLE_ACTIVITY.STALLS_L1D_PENDING: Execution stalls due to L1 data cache miss loads.
CYCLE_ACTIVITY.STALLS_L2_PENDING: Number of loads missed L2.
The manual also says the counters for L2 should only be used when hyperthreading is disabled. Now here is what they mean on Sandy Bridge:
CYCLE_ACTIVITY.CYCLES_L1D_PENDING: Each cycle there was a miss-pending demand load this thread, increment by 1.
CYCLE_ACTIVITY.CYCLES_L2_PENDING: Each cycle there was a MLC-miss pending demand load this thread, increment by 1.
CYCLE_ACTIVITY.STALLS_L1D_PENDING: Each cycle there was a miss-pending demand load this thread and no uops dispatched, increment by 1.
CYCLE_ACTIVITY.STALLS_L2_PENDING: Each cycle there was a MLC-miss pending demand load and no uops dispatched on this thread, increment by 1.
There are three important differences:
Some of the Haswell events are only valid when HT is disabled. All SNB events are valid even when HT is enabled.
CYCLE_ACTIVITY.STALLS_L2_PENDING on HSW counts the number of load misses at L2, but on SNB, it counts the number of cycles during which there was at least one demand load miss at L2.
The HSW events include all accesses, not just demand loads. In contrast, the SNB events only occur for demand loads.
On HSW, CYCLE_ACTIVITY.CYCLES_L2_PENDING can be larger than CYCLE_ACTIVITY.CYCLES_L1D_PENDING because of the miss-pending loads issued by the L1D prefetcher (and/or the L2 prefetcher(s) depending on whether the prefetcher increments the counter for the same level of cache). Similarly, while they count different things, CYCLE_ACTIVITY.STALLS_L2_PENDING can be larger than CYCLE_ACTIVITY.STALLS_L1D_PENDING due to prefetching. TLB prefetching and prefetching at other MMU caches may also impact these performance events on HSW. On the other hand, on SNB, it is guaranteed that CYCLE_ACTIVITY.STALLS_L2_PENDING < CYCLE_ACTIVITY.STALLS_L1D_PENDING, and that's why the %L2_Bound formula is valid on SNB.
Like I said in the comment, disabling HT and/or prefetching may "fix" your problem.
Actually, the Intel spec update document for the Mobile Haswell processors mentions two bugs that affect CYCLES_L2_PENDING:
HSM63: The intended behavior of CYCLES_L2_PENDING on Haswell is to count only for demand loads, but it may count inaccurately in SMT mode.
HSM80: CYCLES_L2_PENDING may overcount due to requests from the next page prefetcher.
I think you can minimize the error in CYCLES_L2_PENDING by disabling SMT (either in BIOS or putting the other logical core into sleep). In addition, try to not trigger the NPP. This can be achieved by avoiding locations towards the end of a virtual page where the translation of the next page is not already in the TLB hierarchy.
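Purely as an illustration of that last point (the 4 KiB page size, the 64-byte keep-away margin, and the assumption that the buffer is page-aligned are mine, not from the erratum), you could lay out the accesses so they never touch the tail of a page:

#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE  4096
#define NPP_MARGIN 64          /* stay this far away from each page's end */

/* Sums a page-aligned buffer while skipping the last NPP_MARGIN bytes of
   every page, so accesses never land near a page boundary and the next-page
   prefetcher has nothing to trigger on. */
uint64_t sum_avoiding_page_ends(const unsigned char *buf, size_t bytes)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < bytes; i++) {
        if ((i % PAGE_SIZE) < PAGE_SIZE - NPP_MARGIN)
            sum += buf[i];
    }
    return sum;
}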
Related: When L1 misses are a lot different than L2 accesses… TLB related?
Regarding the first part of the question, i.e., how CPU_CLK_UNHALTED.THREAD_P can be smaller than CYCLE_ACTIVITY.CYCLES_LDM_PENDING: one explanation I can think of is that CYCLE_ACTIVITY.CYCLES_LDM_PENDING counts loads issued from (some) other threads (in particular, on the same physical core), not just the halted thread. Erratum HSM146 mentions that CYCLES_LDM_PENDING may count inaccurately when the logical core is not in C0, which explains how CPU_CLK_UNHALTED.THREAD_P can be smaller than CYCLES_LDM_PENDING. Disabling HT may eliminate this inaccuracy, although the spec update document doesn't provide any workaround.

Instruction cycle (PIC18)

I'm trying to understand the steps it takes to go through an instruction and their relationship to the oscillator cycles. The datasheet of the PIC18F4321 seems to divide this process into 2 basic steps: fetch and execution. But it does not seem to be consistent about which step belongs to which oscillator cycle. For example, it says:
Internally, the program counter is incremented on every Q1; the
instruction is fetched from the program memory and latched into
the Instruction Register (IR) during Q4.
This sounds odd, because it doesn't mention Q2 and Q3. From this alone I would almost be led to think that fetching takes 1 oscillator cycle, since it happens in Q4. But reading just a little further, it says that:
The instruction fetch and execute are pipelined in such a manner that
a fetch takes one instruction cycle, while the decode and execute
take another instruction cycle. However, due to the pipelining, each
instruction effectively executes in one cycle.
So now it is telling me that fetching takes Q1 through Q4. Based on that, I would assume that if it were not for pipelining, instructions would take 2 instruction cycles to go through, since a full instruction cycle is taken up by fetching alone. But I understand how, in practice, pipelining makes it seem like it only takes 1 instruction cycle to go through an instruction.
Still a little bit further, and I believe this is the most confusing part, it says that:
In the execution cycle, the fetched instruction is latched into the
Instruction Register (IR) in cycle Q1. This instruction is then
decoded and executed during the Q2, Q3 and Q4 cycles. Data memory is
read during Q2 (operand read) and written during Q4
(destination write).
Based on this and other sources I have read, it seems like it divides the execution part into decoding, reading, processing and writing (it confuses me because it keeps using the word execution when I don't think it's actually referring to the execution portion of "fetch and execution").
1) Now, when does it do each? It is very clear when it says that read/write will happen in Q2/Q4. So Q3 should be processing?
2) What is the oscillator cycle for decoding?
3) Why do you have to latch the instruction to IR again in Q1 if you just did that in Q4 when you fetched for this same instruction?
disclaimer: I've never written PIC asm code, let alone done any performance analysis of a PIC. I mostly know about more powerful CPUs, like x86, from reading http://agner.org/optimize/, and stuff on http://realworldtech.com/. This answer is just based on the snippets of the manual you put in your question, because they do make sense to me. I might be completely misinterpreting something.
So in terms of the external clock, it's a 2 cycle pipeline (fetch|execute), with a quad-pumped clock in the execution core. The execution stage is subdivided into 4 pipelined stages. A bit like how Pentium4 had double-pumped execution units (i.e. one pipeline stage that uses a faster clock).
It sounds like yes, instruction execution happens in Q3.
2) What is the oscillator cycle for decoding?
I don't understand the question. It decodes one instruction per input clock, using the unmultiplied clock.
3) Why do you have to latch the instruction to IR again in Q1 if you
just did that in Q4 when you fetched for this same instruction?
It sounds like the PC is incremented in Q1, so during instruction execution it points to the next instruction. In Q4, that next instruction is done being fetched into IR in preparation for executing it next cycle. This is the instruction data itself (i.e. what PC is pointing to). I'm not sure about this part, but this makes sense.

Single-cycle vs a pipelined approach

I understand that single-cycle implementations are not very efficient. One reason is that not all instructions require the same amount of work, yet in a single-cycle implementation every instruction is given the same (longest-case) amount of time.
In a pipelined approach, throughput is increased, which means that once the pipeline is full, the time between one instruction completing and the next will be shorter than in a single-cycle implementation. But then can you say that instructions in a pipelined approach take the same amount of time (going from IF/Instruction Fetch to WB/Writeback)? Or is this the wrong conclusion?
In a single-cycle, non-pipelined design, instructions do not all necessarily take the same amount of time to do their work; rather, the next instruction simply cannot start until the next clock cycle. The current instruction may finish its work before the cycle ends, because the cycle length is set by the slowest instruction (e.g. a register add finishes before a load on a RISC machine).
In a pipelined design, the processor is split into multiple stages, with registers between them to store and propagate the processor's state. We save time by overlapping the sub-stages of consecutive instructions, so even though an individual instruction's latency increases, the overall time decreases. Also note that not every instruction needs to do work in every stage (compare a load with an add again).
So the overall latency of each instruction spans all the stages, even though the work it actually performs may take fewer cycles.
In other words, you can say that the latency of each instruction is the same, but not the execution time or the cycles actually consumed.
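To make the latency-vs-throughput distinction concrete, here is a small back-of-the-envelope calculation (my own illustration, assuming an idealized k-stage pipeline with no stalls or hazards):

#include <stdio.h>

int main(void)
{
    const unsigned long long n = 1000;   /* number of instructions */
    const unsigned long long k = 5;      /* pipeline stages, e.g. IF ID EX MEM WB */

    /* Single-cycle: each instruction gets one long cycle, roughly as long
       as all k stages combined. */
    unsigned long long single_cycle_clocks = n * k;

    /* Pipelined: k short clocks to fill the pipe, then one instruction
       completes per clock.  Per-instruction latency is still k clocks;
       only the throughput improves. */
    unsigned long long pipelined_clocks = k + (n - 1);

    printf("single-cycle: %llu clocks, pipelined: %llu clocks\n",
           single_cycle_clocks, pipelined_clocks);
    return 0;
}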

Long time between submit and start time of a command in OpenCL

I'm running a kernel on a big array. When I profile the clEnqueueNDRange command, the execution time (end - start) is 0.001 ms, but the time between submit and start (start - submit) is around 120 ms, and it varies with the size of the input data. What happens between the time a command is submitted and the time it starts to execute? Is it reasonable for this gap to be so large?
OpenCL operates asynchronously. That is to say, when you ask for a piece of work to be done, it may not happen at that time; it will happen at some time in the future. This is a little weird, especially when you start profiling things, but it works like this so that the CPU can queue up lots of work for the OpenCL device and then go do something else while the work is done.
For example:
clEnqueueWriteBuffer(blah);
clEnqueueNDRange(blah);
clEnqueueReadBuffer(blah, but blocking_read = CL_TRUE);
Here, the writeBuffer and the NDRange will probably appear to take very small amounts of time. All they do is record what needs to be done. The blocking readBuffer will take a long time, because it has to wait for the results of the read, and before the read can even start, the write and the kernel execution have to complete.
Now the read itself might be very small, but because it's waiting for everything before it to finish, the time it appears to take depends on the amount of work in the commands ahead of it.
I don't quite understand what you're measuring from your question, but I expect what you're seeing is this effect. The time for work is being charged to other functions because they have to wait for previous work to finish.
Knowing which functions cause the CPU to wait on the GPU is one of the big tricks when it comes to writing high-performance code. Any time you introduce a wait like this, the CPU stops doing useful work, and the GPU is likely to go idle while the CPU prepares the next lump of work. Sometimes there's no alternative, and you just have to wait.
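If you want to see exactly where the time goes, OpenCL's event-profiling API can report the queued/submit/start/end timestamps of each command. Here is a rough sketch (error checking omitted; it assumes the queue was created with CL_QUEUE_PROFILING_ENABLE and that the kernel arguments are already set):

#include <CL/cl.h>
#include <stdio.h>

void print_submit_to_start_gap(cl_command_queue queue, cl_kernel kernel,
                               size_t global_size)
{
    cl_event evt;
    cl_ulong submitted, started, ended;

    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                           0, NULL, &evt);
    clWaitForEvents(1, &evt);   /* make sure the timestamps are final */

    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_SUBMIT,
                            sizeof(submitted), &submitted, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                            sizeof(started), &started, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                            sizeof(ended), &ended, NULL);

    /* The submit->start gap is where the command sits waiting for earlier
       work (and for the device to pick it up); start->end is the actual
       kernel execution time. */
    printf("submit->start: %llu ns, start->end: %llu ns\n",
           (unsigned long long)(started - submitted),
           (unsigned long long)(ended - started));

    clReleaseEvent(evt);
}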

Never terminating cpu bound c program

As you know, there are two kinds of processes: I/O-bound and CPU-bound.
I need a CPU-bound program that never terminates.
For example, is this what I want?
while (1) {
    for (int i = 0; i < 1000; i++)
        ;
}
for(;;); should do it!
First of all, why do you want a never terminating CPU bound program?
And yes, that would work, but you don't really need the inner for-loop. The while-loop will run forever on its own (assuming the compiler doesn't optimize it away).
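For example, something like this minimal sketch is guaranteed to keep burning CPU at any optimization level, because the volatile increment can't be removed:

#include <stdint.h>

int main(void)
{
    volatile uint64_t ticks = 0;   /* volatile: every increment must really happen */
    for (;;)
        ticks++;                   /* pure CPU work, no I/O, never returns */
}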
There aren't only two kinds of processes. Even if you consider what resource is bounding the performance, there are more than two. The classic other ones are bandwidth, memory, database connections -- any finite resource or blocking one can be a bottleneck.
But, yes, your process is CPU-bound -- you can see that by looking at your task manager (Windows) or Activity Monitor (Mac) or top (Linux) and seeing it take 100% of your CPU.
Those programs probably aren't CPU bound.
I suggest implementing the Sieve of Eratosthenes, or something like that. How about a program that takes a number (say 42), divides it by Pi 1000 times, multiplies it by Pi 1000 times, subtracts the result from the original number, adds it to a variable and increments a counter. Then repeat that indefinitely. I suppose you might overflow one of the numeric values, but that should be fixable / preventable.
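A rough sketch of that last suggestion (my own code; I use 100 divisions instead of 1000 so the intermediate value stays comfortably inside double range, and I periodically reset the accumulator so nothing ever overflows):

int main(void)
{
    const double pi = 3.14159265358979323846;
    volatile double start = 42.0;   /* volatile: stops the compiler from folding the whole loop */
    volatile double acc = 0.0;
    unsigned long long counter = 0;

    for (;;) {
        double x = start;
        double y = x;
        for (int i = 0; i < 100; i++) y /= pi;   /* divide by Pi repeatedly */
        for (int i = 0; i < 100; i++) y *= pi;   /* multiply it back */
        acc = acc + (x - y);    /* accumulate the tiny rounding error */
        counter++;
        if ((counter & 0xFFFFFF) == 0)
            acc = 0.0;          /* periodic reset: nothing can overflow */
    }
}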
