Instruction cycle (PIC18) - fetch

I'm trying to understand the steps that it takes to go through an instruction and their relationship with each oscillator cycle. The datasheet of the PIC18F4321 seems to divide this process into 2 basic steps: fetch and execution. But it does not seem to be consistent when saying which step belongs to which oscillator cycle. For example, it says:
Internally, the program counter is incremented on every Q1; the
instruction is fetched from the program memory and latched into
the Instruction Register (IR) during Q4.
This sounds odd, because it didn't mention Q2 and Q3. From this alone I would almost be led into thinking that fetching takes 1 oscillator cycle, since it happens in Q4. But reading just a little further, it says that:
The instruction fetch and execute are pipelined in such a manner that
a fetch takes one instruction cycle, while the decode and execute
take another instruction cycle. However, due to the pipelining, each
instruction effectively executes in one cycle.
So now it is telling me that fetching takes Q1 through Q4. Based on that, I would assume that if it were not for pipelining, instructions would take 2 instruction cycles to go through, since a full instruction cycle is devoted to fetching alone. But I understand how in practice pipelining makes it seem like it only takes 1 instruction cycle to go through an instruction.
Still a little bit further, and I believe this is the most confusing part, it says that:
In the execution cycle, the fetched instruction is latched into the
Instruction Register (IR) in cycle Q1. This instruction is then
decoded and executed during the Q2, Q3 and Q4 cycles. Data memory is
read during Q2 (operand read) and written during Q4
(destination write).
Based on this and other sources I have read, it seems to divide the execution part into decoding, reading, processing and writing (which confuses me, because it keeps using the word "execution" when I don't think it's actually referring to the execution portion of "fetch and execution").
1) Now, when does it do each? It is very clear when it says that read/write will happen in Q2/Q4. So Q3 should be processing?
2) What is the oscillator cycle for decoding?
3) Why do you have to latch the instruction to IR again in Q1 if you just did that in Q4 when you fetched for this same instruction?

disclaimer: I've never written PIC asm code, let alone done any performance analysis of a PIC. I mostly know about more powerful CPUs, like x86, from reading http://agner.org/optimize/, and stuff on http://realworldtech.com/. This answer is just based on the snippets of the manual you put in your question, because they do make sense to me. I might be completely misinterpreting something.
So in terms of the external clock, it's a 2 cycle pipeline (fetch|execute), with a quad-pumped clock in the execution core. The execution stage is subdivided into 4 pipelined stages. A bit like how Pentium4 had double-pumped execution units (i.e. one pipeline stage that uses a faster clock).
So Q3 should be processing?
It sounds like yes, instruction execution happens in Q3.
2) What is the oscillator cycle for decoding?
I don't understand the question. It decodes one instruction per input clock, using the unmultiplied clock.
3) Why do you have to latch the instruction to IR again in Q1 if you
just did that in Q4 when you fetched for this same instruction?
It sounds like the PC is incremented in Q1, so during instruction execution it points to the next instruction. In Q4, that next instruction is done being fetched into IR in preparation for executing it next cycle. This is the instruction data itself (i.e. what PC is pointing to). I'm not sure about this part, but this makes sense.
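To make the overlap concrete, here is a tiny C sketch of how I picture the timeline. This is my own model built only from the datasheet snippets quoted above; in particular, labelling Q3 as "process" is an inference, not something the datasheet states.

#include <stdio.h>

/* Toy model of the PIC18 fetch/execute overlap described above.
 * Each instruction cycle is Q1..Q4 of the oscillator.  Instruction i is
 * fetched during instruction cycle i and executed during cycle i+1, so at
 * any time one instruction is in fetch while the previous one executes. */
int main(void)
{
    const int n_insn = 4;
    for (int cyc = 1; cyc <= n_insn + 1; cyc++) {
        printf("instruction cycle %d:\n", cyc);
        if (cyc <= n_insn)
            printf("  fetch   I%d: Q1 PC++          ... Q4 latch into IR\n", cyc);
        if (cyc >= 2)
            printf("  execute I%d: Q1 latch IR, Q2 operand read, Q3 process, Q4 dest write\n",
                   cyc - 1);
    }
    return 0;
}

While instruction N occupies Q1..Q4 of its execute cycle, instruction N+1 is fetched during those same Q1..Q4, which is why each instruction appears to complete in one instruction cycle.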

Related

Arguing whether a situation leads to data hazard or not

I was going through the section on pipelining in the text Computer Organization [5e] by Hamacher et al. There I came across a situation which the authors claim causes a data hazard.
The situation is shown below:
For example, stage E in the four-stage pipeline of Figure 8.2b is responsible for arithmetic and logic operations, and one clock cycle is assigned for this task. Although this may be sufficient for most operations, some operations, such as divide, may require more time to complete. Figure 8.3 shows an example in which the operation specified in instruction I2 requires three cycles to complete, from cycle 4 through cycle 6. Thus, in cycles 5 and 6, the Write stage must be told to do nothing, because it has no data to work with. †: Meanwhile, the information in buffer B2 must remain intact until the Execute stage has completed its operation. This means that stage 2 and, in turn, stage 1 are blocked from accepting new instructions because the information in B1 cannot be overwritten. Thus, steps D4 and F5 must be postponed as shown.
... Any condition that causes the pipeline to stall is called a hazard. We have just seen an example of a data hazard. A data hazard is any condition in which either the source or the destination operands of an instruction are not available at the time expected in the pipeline.
In the example above, the authors assume that a data hazard has occurred, and two stall cycles are introduced into the pipeline. The main reason they give for this data hazard is that the execute phase of instruction I2 requires 2 more cycles than usual, so the data on which the write-back stage should work has to wait for 2 cycles...
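For concreteness, here is a small C simulation I wrote that reproduces the book's stall pattern under its own assumptions (a single, non-pipelined E unit and single inter-stage buffers with no queuing). It is only a model, not something from the text.

#include <stdio.h>

#define N_INSN  5
#define N_STAGE 4   /* F, D, E, W */

int main(void)
{
    /* Stage durations per instruction; I2's E stage takes 3 cycles, as in the book. */
    int dur[N_INSN][N_STAGE];
    for (int i = 0; i < N_INSN; i++)
        for (int s = 0; s < N_STAGE; s++)
            dur[i][s] = 1;
    dur[1][2] = 3;                      /* I2 spends 3 cycles in E */

    /* start[i][s] = first cycle instruction i occupies stage s.
     * With single inter-stage buffers (no queuing), i may enter stage s only
     * when it has finished s-1 AND i-1 has moved on into stage s+1. */
    int start[N_INSN][N_STAGE];
    for (int i = 0; i < N_INSN; i++) {
        for (int s = 0; s < N_STAGE; s++) {
            /* fetch can begin no earlier than cycle i+1 (one fetch issued per cycle) */
            int ready = (s == 0) ? i + 1 : start[i][s - 1] + dur[i][s - 1];
            int freed = 1;
            if (i > 0)
                freed = (s == N_STAGE - 1) ? start[i - 1][s] + dur[i - 1][s]
                                           : start[i - 1][s + 1];
            start[i][s] = ready > freed ? ready : freed;
        }
    }

    const char *name = "FDEW";
    for (int i = 0; i < N_INSN; i++) {
        printf("I%d:", i + 1);
        for (int s = 0; s < N_STAGE; s++)
            printf(" %c@%d", name[s], start[i][s]);
        printf("\n");
    }
    return 0;
}

With I2 occupying E in cycles 4-6, the model has I3 decoding in cycle 4 but entering E only in cycle 7, and I4's decode and I5's fetch both pushed to cycle 7, matching the book's statement that D4 and F5 must be postponed.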
But I am having a little difficulty accepting this analysis. Usually, books give examples of data hazards in situations where there is a data dependency (the usual RAW, WAR, etc.). But here there is no such thing. And I thought this was a structural hazard, assuming that I2 cannot use the EX stage while I1 is using it.
Moreover, the text assumes that there is no queuing of stage results in the inter-stage buffers, which is clear from the statement marked with † ("Meanwhile, the information in the buffer..."). There is a small flaw there as well: if there is no queuing, then the output of D3 in cycle 4 would overwrite the value in buffer B2 on which the EX stage is still working, contradicting their own assumption.
I thought the stalls were introduced because of this no-queuing condition and the structural hazard, and that if things were managed properly, as shown below, there would be no stalls.
This is what I assume:
I assume that the execute stage has more than one separate functional unit (e.g. one basic ALU with a 1-cycle duration, where the calculation of instruction 1 is performed, one for integer division, another for integer multiplication, etc.). [So the structural hazard is out of the way now.]
I also assume that the pipeline buffers can store the results produced in the stages in a queue. [So that the problem in statement marked with † is no longer there.]
This being said, the situation is now as follows:
However hard I tried with these assumptions, I could not remove the bubbles shown in blue. [Even if queuing is assumed in the buffers, the buffers cannot hand results on out of order, so those stalls remain.]
With this exercise of mine, I feel that the example shown in the text is indeed a hazard, and a data hazard at that (even though there are no data dependencies?), since in my exercise there was no possibility of a structural hazard...
Am I correct?
And I thought this to be a structural hazard assuming that I2 cannot use the EX stage as I1 is using it.
Yup, that's the terminology I'd use, based on wikipedia: https://en.wikipedia.org/wiki/Hazard_(computer_architecture).
That article restricts data hazards to only RAW, WAR, and WAW. As such, they're only visible when you consider which operands are being used.
e.g. an independent multiply (result not read by the next few insns) could be allowed to complete out of order, after executing in a separate multi-cycle or pipelined multiplier unit.
Write-back conflicts would be a problem if the slow ALU instruction needed to write a GPR in the same cycle as a later add or something. Also data hazards like WAW, since mul r3, r2, r1 / sw r3, (r4) / add r3, r2, r1 should leave r3 = r2+r1 not r2*r1.
MIPS solved all that with the special hi:lo reg pair for mult/div, allowing the mul and div units to be loosely coupled to the 5-stage pipeline. And the ISA had pretty relaxed rules about what was allowed to happen to those registers, e.g. writing one with mthi r3 destroys the previous value of the other, so mflo r2 would give unpredictable results after mthi. Raymond Chen's article.
An "in-order pipeline" means instructions start execution in program order, no necessarily that they complete in program order. It's very common for modern in-order pipelines to allow memory operations to complete out of order, allowing memory-level parallelism and allowing instruction scheduling to hide load-use latency of L1d cache hits. It's also possible to pipeline higher-latency ALU operations as long as hazards are detected and handled somehow.
Do these authors use the term "structural hazard" at all, or do they consider all (non-control?) hazards to be data hazards?
At this point it seems like primarily a terminology issue. IDK if they're on their own in using terminology this way, or if there is another convention with any popularity other than the one Wikipedia describes.
Separate from your main question: in clock cycles 4 and 5, you have two instructions in the E stage at the same time. If something stalls in the E stage, the stall bubbles need to come before the E stage in later instructions, like in the Fig 8.3 image you linked from the book.
And yeah, it's weird that they talk about the pipeline register between stages needing to stay constant. If a multi-cycle non-pipelined execution unit needs to keep values around, it could snapshot them.
Unless maybe the stall signal makes the Decode stage keep generating that output repeatedly until the stall signal is de-asserted and the pipeline register will finally latch the output of the previous stage instead of ignoring it. There are latches / flip-flops that have a control signal separate from the clock that makes them ignore their input and keep outputting what they were already outputting.
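As a rough software analogy (my own sketch, nothing from the book), such a register behaves like this on every clock edge:

/* Toy model of a pipeline register with a write-enable separate from the clock:
 * while the stall signal is asserted, the register ignores its input and keeps
 * driving its old output, which is one way the "B1 cannot be overwritten"
 * behaviour could be implemented without extra snapshot storage. */
typedef struct { unsigned value; } pipe_reg;

static void clock_edge(pipe_reg *r, unsigned next_stage_input, int stall)
{
    if (!stall)
        r->value = next_stage_input;  /* normal case: latch the previous stage's output */
    /* stalled: hold the current contents and keep presenting the same data */
}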

Why is this jump instruction so expensive when performing pointer chasing?

I have a program that performs pointer chasing and I'm trying to optimize the pointer chasing loop as much as possible.
I noticed that perf record detects that ~20% of execution time in function myFunction() is spent executing the jump instruction (used to exit out of the loop after a specific value has been read).
Some things to take note:
the pointer chasing path can comfortably fit in the L1 data cache
using __builtin_expect to avoid the cost of branch misprediction had no noticeable effect
perf record has the following output:
Samples: 153K of event 'cycles', 10000 Hz, Event count (approx.): 35559166926
myFunction /tmp/foobar [Percent: local hits]
Percent│ endbr64
...
80.09 │20: mov (%rdx,%rbx,1),%ebx
0.07 │ add $0x1,%rax
│ cmp $0xffffffff,%ebx
19.84 │ ↑ jne 20
...
I would expect that most of the cycles spent in this loop are used for reading the value from memory, which is confirmed by perf.
I would also expect the remaining cycles to be somewhat evenly spent executing the remaining instructions in the loop. Instead, perf is reporting that a large chunk of the remaining cycles are spent executing the jump.
I suspect that I can better understand these costs by understanding the micro-ops used to execute these instructions, but I'm a bit lost on where to start.
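For reference, here is roughly what such a loop looks like in C, reconstructed from the disassembly above. The function and variable names are made up; only the per-iteration shape is taken from the listing.

#include <stddef.h>
#include <stdint.h>

/* Sketch of the hot loop: %rdx = base pointer, %ebx = current byte offset,
 * %rax = hop counter.  Each iteration loads the next 32-bit offset from
 * base+offset and stops on the sentinel 0xffffffff.  Not the actual code. */
static size_t chase(const uint8_t *base, uint32_t first)
{
    uint32_t off = first;
    size_t hops = 0;
    do {
        off = *(const uint32_t *)(base + off);   /* mov (%rdx,%rbx,1),%ebx     */
        hops++;                                  /* add $0x1,%rax              */
    } while (off != 0xFFFFFFFFu);                /* cmp $0xffffffff,%ebx / jne */
    return hops;
}

The only loop-carried dependency is off, so every iteration is serialized on the load latency; the add and the compare-and-branch simply wait alongside it.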
Remember that the cycles event has to pick an instruction to blame, even if both mov-load and the macro-fused cmp-and-branch uops are waiting for the result. It's not a matter of one or the other "costing cycles" while it's running; they're both waiting in parallel. (Modern Microprocessors: A 90-Minute Guide! and https://agner.org/optimize/)
But when the "cycles" event counter overflows, it has to pick one specific instruction to "blame", since you're using statistical-sampling. This is where an inaccurate picture of reality has to be invented by a CPU that has hundreds of uops in flight. Often it's the one waiting for a slow input that gets blamed, I think because it's often the oldest in the ROB or RS and blocking allocation of new uops by the front-end.
The details of exactly which instruction gets picked might tell us something about the internals of the CPU, but only very indirectly. Like perhaps something to do with how it retires groups of 4(?) uops, and this loop has 3, so which uop is oldest when the perf event exception is taken.
The 4:1 split is probably significant for some reason, perhaps because 4+1 = 5 cycle latency of a load with a non-simple addressing mode. (I assume this is an Intel Sandybridge-family CPU, perhaps Skylake-derived?) Like maybe if data arrives from cache on the same cycle as the perf event overflows (and chooses to sample), the mov doesn't get the blame because it can actually execute and get out of the way?
IIRC, BeeOnRope or someone else found experimentally that Skylake CPUs would tend to let the oldest un-retired instruction retire after an exception arrives, at least if it's not a cache miss. In your case, that would be the cmp/jne at the bottom of the loop, which in program order appears before the load at the top of the next iteration.

Pipelining affects the clock time or cycle-per-instruction(CPI)?

My book mentions "Depending on what you consider as the baseline, the reduction can be viewed as decreasing the number of clock cycles per instruction (CPI), as decreasing the clock cycle time, or as a combination. If the starting point is a processor that takes multiple clock cycles per instruction, then pipelining is usually viewed as reducing the CPI."
What I fail to understand is whether pipelining affects the CPI or the clock period. In the case of pipelining, the clock period is taken as (max stage delay + latch delay), so pipelining does affect the clock time. It also affects the CPI, because the CPI becomes 1 with pipelining. Am I missing some concept?
Executing an instruction requires a set of operations. For the sake of simplicity assume there are 5:
fetch-instruction decode-execute-memory access-write back.
This can be implemented with several schemes.
A/ Mono-cycle processor
The scheme is the following:
The processor fetches an instruction, directs it to a decoder that controls a bank of multiplexers that will configure a large combinatorial datapath that will implement the instruction.
In this model, every instruction requires one cycle, and, assuming all the 5 "stages" require an equal time t, the period will be 5t.
Hence CPI=1, T=5
Actually, this was more or less the underlying model of the earliest computers in the late 40s. Apart from those, no real processor has been built like this, but it is theoretically quite doable.
B/ Multi-cycle processor
Compared to the previous model, you introduce registers into the datapath. The processor first fetches the instruction and sends it to the inputs of an automaton that sequentially applies the computation "stages".
In that case, instructions require 5 cycles (maybe slightly fewer, as some instructions may be simpler and, for instance, skip the memory access). The period is 1t (or maybe slightly more to take the register traversal time into account).
CPI=5, T=1
The first "true" computers were implemented like that and this was the main architectural model up to the early 80's. Nowadays several microcontrollers or, for instance, the simpler version of NIOS, are still relying on this scheme.
C/ Pipelined processor
You add extra registers between the stages in order to keep track of the instruction and of all the partial results. In that case, every stage can operate independently and you can execute several instructions simultaneously in different stages.
CPI becomes 1, as you can start a new instruction at every clock cycle (probably a bit more because of the hazards, but that is another story).
And T=1.
So CPI=1, T=1
(the CPI reflects the throughput increase but the execution time of a single instruction is not reduced)
So pipelining can be seen as either reducing the cycle time (wrt scheme A) or reducing the CPI (wrt scheme B). And you can also imagine an intermediate scheme (say 3 stages, with a period of 2t) where pipelining reduces both.
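To put rough numbers on the comparison (just the arithmetic from the three schemes above, ignoring hazards and latch overhead):

#include <stdio.h>

/* Total time for N instructions under the three schemes, in units of the
 * stage delay t.  N = 1000 is an arbitrary example. */
int main(void)
{
    const long N = 1000;
    const int  stages = 5;

    long mono  = N * stages;        /* A: CPI = 1, period 5t  ->  5*N*t */
    long multi = N * stages;        /* B: CPI = 5, period 1t  ->  5*N*t */
    long pipe  = (stages - 1) + N;  /* C: fill the pipe once, then one
                                       instruction completes every t    */

    printf("mono-cycle : %ld t\n", mono);
    printf("multi-cycle: %ld t\n", multi);
    printf("pipelined  : %ld t (about %ldx faster)\n", pipe, mono / pipe);
    return 0;
}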

Understanding STM8 pipelining

I’m trying to understand STM8 pipelining to be able to predict how much cycles my code will need.
I have this example, where I toggle a GPIO pin, intending 4 cycles high and 4 cycles low.
If (and only if) the loop is aligned at a 4-byte boundary + 3, the pin stays high for 5 cycles (i.e. one more than it should). I wonder why?
// Switches port D2, 5 cycles high, 4 cycles low
void main(void)
{
    __asm
        bset 0x5011, #2     ; output mode
        bset 0x5012, #2     ; push-pull
        bset 0x5013, #2     ; fast switching
        jra _loop
        .bndry 4
        nop
        nop
        nop
    _loop:
        nop
        bset 0x500f, #2
        nop
        nop
        nop
        bres 0x500f, #2
        jra _loop
    __endasm;
}
A bit more context:
bset/bres are 4-byte instructions, nop is 1 byte.
The nop/bset/bres instructions take 1 cycle each.
The jra instruction takes two cycles. I think in the first cycle, the instruction cache is filled with the next 32-bit value, i.e. in this case the nop instruction only. And the 2nd cycle is actually just the CPU being stalled while decoding the next instruction.
So in cycles:
1. bres clears the pin
2. jra, pipeline flush, nop fetch
3. nop decode, bset fetch
4. nop execute, bset decode, next nop fetch
5. bset execute sets the pin
6. nop, bres fetch
7. nop
8. nop, bres decode
9. bres execute clears the pin
According to this, the pin should stay LOW for 4 cycles and HIGH for 4 cycles, but it’s staying HIGH for 5 cycles.
In any other alignment case, the pin is LOW/HIGH for 4 cycles as expected.
I think, if the pin stays high for an extra cycle, that must mean that the execution pipeline is stalled after the bset instruction (the nops thereafter provide enough time to make sure that the later bres is ready to execute immediately). But according to my understanding, the nop of cycle 6 would already have been fetched in cycle 4.
Any idea how this behavior can be explained? I couldn’t find any hints in the manual.
It is explained in section 5.4, which basically says that throughout the programming manual, "a simplified convention providing a good match with reality" will be used. From my experience, this simplified convention is indeed a good approximation for a longer sequence, but unusable for exact per-instruction timing, even if you're working at assembly level and control alignment. Take "SLA addr" as an example. It is documented to use 1 cycle. Put three of them in sequence to implement the C equivalent of "*(addr) << 3", and you'll clock up 5-6 cycles.
Actual cycles used for decoding and execution are undocumented. Apart from the obvious reasons, there is no comprehensive documentation about what causes pipeline stalls. I was able to get some insight into this by configuring TIM2 with a prescaler of /1 and reload values of 0xFFFF while using ST-LINK/V2 to step through my code. You can then keep a watch on TIM2_CNTRL to see cycles consumed (== the aggregate value of executing the previous and decoding the current instruction).
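For what it's worth, here is a sketch of that TIM2 setup in C. The register addresses below are placeholders for a typical STM8S device; verify them against your part's datasheet or use the vendor header instead.

#include <stdint.h>

#define TIM2_CR1   (*(volatile uint8_t *)0x5300)  /* placeholder address */
#define TIM2_CNTRL (*(volatile uint8_t *)0x530D)  /* placeholder address */
#define TIM2_PSCR  (*(volatile uint8_t *)0x530E)  /* placeholder address */
#define TIM2_ARRH  (*(volatile uint8_t *)0x530F)  /* placeholder address */
#define TIM2_ARRL  (*(volatile uint8_t *)0x5310)  /* placeholder address */

/* Let TIM2 count every peripheral clock so that single-stepping in the
 * debugger shows cycles consumed per step (keep a watch on TIM2_CNTRL). */
static void tim2_cycle_counter_init(void)
{
    TIM2_PSCR = 0x00;   /* prescaler /1                           */
    TIM2_ARRH = 0xFF;   /* auto-reload 0xFFFF: free-running count */
    TIM2_ARRL = 0xFF;
    TIM2_CR1 |= 0x01;   /* CEN: enable the counter                */
}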
Things to keep an eye on are obviously instructions spanning 32-bit boundaries. There were also cases where loading instructions from the next 32-bit word caused an unexpected additional cycle in a sequence of NOPs, suggesting that any fetch (even if not necessary for the current or next instruction) costs 1 cycle? I've seen CALLs to targets aligned to 32 bit boundaries taking 4-7 cycles, suggesting that the CPU was still busy executing the previous instruction or stalling the call for unknown reason. Modifying the SP (push/pop or direct add/sub) seems to be causing stalls under certain conditions.
Any additional insight appreciated!

Single-cycle vs a pipelined approach

I understand that single-cycle processors are not very efficient. One reason is that not all instructions need the same amount of time, yet in a single-cycle implementation every instruction is given the same length of time.
With pipelining, throughput is increased, which means the time between one output and the next will be shorter than in a single-cycle implementation once you reach a certain point. But can you then say that instructions in a pipelined approach take the same amount of time (going from IF/Instruction Fetch to WB/Writeback)? Or is this the wrong conclusion?
Note that in a single-cycle, non-pipelined design the instructions do not all necessarily take the same amount of time; rather, the next instruction cannot start until the next clock cycle. The current instruction may finish before its cycle ends, because the cycle length is set by the longest instruction, e.g. a register add completes before a load in a RISC.
Now, in a pipelined design the processor is multi-stage, with registers to store and propagate the state of each instruction. We save time by overlapping the sub-stages of consecutive instructions, so even though the time an individual instruction spends in the pipeline increases, the overall time for a sequence of instructions is reduced. Also note that not every instruction needs to do work in all the stages (compare a load and an add again).
So the overall latency of each instruction spans all the stages, but its actual execution may have needed fewer cycles.
So you can say that the latency of each instruction is the same, but not the execution time or the cycles it actually consumes.
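A small numeric illustration of that last point, with made-up but plausible delays:

#include <stdio.h>

/* Latency vs. throughput: 5 stages, up to 200 ps of logic per stage,
 * 50 ps of latch overhead per stage when pipelined.  The numbers are
 * invented purely to show the distinction. */
int main(void)
{
    const int    stages      = 5;
    const double stage_logic = 200.0;   /* ps */
    const double latch       = 50.0;    /* ps */

    double single_cycle_clk = stages * stage_logic;   /* 1000 ps clock */
    double pipelined_clk    = stage_logic + latch;    /*  250 ps clock */

    printf("single-cycle: latency %.0f ps, one result every %.0f ps\n",
           single_cycle_clk, single_cycle_clk);
    printf("pipelined   : latency %.0f ps, one result every %.0f ps\n",
           stages * pipelined_clk, pipelined_clk);
    return 0;
}

Per-instruction latency actually gets a little worse (1250 ps vs 1000 ps here), but a result is produced every 250 ps instead of every 1000 ps.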

Resources