Why should we store NPC in the pipeline register? - fetch

I have only been studying pipelining for a few hours. Perhaps this is an easy question, but I really need your help.
I know that we should store mem[pc] in the IF/ID pipeline register in the fetch stage, because we will decode it in the next stage. We should also update PC in the fetch stage, because we will fetch the next instruction via that updated PC in the next cycle. But I really don't understand why we should also store NPC in the pipeline register.
Below is an explanation from Computer Organization and Design that I don't get:
This incremented address is also saved in the IF/ID pipeline register in case it is
needed later for an instruction, such as beq

The reason for saving NPC in the pipeline is because sometimes the next instruction in the pipeline will want to use it.
Look at the definition of beq. It has to compute the target address of the branch. Some branches use a fixed location for the target address, like "branch to address A." This is called "branching to an absolute address."
Another kind of branch is a "relative" branch, where the branch target is not an absolute address but an offset, that is, "branch forward X instructions." (If X is negative, this ends up being a backwards branch.) Now consider this: forwards/backwards from where? From NPC. That is, for a relative branch instruction, the computation for the new PC value is:
NewPC = NPC + X
Why do architectures include the ability to perform relative branches? Because it takes less space. Let's say that X has a small value, like 16. The storage required for an absolute branch to a target address is:
sizeof(branch opcode) + sizeof(address)
But the storage for a relative branch of offset 16 is only:
sizeof(branch opcode) + 1 ## number of bytes needed to hold the value 16!
Of course, larger offsets can be accommodated by increasing the number of bytes used to hold the offset value. Other kinds of space-saving, range-increasing representations are possible too.
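To make NewPC = NPC + X concrete, here is a minimal Python sketch of how a MIPS-style beq target could be computed. It assumes the MIPS convention that X is a 16-bit signed immediate counted in instructions (so it is shifted left by 2 to get bytes) and that NPC = PC + 4. The function names are made up for illustration.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return value - (1 << bits) if value & sign_bit else value


def branch_target(pc: int, imm16: int) -> int:
    """Target of a taken relative branch located at address `pc`."""
    npc = pc + 4                          # the value saved in IF/ID as NPC
    offset = sign_extend(imm16, 16) << 2  # instructions -> bytes
    return (npc + offset) & 0xFFFFFFFF    # wrap to a 32-bit address
```

For example, branch_target(0x1000, 4) skips four instructions past NPC, while an immediate of 0xFFFF (that is, -1) lands back on the branch itself. The point is that the adder needs NPC, not the already-overwritten PC, by the time the branch is decoded.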

If the exception point is in a branch-delay slot, then one needs two PCs to restart execution: one that
points at the exceptional instruction (delay slot) and another that points at the next instruction. The
second PC is needed because the instruction following the delay slot could be either the next
sequential instruction (if the branch was not taken) or the branch target (if the branch was taken).
Although MIPS has the same issue, it relies on software to back up the exception point to the previous
instruction (when it is a branch) before restarting execution; this works because branches are
idempotent.
Credits: http://www.cs.berkeley.edu/~kubitron/courses/cs252-S09/handouts/oldquiz/sp09-quiz1_soln.pdf

Related

Does a processor stall even with (theoretically) perfect branch prediction, irrespective of whether the branch is taken or not taken?

I am going through the textbook Computer Organization and Design and I am a bit confused with the Branch Prediction and how it works with a 5 stage pipeline scenario - IF ID EX MEM WB.
Consider the following sequence of instructions:
TOP: SUB X2, X2, X3
.
.
B.NE TOP
ADD X1, X1, X2
Assume the first case with no branch prediction and all possible forwarding paths. As per the textbook, when the Branch to the TOP is taken, the processor would incur a penalty of 1 stall. This is because after the B.NE instruction, the next instruction in the pipeline would be the ADD instruction when it should really have been the SUB instruction. The processor realizes that it inserted the incorrect instruction into the pipeline only at the end of the ID stage of the B.NE instruction and hence has to nop all the remaining stages of ADD (by the end of the ID stage it also manages to calculate the correct address to fetch the instruction from). So the pipeline in this case looks something like this:
However if the branch was not taken, there would have been no stalls. Because the next instruction would correctly have been the ADD instruction and the execution would have proceeded normally.
Now consider the same instructions and the same processor but with perfect branch prediction. Assume Branch is taken. The processor would know that the instruction is a Branch instruction only during the ID stage for the B.NE instruction. And the Branch Prediction would kick in only after that. By that time, the ADD instruction is already in the pipeline. Hence there would still be a penalty of 1 stall. So what is the advantage of even having the Branch Prediction? I am clearly missing something.
So I think I am confused about where exactly in the pipeline branch prediction kicks in.

Computer Organization - How does "Predict taken"(always taken) branch prediction work?

I can understand how "predict untaken" works. It just moves on to fetching the instruction at PC+4. Once the branch is resolved, if the branch is taken, the processor flushes all the instructions fetched in the meantime.
But I don't understand how "predict taken" works. I think the branch instruction needs to be in the decode stage (and the branch target address calculation needs to be completed) before the processor can predict that it will be taken, right?
Then how is "predict taken" implemented on a machine like the MIPS 5-stage pipeline (where the branch target address calculation and the taken/not-taken decision are made in the ID (instruction decode) stage)?
If the branch can be resolved in the ID stage, does that mean prediction is done in the IF (instruction fetch) stage?
I'm confused because someone said "predict taken" and "predict untaken" are called "static branch prediction", and that the compiler does all the work. So in the "predict taken" case, the compiler would insert the branch target instruction into the position after the branch instruction.
Is my understanding correct, or is his statement correct?
MIPS has branch-delay slots that hide branch latency for a simple 5-stage pipeline trivially for unconditional branches (detected in ID, the stage after fetch), and even for conditional branches by evaluating them in the first half of EX, in time to forward to 2nd half of IF. (MIPS I R2000 did that).
But yes, completely avoiding fetch bubbles requires predicting the existence of branches before they're decoded, along with their target addresses. (Including for unconditional direct branches). Real predictors do that. See Slow jmp-instruction for an example on modern x86.
But that's very far from classic 5-stage RISC.
If you were putting such a dynamic predictor into a 5-stage RISC without branch-delay slots, e.g. a simple RISC-V, you'd maybe have it actually check ahead of where fetch is currently fetching, so you have a prediction for what to fetch in the next cycle.
You'd only use static always-taken prediction for conditional branches. (And usually only with a backwards displacement because those are often loop branches; predicting forward branches to be not-taken works well in practice, especially when compilers / programmers lay out their code accordingly so the common case for if()-type branches is not-taken). By the time you can detect that there's a branch at all, you already know if it's unconditional and don't need any prediction in that case.
If you don't already use tricks like MIPS I early eval of branch conditions, your branch latency would be 2 cycles (IF to EX) for conditional branches. Static always-taken prediction would shorten that to 1 cycle (IF to ID). Not 0, as you say, because the not-taken path is still being fetched while the branch instruction itself is being decoded.
i.e. you could design the ID stage to resteer fetch for next cycle when it sees a conditional branch. (Possibly after checking the displacement for forwards / backwards, i.e. just the high bit of a 2's complement value.)
So you optimize for fall-through of forward branches and looping backward branches because those are relatively common. To do even better you'd use a cache of dynamic predictions that you index by address, or in various complex ways (e.g. TAGE uses recent branch history as part of the index, and see https://danluu.com/branch-prediction/ for historical progress from very simple to less simple predictors).
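As a rough illustration of the "check the high bit of the displacement" idea, here is a small Python sketch of a backward-taken / forward-not-taken decision that an ID stage could make. It assumes a MIPS-like encoding (16-bit word displacement relative to PC+4); the function names are hypothetical and this is not a model of any real core.

```python
def predict_taken(imm: int, bits: int = 16) -> bool:
    """Static prediction: taken only if the displacement is negative,
    i.e. the top bit of the two's-complement field is set."""
    return bool((imm >> (bits - 1)) & 1)


def static_next_pc(branch_pc: int, imm: int, bits: int = 16) -> int:
    """Address that fetch is steered to once ID has seen the branch."""
    imm &= (1 << bits) - 1
    fall_through = branch_pc + 4
    if predict_taken(imm, bits):
        offset = imm - (1 << bits)           # negative, since the sign bit is set
        return fall_through + (offset << 2)  # backward branch: predicted target
    return fall_through                      # forward branch: keep fetching sequentially
```

Note that this only decides where to fetch next; the instruction fetched while the branch was still in ID is already in the pipeline, which is where the 1-cycle cost comes from.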

How can the processor discern a far return from a near return?

Reading Intel's big manual, I see that if you want to return from a far call, that is, a call to a procedure in another code segment, you simply issue a return instruction (possibly with an immediate argument that moves the stack pointer up n bytes after the pointer popping).
This, apparently, if I'm interpreting things correctly, is enough for the hardware to pop both the segment selector and offset into the correct registers.
But, how does the system know that the return should be a far return and that both an offset AND a selector need to be popped?
If the hardware just pops the offset pointer and not the selector after it, then you'll be pointing to the right offset but wrong segment.
There is nothing special about the far return command compared to the near return version.
They both look identical as far as I can tell.
I assume then that the processor, perhaps at the micro-architecture level, keeps track of which calls are far and which are close so that when they're returned from, the system knows how many bytes to pop and where to pop them (pointer registers and segment selector registers).
Is my assumption correct?
What do you guys know about this mechanism?
The processor doesn't track whether or not a call should be far or near; the compiler decides how to encode the function call and return using either far or near opcodes.
As it is, FAR calls have no use on modern processors because you don't need to change any segment register values; that's the point of a flat memory model. Segment registers still exist, but the OS sets them up with base=0 and limit=0xffffffff so just a plain 32-bit pointer can access all memory. Everything is NEAR, if you need to put a name on it.
Normally you just don't even think about segmentation so you don't actually call it either. But the manual still describes the call/ret opcodes we use for normal code as the NEAR versions.
FAR and NEAR were used on old x86 processors, which used a segmented memory model. Programs at that time needed to choose which memory model they wished to support, ranging from "tiny" to "large". If your program was small enough to fit in a single segment, then it could be compiled using NEAR calls and returns exclusively. If it was "large", the opposite was true. For anything in between, you could choose whether individual functions needed to be callable from (and able to return to) code in another segment.
Most modern programs (besides bootloaders and the like) run on a different construct: they expect a flat memory model. Behind the scenes the OS will swap out memory as needed (with paging not segmentation), but as far as the program is concerned, it has its virtual address space all to itself.
But, to answer your question, the difference between the calls/returns is the opcode used; the processor obeys the command given to it. If you make a mistake (say, give it a FAR return opcode when in flat mode), it'll fail.
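To show that the distinction really is in the encoding, here is a small Python sketch that classifies the first byte of a return instruction. The four opcodes are the documented x86 ones (C3 and C2 iw for near returns, CB and CA iw for far returns); the function name is made up, and prefixes are ignored to keep it short.

```python
# opcode -> (kind, number of immediate bytes that follow)
RET_OPCODES = {
    0xC3: ("near", 0),  # ret
    0xC2: ("near", 2),  # ret imm16
    0xCB: ("far", 0),   # retf
    0xCA: ("far", 2),   # retf imm16
}


def classify_return(code: bytes):
    """Return (kind, extra_bytes_to_pop), or None if code[0] is not a return opcode."""
    entry = RET_OPCODES.get(code[0])
    if entry is None:
        return None
    kind, imm_len = entry
    # The optional imm16 is little-endian: how many extra stack bytes to release.
    extra = int.from_bytes(code[1:1 + imm_len], "little") if imm_len else 0
    return kind, extra
```

The assembler mnemonic may look the same in some listings, but the bytes differ, so the processor never has to remember how the procedure was called.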

Disassemble 68xx code without entry point vector

I am trying to disassemble code from an old radio containing a 68xx (68HC12-like) microcontroller. The problem is that I don't have access to the interrupt vectors of the micro at the top of the ROM, so I don't know where to start looking. I only have the code below the top. Are there any suggestions on where or how I can find meaningful routines in the code data?
You can't really disassemble reliably without knowing where the reset vector points. What you can do, however, is try to narrow down the possible reset addresses by eliminating all those other addresses that cannot possibly be a starting point.
So, given that any address in the memory map that contains a valid opcode is a potential reset point, you need to either eliminate it, or keep it for further analysis.
For the 68HC11 case, you could try to guess somewhat the entry point by looking for LDS instructions with legitimate operand value (i.e., pointing at or near the top of available RAM -- if multiple RAM banks, then to any of them).
It may help a bit if you know the device's full memory map, i.e., if external memory is used, its mapping and possible mapped peripherals (e.g., LCD). Do you also know CONFIG register contents?
The LDS instruction is usually either the very first instruction, or close thereafter (so look back a few instructions when you feel you have finally singled out your reset address). The problem here is some data may, by chance, appear as LDS instructions so you could end up with multiple potentially valid entry points. Only one of them is valid, of course.
You can eliminate further by disassembling a few instructions starting from each of these LDS instructions until you either hit an illegal opcode (i.e. obviously not a valid code sequence but an accidental data arrangement that looks like opcodes), or you see a series of instructions that are commonly used in 68HC11 initialization. These involve (usually) initialization of any one or more of the registers BPROT, OPTION, SCI, INIT ($103D in most parts, but for some $3D), etc.
You could write a relatively small script (e.g., in Lua) to do the basic scanning of the memory map and produce a (hopefully small) set of potential reset points to be examined further with a true disassembler for hints like the ones I mentioned.
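A minimal sketch of such a scanner, in Python rather than Lua, might look like the following. It assumes the 68HC11 encoding of LDS with an immediate operand (opcode $8E followed by a 16-bit big-endian value); other family members use different opcodes, so adjust for your part. The load address and RAM range are parameters you have to supply, and the function name is made up.

```python
LDS_IMMEDIATE = 0x8E  # 68HC11 "LDS #imm16" (assumption: check your part's opcode map)


def find_lds_candidates(rom: bytes, base: int, ram_lo: int, ram_hi: int):
    """Return (address, stack_pointer_value) pairs for every byte sequence that
    looks like LDS #imm16 with the operand inside the given RAM range."""
    hits = []
    for i in range(len(rom) - 2):
        if rom[i] == LDS_IMMEDIATE:
            operand = (rom[i + 1] << 8) | rom[i + 2]  # big-endian operand
            if ram_lo <= operand <= ram_hi:
                hits.append((base + i, operand))
    return hits
```

Each hit is only a candidate, since data bytes can look like an LDS by chance, so the next step is still to disassemble a few instructions around each address.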
Now, once you have the reset vector figured out the job becomes somewhat easier but you still need to figure out where any interrupt handlers are located. For this your hint is an RTI instruction and whatever preceding code that normally should acknowledge the specific interrupt it handles.
Hope this helps.

Pipeline and branch instructions

Let's suppose that 20 percent of the instructions in a program are branch instructions. The static prediction assumes that the branches are not taken.
I should find the execution time in two cases: when 30 percent of the branches are taken and when 70 percent of the branches are taken.
I should also find the speedup of one case compared to the other and express it as a percentage.
The thing is, how do I find the execution time here? I usually find the execution time when the pipeline is divided into phases and the time for each phase is given...
Edit: This is NOT homework. I found this in my computer architecture textbook and it's not a familiar exercise.
This question sounds like homework but the matter is worth some discussion.
We assume to have a static branch predictor that always predicts NOT TAKEN. This was the type of branch predictor of early SPARC and MIPS implementations. Such a branch predictor always fetches the next sequential instruction in the program.
Let me also assume that we have a simplified 4 stage pipeline made of Fetch (F), Decode (D), Execute (E) and Write Back (W). Consider the following simplified assembly program:
...
0xF1: JUMP <condition>, 0xF4
0xF2: ADD r1, r2, r3
0xF3: ADD r3, r4, r1
0xF4: ADD r1, r2, r3
When a branch is correctly predicted the pipeline behaves normally. The question is what happens to the pipeline when a branch is mispredicted, which in our case is when the condition of the JUMP instruction (0xF1) is true.
0xF1:  F D E W
0xF2:    F D X
0xF3:      F X
0xF4:        F
cycle: 1 2 3 4
In the Execute stage of the JUMP instruction we evaluate the condition and detect that the branch has to be taken. Due to the branch predictor policy, however, we already fetched instructions 0xF2 and 0xF3 and decoded 0xF2. The pipeline is flushed and at the next clock cycle the branch target is correctly fetched. As you can see from the pipeline we wasted 2 clock cycles fetching and decoding instructions that will not be executed. These 2 clock cycles are known as the branch penalty and you must take them into account when calculating the program's execution time.
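Putting numbers on the question, here is a small Python sketch of the usual way to fold the penalty into the execution time. It assumes an ideal CPI of 1 without mispredictions, the 2-cycle penalty of the 4-stage pipeline above, and the same instruction count and clock period in both cases; none of these are given in the exercise, so treat them as assumptions.

```python
def cycles_per_instruction(branch_fraction: float, taken_fraction: float,
                           penalty_cycles: float, base_cpi: float = 1.0) -> float:
    """Average CPI with predict-not-taken: only taken branches pay the penalty."""
    return base_cpi + branch_fraction * taken_fraction * penalty_cycles


cpi_30 = cycles_per_instruction(0.20, 0.30, 2)  # 30% of branches taken
cpi_70 = cycles_per_instruction(0.20, 0.70, 2)  # 70% of branches taken

# Execution time = instruction_count * CPI * clock_period, so with the other
# two factors equal the speedup is just the ratio of the CPIs.
speedup = cpi_70 / cpi_30
print(f"CPI(30%) = {cpi_30:.2f}, CPI(70%) = {cpi_70:.2f}, speedup = {speedup:.3f}")
```

Under these assumptions the CPIs come out to about 1.12 and 1.28, so the 30 percent case is roughly 14 percent faster than the 70 percent case.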
The world of branch predictors is much more complex in reality. More elaborate static branch predictors exist that, for instance, always predict a backward jump as TAKEN and a forward one as NOT TAKEN. To reduce the branch penalty, processors often employ a Branch Target Buffer (BTB), a small cache that stores the targets of recently executed JUMP instructions. Without a BTB, to predict a branch as TAKEN we have to wait until the Decode stage, where the instruction is identified as a JUMP and the target address is decoded. In the meantime we have fetched an instruction that will then be flushed. With a BTB, on the other hand, we can do branch prediction in the Fetch stage: if the Program Counter is in the BTB we know two things:
The fetched instruction is a branch
We have its target address
So we can predict the branch and, if it is predicted as TAKEN, fetch its target without any penalty.
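A toy Python sketch of that BTB lookup could look like this. It assumes an unbounded dictionary instead of a small fixed-size cache, a "predict taken on a hit" policy, and instruction addresses that advance by 1 as in the example program above; the class and method names are made up.

```python
class BranchTargetBuffer:
    """Toy BTB: maps the address of a recently taken branch to its target."""

    def __init__(self):
        self._targets = {}

    def predict_next_pc(self, pc: int, instr_size: int = 1) -> int:
        # Hit: pc is a known branch, so fetch from its stored target.
        # Miss: assume a non-branch (or not-taken branch) and fetch sequentially.
        return self._targets.get(pc, pc + instr_size)

    def update(self, pc: int, taken: bool, target: int) -> None:
        # Called when the branch resolves in the Execute stage.
        if taken:
            self._targets[pc] = target
        else:
            self._targets.pop(pc, None)
```

The first time the JUMP at 0xF1 is taken, fetch still goes to 0xF2 and pays the penalty; after update(0xF1, True, 0xF4), the next fetch from 0xF1 is followed directly by 0xF4.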
Modern processors also adopt dynamic branch predictors that use complex policies as well as some additional buffers to avoid mis-predictions.

Resources