I can understand how "predict untaken" works: the pipeline just keeps fetching the PC+4 instruction. If, when the branch is resolved, it turns out to be taken, all the instructions fetched after it are flushed.
But I don't understand how "predict taken" works. I think the branch instruction needs to be in the decode stage (and the branch-target-address calculation needs to be completed) before the processor can predict that it will be taken, right?
Then how is "predict taken" implemented on a machine like the MIPS 5-stage pipeline, where the branch target address is calculated and the branch outcome is decided at the ID (instruction decode) stage?
If the branch can be resolved at the ID stage, does that mean the prediction has to be made at the IF (instruction fetch) stage?
I'm confused because someone said "predict taken" and "predict untaken" are called "static branch prediction" and that the compiler does all the work. So in the "predict taken" case, the compiler would insert the branch-target instruction into the position after the branch instruction.
Is my understanding correct, or is his?
MIPS has branch-delay slots that hide branch latency for a simple 5-stage pipeline trivially for unconditional branches (detected in ID, the stage after fetch), and even for conditional branches by evaluating them in the first half of EX, in time to forward to 2nd half of IF. (MIPS I R2000 did that).
But yes, completely avoiding fetch bubbles requires predicting the existence of branches before they're decoded, along with their target addresses. (Including for unconditional direct branches). Real predictors do that. See Slow jmp-instruction for an example on modern x86.
But that's very far from classic 5-stage RISC.
If you were putting such a dynamic predictor into a 5-stage RISC without branch-delay slots, e.g. a simple RISC-V, you'd maybe have it actually check ahead of where fetch is currently fetching, so you have a prediction for what to fetch in the next cycle.
You'd only use static always-taken prediction for conditional branches. (And usually only with a backwards displacement because those are often loop branches; predicting forward branches to be not-taken works well in practice, especially when compilers / programmers lay out their code accordingly so the common case for if()-type branches is not-taken). By the time you can detect that there's a branch at all, you already know if it's unconditional and don't need any prediction in that case.
If you don't already use tricks like MIPS I early eval of branch conditions, your branch latency would be 2 cycles (IF to EX) for conditional branches. Static always-taken prediction would shorten that to 1 cycle (IF to ID). Not 0, as you say, because the not-taken path is still being fetched while the branch instruction itself is being decoded.
i.e. you could design the ID stage to resteer fetch for next cycle when it sees a conditional branch. (Possibly after checking the displacement for forwards / backwards, i.e. just the high bit of a 2's complement value.)
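In software terms that resteer decision is tiny. Here's a minimal Python sketch of the backward-taken / forward-not-taken rule (names are mine, purely illustrative):

# Static BTFNT prediction from a sign-extended branch displacement.
# In hardware this is just the high bit of the 2's complement field:
# negative => backward branch (likely a loop) => predict taken.
def btfnt_predict_taken(displacement: int) -> bool:
    return displacement < 0

assert btfnt_predict_taken(-16)      # backward loop branch: predict taken
assert not btfnt_predict_taken(32)   # forward if() skip: predict not-taken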
So you optimize for fall-through of forward branches and looping backward branches because those are relatively common. To do even better you'd use a cache of dynamic predictions that you index by address, or in various complex ways (e.g. TAGE uses recent branch history as part of the index, and see https://danluu.com/branch-prediction/ for historical progress from very simple to less simple predictors).
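For a flavor of what "a cache of dynamic predictions that you index by address" can look like, here's a toy bimodal predictor with 2-bit saturating counters in Python (a sketch only; the table size and indexing are arbitrary choices, and this is nowhere near TAGE):

# Table of 2-bit saturating counters indexed by low PC bits.
class BimodalPredictor:
    def __init__(self, entries=1024):
        self.entries = entries
        self.counters = [1] * entries  # start weakly not-taken (range 0..3)

    def _index(self, pc):
        return (pc >> 2) % self.entries  # drop byte-offset bits, wrap

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2  # 2 or 3 => predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)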
I was going through the section on pipelining in the text Computer Organization [5e] by Hamacher et al. There I came across a situation which the authors claim causes a data hazard.
The situation is shown below:
For example, stage E in the four-stage pipeline of Figure 8.2b is responsible for arithmetic and logic operations, and one clock cycle is assigned for this task. Although this may be sufficient for most operations, some operations, such as divide, may require more time to complete. Figure 8.3 shows an example in which the operation specified in instruction I2 requires three cycles to complete, from cycle 4 through cycle 6. Thus, in cycles 5 and 6, the Write stage must be told to do nothing, because it has no data to work with. †: Meanwhile, the information in buffer B2 must remain intact until the Execute stage has completed its operation. This means that stage 2 and, in turn, stage 1 are blocked from accepting new instructions because the information in B1 cannot be overwritten. Thus, steps D4 and F5 must be postponed as shown.
... Any condition that causes the pipeline to stall is called a hazard. We have just seen an example of a data hazard. A data hazard is any condition in which either the source or the destination operands of an instruction are not available at the time expected in the pipeline.
In the example above, the authors assume that a data hazard has occurred, and two stall cycles are introduced into the pipeline. The main reason they give for this data hazard is that, since the execute phase of instruction I2 requires 2 more cycles than usual, the data on which the write-back stage should work is delayed by 2 cycles...
But I am having a little difficulty accepting this analysis. Books usually give examples of data hazards in situations where there is a data dependency (the usual RAW, WAR, etc.). But here there is no such thing. I thought this was a structural hazard, assuming that I2 cannot use the EX stage because I1 is using it.
Moreover, the text assumes that there is no queuing of stage results in the buffers. This is clear from the statement marked with †, Meanwhile, the information in the buffer... (which contains a little flaw of its own: if there is no queuing, then the output of D3 in cycle 4 would overwrite the value in buffer B2 on which the EX stage is working, contradicting their own assumption).
I thought that the stalls are introduced due to this no-queuing condition and the structural hazard, and that if things are properly managed as shown below, no stalls would occur.
This is what I assume:
I assume that the execute stage has more than one separate functional unit (e.g. one where the calculations of instruction 1 are performed [a basic ALU requiring 1 cycle], one for integer division, another for integer multiplication, etc.). [So the structural hazard is out of the way now.]
I also assume that the pipeline buffers can store the results produced in the stages in a queue. [So the problem in the statement marked with † is no longer there.]
This being said, the situation is now as follows:
However hard I tried with these assumptions, I could not remove the bubbles shown in blue. [Even if queuing is assumed in the buffers, the buffers cannot give results out of order, so those stalls remain.]
With this exercise of mine, I feel that the example shown in the text is indeed a hazard, and a data hazard at that (even though there are no data dependencies?), since in my exercise there was no chance of a structural hazard...
Am I correct?
And I thought this to be a structural hazard assuming that I2 cannot use the EX stage as I1 is using it.
Yup, that's the terminology I'd use, based on wikipedia: https://en.wikipedia.org/wiki/Hazard_(computer_architecture).
That article restricts data hazards to only RAW, WAR, and WAW. As such, they're only visible when you consider which operands are being used.
e.g. an independent multiply (result not read by the next few insns) could be allowed to complete out of order, after executing in a separate multi-cycle or pipelined multiplier unit.
Write-back conflicts would be a problem if the slow ALU instruction needed to write a GPR in the same cycle as a later add or something. Also data hazards like WAW, since mul r3, r2, r1 / sw r3, (r4) / add r3, r2, r1 should leave r3 = r2+r1 not r2*r1.
MIPS solved all that with the special hi:lo reg pair for mult/div, allowing the mul and div units to be loosely coupled to the 5-stage pipeline. And the ISA had pretty relaxed rules about what was allowed to happen to those registers, e.g. writing one with mthi r3 destroys the previous value of the other, so mflo r2 would give unpredictable results after mthi. Raymond Chen's article.
An "in-order pipeline" means instructions start execution in program order, no necessarily that they complete in program order. It's very common for modern in-order pipelines to allow memory operations to complete out of order, allowing memory-level parallelism and allowing instruction scheduling to hide load-use latency of L1d cache hits. It's also possible to pipeline higher-latency ALU operations as long as hazards are detected and handled somehow.
Do these authors use the term "structural hazard" at all, or do they consider all (non-control?) hazards to be data hazards?
At this point it seems like primarily a terminology issue. IDK if they're on their own in using terminology this way, or if there is another convention with any popularity other than the one Wikipedia describes.
Separate from your main question: in clock cycles 4 and 5, you have two instructions in the E stage at the same time. If something stalls in the E stage, the stall bubbles need to come before the E stage in later instructions, like in the Fig 8.3 image you linked from the book.
And yeah, it's weird that they talk about the pipeline register between stages needing to stay constant. If a multi-cycle non-pipelined execution unit needs to keep values around, it could snapshot them.
Unless maybe the stall signal makes the Decode stage keep generating that output repeatedly until the stall signal is de-asserted and the pipeline register will finally latch the output of the previous stage instead of ignoring it. There are latches / flip-flops that have a control signal separate from the clock that makes them ignore their input and keep outputting what they were already outputting.
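For intuition, a rough Python model of such a write-enabled pipeline register (illustrative only, not taken from the book):

class PipelineReg:
    """Latches its input each clock unless stalled, in which case it
    ignores the input and keeps presenting the previously latched value."""
    def __init__(self, bubble=None):
        self.value = bubble

    def clock(self, next_value, stall):
        if not stall:             # write-enable de-asserted only when stalling
            self.value = next_value
        return self.value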
I am going through the textbook Computer Organization and Design and I am a bit confused about branch prediction and how it works in a 5-stage pipeline scenario - IF ID EX MEM WB.
Consider the following sequence of instructions:
TOP: SUB X2, X2, X3
.
.
B.NE TOP
ADD X1, X1, X2
Assume the first case with no branch prediction and all possible forwarding paths. As per the textbook, when the branch to TOP is taken, the processor incurs a penalty of 1 stall. This is because after the B.NE instruction, the next instruction in the pipeline is the ADD instruction when it should really have been the SUB instruction. The processor realizes it inserted the incorrect instruction into the pipeline only at the end of the ID stage of the B.NE instruction, and hence has to turn all the remaining stages of ADD into nops (by the end of the ID stage it also manages to calculate the correct address to fetch from). So the pipeline in this case looks something like this:
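B.NE:  IF ID EX ME WB
ADD:      IF --
SUB:         IF ID EX ME WB
cycle   1  2  3  4  5  6  7

(One bubble: the ADD fetched in cycle 2 is killed at the end of B.NE's ID, and the SUB at TOP is fetched in cycle 3.)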
However if the branch was not taken, there would have been no stalls. Because the next instruction would correctly have been the ADD instruction and the execution would have proceeded normally.
Now consider the same instructions and the same processor, but with perfect branch prediction. Assume the branch is taken. The processor would know that the instruction is a branch only during the ID stage of the B.NE instruction, and the branch prediction would kick in only after that. By that time, the ADD instruction is already in the pipeline. Hence there would still be a penalty of 1 stall. So what is the advantage of even having branch prediction? I am clearly missing something.
So I think I am confused with where exactly in the pipeline does the Branch Prediction kick in?
Let's suppose that 20 percent of the instructions in a program are branch instructions. The static prediction of the jumps assumes that the jumps don't happen.
I should find the execution time in two cases: when 30 percent of the branches are taken and when 70 percent of the branches are taken.
I also need to find the speedup of one case compared to the other and express it as a percentage.
Thing is, how do I find the execution time here? I usually find the execution time when the pipeline is separated into different phases and the time for each phase is given...
Edit: This is NOT homework. I found this in my computer architecture textbook and it's not a familiar exercise.
This question sounds like homework but the matter is worth some discussion.
We assume to have a static branch predictor that always predicts NOT TAKEN. This was the type of branch predictor of early SPARC and MIPS implementations. Such a branch predictor always fetches the next sequential instruction in the program.
Let me also assume that we have a simplified 4 stage pipeline made of Fetch (F), Decode (D), Execute (E) and Write Back (W). Consider the following simplified assembly program:
...
0xF1: JUMP <condition>, 0xF4
0xF2: ADD r1, r2, r3
0xF3: ADD r3, r4, r1
0xF4: ADD r1, r2, r3
When a branch is correctly predicted the pipeline behaves normally. The question is what happens to the pipeline when a branch is mis-predicted, which in our case means the condition of the JUMP instruction (0xF1) is verified.
0xF1:  F  D  E  W
0xF2:     F  D  X
0xF3:        F  X
0xF4:           F
cycle  1  2  3  4
In the Execute stage of the JUMP instruction we evaluate the condition and detect that the branch has to be taken. Due to the branch predictor policy, however, we have already fetched instructions 0xF2 and 0xF3 and decoded 0xF2. The pipeline is flushed and at the next clock cycle the branch target is correctly fetched. As you can see from the pipeline, we wasted 2 clock cycles fetching and decoding instructions that will not be executed. These 2 clock cycles are known as the branch penalty, and you must take them into account when calculating the program's execution time.
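Applying this to the numbers in the question above (assuming, as textbooks usually do, a base CPI of 1 and the 2-cycle taken-branch penalty of this pipeline), a quick Python back-of-the-envelope:

base_cpi = 1.0      # assumed: 1 cycle per instruction when nothing stalls
branch_frac = 0.20  # 20% of instructions are branches
penalty = 2         # cycles lost per taken (i.e. mispredicted) branch

def cpi(taken_frac):
    # predict NOT TAKEN => every taken branch pays the full penalty
    return base_cpi + branch_frac * taken_frac * penalty

cpi_30, cpi_70 = cpi(0.30), cpi(0.70)   # 1.12 and 1.28 cycles/instruction
speedup = cpi_70 / cpi_30               # ~1.143

Since execution time = instruction count x CPI x clock period, and the program and the clock are the same in both cases, the ratio of CPIs is the speedup: the 30%-taken case is about 14.3% faster than the 70%-taken case.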
The world of branch predictors is much more complex in reality. More elaborate static branch predictors exist that, for instance, always predict backward jumps as TAKEN (they are usually loop branches) and forward ones as NOT TAKEN. To reduce the branch penalty cycles, processors often employ a Branch Target Buffer (BTB), a small cache that stores the targets of recently executed JUMP instructions. Without a BTB, to predict a branch as TAKEN we have to wait until the Decode stage, where the instruction is identified as a JUMP and the target address is decoded. In the meantime we have fetched an instruction that will then be flushed. With a BTB, on the other hand, we can do branch prediction in the Fetch stage: if the Program Counter is in the BTB we know that
The fetched instruction is a branch
We have its target address
So we can predict the branch, and if it is predicted as TAKEN we can fetch its target without any penalty.
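A toy model of that lookup in Python (a sketch, not a real BTB organization; no tags, aliasing, or capacity handling):

class BranchTargetBuffer:
    """Maps a branch's PC to its last-known target address."""
    def __init__(self):
        self.table = {}

    def lookup(self, pc):
        # A hit tells the Fetch stage both "this is a branch" and where it
        # goes, so a predicted-taken branch can redirect fetch with no bubble.
        return self.table.get(pc)

    def update(self, pc, target):
        self.table[pc] = target  # filled in when a taken branch executes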
Modern processors also adopt dynamic branch predictors that use complex policies as well as some additional buffers to avoid mis-predictions.
I want to model an HTTP interaction, i.e. a sequence of HTTPRequest/HTTPResponse, and I am trying to model this as a transition system.
I defined an ordering on the State signature by using:
open util/ordering[State]
where a State is simply a set of Messages:
sig State {
msgSet: set Message
}
Each pair of (HTTPRequest->HTTPResponse) and (HTTPResponse->HTTPRequest) is represented as a rule in my transition system.
The rules are expressed in Alloy as predicates that let one move from one state to another.
E.g., this is a rule generating an HTTPResponse after a particular HTTPRequest is received:
pred rsp1 [s, s': State] {
  one msg: Request, msg': Response | (
    // Preconditions (previous Request)
    msg.method = get &&
    msg.address.url = sample_com &&
    // Postconditions (next Response)
    msg'.status = OK_200 &&
    // previous Request has to be in previous state
    msg in s.msgSet &&
    // Response generated is added to next state
    s'.msgSet = s.msgSet + msg'
  )
}
Unfortunately, the model created seems to be too complex: we have a dozen rules (more complex than the one above but following the same pattern) and execution is very slow.
EDIT: In particular, the CNF generation is extremely slow, while the solving takes a reasonable amount of time.
Do you have any suggestion on how to model a similar transition system?
Thank you very much!
This is a model with an impressive level of detail; thank you for sharing it!
None of the various forms of honestAction by itself takes more than two or three minutes to find an instance (or in some cases to fail to find any instance), except for rsp8, which takes quite a while by itself (it ran for fifteen minutes or so before I stopped it).
So the long CNF preparation times you are observing are apparently caused by (a) predicate rsp8 alone, (b) the size of the disjunction in the honestAction predicate, or (c) both.
I suspect but have not proved that the time issue is caused by combinatorial explosion in the number of individuals required to populate a model and the number of constraints in the model.
My first instinct (it's not more than that) would be to cut back on the level of detail in the model, in particular the large number of singleton signatures which instantiate your abstract signatures. These seem (I could be wrong) to be present either for bookkeeping purposes (so you can identify which rule licenses the transition from one state to another), or because the modeler doesn't trust Alloy to generate concrete instances of signatures like UserName, Password, Code, etc.
As the model now is, it looks as if you're doing a lot of work to define all the individuals involved in a particular example, instead of defining constraints and letting Alloy do the work of finding examples. (Using Alloy to check the properties a particular concrete example can be useful, but there are other ways to do that.)
Since so many of the concrete signatures in the model are constrained to singleton cardinality, I don't actually know that defining them makes the task of finding models more complex; for all I know, it makes it simpler. But my instinct is to think that it would be more useful to know (as well as possibly easier for Alloy to establish) that state transitions have a particular property in general, no matter what hosts, users, and URIs are involved, than to know that property rsp1 applies in all the cases where the host is named examplecom and the address URI is example_url_https and whatnot.
I conjecture that reducing the number of individuals whose existence and properties are prescribed, and the constraints on which individuals can be involved in which state transitions, will reduce the CNF generation time.
If your long-term goal is to test long sequences of state transitions to test whether from a given starting point it's possible or impossible to arrive at a particular state (or kind of state), you may need to re-think the approach to enable shorter sequences of state transitions to do the job.
A second conjecture would involve less restructuring of the model. For reasons I don't think I understand fully, sometimes quantification with one seems to hurt rather than help performance, as in this example, where explicitly quantifying some variables with some instead of one turned out to make a problem tractable instead of intractable.
That question involves quantification in a predicate, not in the model overall, and the quantification with one wasn't intended in the first place, so it may not be relevant here. But we can test the effect of the one keyword on this model in a simple way: I commented out everything in honestAction except rsp8 and ran the predicate first != last in a scope of 8, once with most of the occurrences of one commented out and once with those keywords intact. With the one keywords commented out, the Analyser ran the problem in 24 seconds or so; with the one keywords in place, it ran for 500 seconds so far before I decided the point was made and terminated it.
So I'd try removing the keyword one from all of the signatures with instance-specific individuals, leaving it only on get, post, OK_200, etc., and appData. I would also try doing without the various subtypes of Key, SessionID, URL, Host, UserName, and Password, or at least constraining their cardinality in the run command.
I have only spent a few hours on pipeline theory, so perhaps this is an easy question, but I really need your help.
I know that we should store mem[pc] into the IF/ID pipeline register in the fetch stage because we will decode it in the next stage, and that we should update the PC in the fetch stage because we will fetch the next instruction via that updated PC in the next cycle. But I really don't understand why we should also store NPC into the pipeline register.
Below is an explanation from Computer Organization and Design; I don't get it.
This incremented address is also saved in the IF/ID pipeline register in case it is needed later for an instruction, such as beq
The reason for saving NPC in the pipeline is because sometimes the next instruction in the pipeline will want to use it.
Look at the definition of beq. It has to compute the target address of the branch. Some branches use a fixed location for the target address, like "branch to address A." This is called "branching to an absolute address."
Another kind of branch is a "relative" branch, where the branch target is not an absolute address but an offset, that is, "branch forward X instructions." (If X is negative, this ends up being a backwards branch.) Now consider this: forwards/backwards from where? From NPC. That is, for a relative branch instruction, the computation for the new PC value is:
NewPC = NPC + X
Why do architectures include the ability to perform relative branches? Because it takes less space. Let's say that X has a small value, like 16. The storage required for an absolute branch to a target address is:
sizeof(branch opcode) + sizeof(address)
But the storage for a relative branch of offset 16 is only:
sizeof(branch opcode) + 1 ## number of bytes needed to hold the value 16!
Of course, larger offsets can be accommodated by increasing the number of bytes used to hold the offset value. Other kinds of space-saving, range-increasing representations are possible too.
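For concreteness: in MIPS, beq carries a 16-bit signed word offset relative to NPC (PC+4), which is exactly why NPC is saved in IF/ID. A small Python sketch of the target calculation:

def sign_extend16(x):
    return x - 0x10000 if x & 0x8000 else x

def beq_target(pc, imm16):
    npc = pc + 4                              # the NPC saved in IF/ID
    return npc + (sign_extend16(imm16) << 2)  # word offset -> byte offset

assert beq_target(0x1000, 0x0003) == 0x1010   # forward 3 instructions
assert beq_target(0x1000, 0xFFFF) == 0x1000   # offset -1 targets the branch itself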
If the exception point is in a branch-delay slot, then one needs two PCs to restart execution: one that
points at the exceptional instruction (delay slot) and another that points at the next instruction. The
second PC is needed because the instruction following the delay slot could be either the next
sequential instruction (if the branch was not taken) or the branch target (if the branch was taken).
Although MIPS has the same issue, it relies on software to back up the exception point to the previous
instruction (when it is a branch) before restarting execution; this works because branches are
idempotent.
Credits: http://www.cs.berkeley.edu/~kubitron/courses/cs252-S09/handouts/oldquiz/sp09-quiz1_soln.pdf