I0: slli $s2, $s1, 4
I1: beq $s1, $zero, top
I2: addi $s3, $s2, 6
I3: mult $t2, $s3, $s1
I4: addi $s4, $s2, 8
I5: sw $t2, 0($s4)
Consider a pipeline without any hazard handling. The pipeline is the typical 5-stage IF, ID, EX, MEM, WB MIPS design. For the above code, complete the pipeline diagram, inserting the characters IF (instruction fetch), ID (instruction decode), EX (execute), M (memory), WB (write back) for each instruction in the boxes. Do you guys think my chart is correct?
Thanks! http://imgur.com/PbJ2egd
First let's plot out which instructions rely on which outputs of preceding instructions:
I0: Relies on nothing here
I1: Relies on nothing here, but is a branch
I2: Relies on I0 ($s2)
I3: Relies on I2 ($s3)
I4: Relies on I0 ($s2)
I5: Relies on I4 ($s4)
So when an instruction relies on another, like I5 on I4, its EX block cannot run until the instruction it is relying on finishes its WB block. In the case of I5, we can see this clearly, since the EX block only starts once I4's WB block is done.
Also note that a branch prevents the next instruction from starting at all until the branch's EX block has finished.
With these two rules we can go instruction by instruction and plot it out:
I0: Relies on no outputs.
I1: Relies on no outputs, but note that it is a branch. The next instruction cannot start till EX finishes.
I2: Relies on I0's output, so wait for I0's WB, but also for I1's EX because I1 was a branch. I1's EX is the worst case, so wait till then. This stalls 2 blocks.
I3: Relies on I2's output, so wait for I2's WB. This stalls 2 blocks. (We've now stalled 4 total)
I4: Relies on I0's output, so wait for I0's WB. This stalls 0 blocks, because I0 long since has completed. (We've now stalled 4 total)
I5: Relies on I4's output, so wait for I4's WB. This stalls 2 blocks. (We've now stalled 6 total)
So in the end we stall 3 times, each for two blocks. This equals the 6 "x"s your teacher drew.
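Here is one way the completed chart can look under those two rules (I'm reading the cycle placement off the rules rather than your image, so your class's conventions may shift things slightly, but the six stall bubbles land the same way):
Cycle:  1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16
I0:     IF  ID  EX  M   WB
I1:         IF  ID  EX  M   WB
I2:             x   x   IF  ID  EX  M   WB
I3:                     x   x   IF  ID  EX  M   WB
I4:                                 IF  ID  EX  M   WB
I5:                                     x   x   IF  ID  EX  M   WB
That works out to 16 cycles for the six instructions under this reading.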
This is the correct answer she gave us at yesterday's review session. Can you point me in the right direction on how to answer this kind of question? How do you know when to delay and where to put the X's? I'm sorry I can't give exact details of the answer, but I do know this is the correct one.
I am trying to write a MIPS program that takes in user input (integer n) and then prints all the descending numbers until 1 and then all the ascending numbers up to n.
Basically, if I input 3, the output would be: 3 2 1 1 2 3
In C#, the code would be:
using System;

public static class LearnRecursion
{
    public static void Main()
    {
        int n;
        Console.Write("Enter an integer: ");
        n = Convert.ToInt32(Console.ReadLine());
        RDemo(n);
    }

    public static void RDemo(int n)
    {
        if (n < 1)
        {
            return;
        }
        else
        {
            Console.Write("{0} ", n);
            RDemo(n - 1);
            Console.Write("{0} ", n);
            return;
        }
    }
}
I have tried implementing the MIPS program like this:
.data
### Declare appropriate strings and the space character for I/O prompts ###
input: .asciiz "Enter an integer : "
space: .asciiz " "
newline: .asciiz "\n"
.text
main:
### call procedure for printing the user prompt ###
li $v0, 4
la $a0, input
syscall
#read input from user
li $v0, 5
syscall
move $s0, $v0 #store the user input into saved register
move $a0, $s0 #move saved user input integer as argument for RDemo
jal RDemo
j exit
# recursive RDemo method
# expects integer argument (from user input) in $a0
#returns when n<1
RDemo:
#make space for 3 registers on the stack
addi $sp, $sp, -12
sw $ra, 0($sp) #return address
sw $s0, 4($sp) #saved register (original n)
sw $a0, 8($sp) #argument (user parsed n)
#base case: n < 1 return
blez $a0, RDemoReturn
#print n
li $v0, 1
move $a0, $a0
syscall
la $a0, space
li $v0, 4
syscall
#call RDemo with n-1
addi $a0, $a0, -1
jal RDemo
li $v0, 1
move $a0, $s0 #$s0 or $a0 ?
syscall
la $a0, space
li $v0, 4
syscall
RDemoReturn:
lw $ra, 0($sp)
lw $s0, 4($sp)
lw $a0, 8($sp)
addi $sp, $sp, 12
jr $ra
exit:
li $v0, 10
syscall
It ends up in an endless loop, printing only the original integer n and then a bunch of numbers that look like addresses, e.g. 24567.
Does anyone know what is wrong with my program?
That code makes several calling-convention mistakes. In particular, they come down to who owns which register, and when.
During single-step debugging you'll notice these mistakes as bad values in registers, but perhaps not see what the right approach should be.
The function prologue saves away $s0, $a0 and $ra. One of $s0 and $a0 is unnecessary, but neither is being used properly.
public static void RDemo(int n)
{
    ...
    syscall to print n;
    syscall to print space;
    RDemo(n - 1);
    syscall to print n;
    syscall to print space;
}
Let's analyze n, which arrives at the top of RDemo in $a0, as per the calling convention. That register is fine to use for the first syscall, which prints the number n. However, as #Jim points out above, printing the space repurposes $a0 as the pointer to the text to print, and in doing so wipes out $a0, so the value of n is effectively lost, at least from $a0.
n needs to survive the printing of the space. Note that the function prologue did store n (from $a0) on the stack, and that copy survives the syscall that prints the space. So the simplest fix is to reload n from that stack location, after printing the space and before computing n-1 for the recursive call:
lw $a0, 8($sp)
You'll find that the recursive call also clobbers $a0 (by passing n-1 in $a0), so you'll need the same lw before using n again after the recursive call. (It might be tempting to add 1 back to $a0 to restore its original value, but that would require the callee to preserve its argument register, and that is not part of the calling convention; relying on it would be a violation of the convention.)
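To make that concrete, here is a sketch of just the middle of your RDemo under this first scenario, keeping your existing 12-byte frame (so this invocation's n is still at 8($sp)):
#print n ($a0 still holds n at this point)
li $v0, 1
syscall
#print space (this clobbers $a0)
la $a0, space
li $v0, 4
syscall
#call RDemo with n-1
lw $a0, 8($sp) #reload n; the space print overwrote $a0
addi $a0, $a0, -1
jal RDemo
#print n again on the way back up
lw $a0, 8($sp) #reload n; the recursive call clobbered $a0 as well
li $v0, 1
syscall
#print space
la $a0, space
li $v0, 4
syscall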
In this scenario, $s0 is wholly unnecessary and goes unused, so there's no need to save and restore it in the prologue/epilogue.
Also, while it is necessary to save $a0 in the prologue, it is unnecessary to restore it in the epilogue. $a0 is RDemo's parameter, which, by the calling convention, belongs to this invocation of RDemo and not to its caller. Restoring a register on exit is a service to the caller, but parameters are given to the callee to do with as it pleases, and callers should have no expectation of getting back the value they passed in. That is exactly the definition of the scratch register set, a.k.a. call-clobbered (and sometimes poorly labeled "caller saves").
As an alternative scenario, we can use a call-preserved register, e.g. $s0, to keep n live across the syscalls and other calls. In this case, n, found in $a0 on function entry, should be copied into $s0 as part of the prologue (after saving $s0 to the stack, so $s0's original value can be restored on exit).
From there on, whenever the code needs n, it can be found in $s0 (because you put it there). So, if/when you want it back in $a0, for example, then use move $a0, $s0.
The prologue in this scenario should save the incoming $s0 and the epilogue should restore it, but there's no need to save (or restore) $a0.
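A minimal sketch of that second scenario, reusing your labels (RDemo, RDemoReturn, space) and the same syscalls; only $ra and the caller's $s0 need stack slots now:
RDemo:
addi $sp, $sp, -8
sw $ra, 0($sp) #return address
sw $s0, 4($sp) #caller's $s0, restored on exit
move $s0, $a0 #keep n in a call-preserved register
#base case: n < 1, nothing to print
blez $s0, RDemoReturn
#print n
li $v0, 1
move $a0, $s0
syscall
#print space
li $v0, 4
la $a0, space
syscall
#call RDemo with n-1
addi $a0, $s0, -1
jal RDemo
#print n again on the way back up; n is still in $s0
li $v0, 1
move $a0, $s0
syscall
#print space
li $v0, 4
la $a0, space
syscall
RDemoReturn:
lw $ra, 0($sp)
lw $s0, 4($sp)
addi $sp, $sp, 8
jr $ra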
I'd like to point out that all of the above applies whether a call is recursive or simply calls another routine, say RDemo calls RDemo2, which calls RDemo3, and so on. Recursion may be harder to follow, but we apply the same rules (the rules of the calling convention) as with any function call; in other words, recursion adds no additional rules.
Of course, when writing small toy programs we can always invent our own calling convention. Compilers will sometimes do this when they know the internal details of callers and callees. However, if your goal is to learn about (1) a standard calling convention and (2) the difference between call-preserved and call-clobbered registers, then follow the logic and analysis above.
The analysis we need to properly allocate variables and temporaries to the right kind of registers (call-clobbered vs. call-preserved) is a form of live-variable analysis.
In particular, we're looking to see if a variable is live across a function call. More technically, this analysis matches definitions of a variable (assignments to it) with uses of that variable (reads of its value); if some use comes after a call while a reaching definition comes before the call, then that variable is live across a (function) call.
When a variable is not live across a function call then we can freely use the scratch/call-clobbered registers, and they are preferred for such variables since they have less prologue and epilogue overhead than call-preserved registers.
However, if a variable is live across a call, then it has special requirements, namely that its value needs to be in function-call-preserved storage, which is either in a call-preserved register, or local stack memory. Making a good choice between a call-preserved register and local stack memory has to do with the actual usage of that variable.
A call-preserved register adds overhead in both the prologue and the epilogue, since its original value must be returned to the caller; using local stack space has, at a minimum, the overhead of initialization.
Often, when a variable is used within looping statements that involve function calls, a call-preserved register is a better choice than the stack, and vice versa: when the variable is not used within looping statements, local stack memory is better.
A complete analysis depends on circumstances that vary with the actual code involved and the particular dynamic workload, so we can count instructions and stalls (as a static and/or dynamic approximation) to compare the two approaches for a given variable. Compilers do this type of analysis to make their storage choices.
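To illustrate with a made-up fragment (the function and helper names here are mine, not from your code): a value that dies before the call can sit in a scratch register, while a value needed after the call wants call-preserved storage.
# hypothetical f(a) = helper(a*2) + a
f:
addi $sp, $sp, -8
sw $ra, 0($sp)
sw $s0, 4($sp) #prologue cost of using a call-preserved register
move $s0, $a0 #a is live across the call, so keep it in $s0
sll $t0, $a0, 1 #a*2 is dead once passed, so a scratch $t register is fine
move $a0, $t0
jal helper #may clobber every $t, $a and $v register
add $v0, $v0, $s0 #a is still intact in $s0 after the call
lw $ra, 0($sp)
lw $s0, 4($sp) #epilogue cost: give the caller back its $s0
addi $sp, $sp, 8
jr $ra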
I'm trying to learn assembly on my own, but I'm confused on how to write a recursive function that calls itself more than once in the return statement.
This is the function in C:
#include <stdio.h>

int recursion(int x);

int main()
{
    int a;
    a = recursion(5);
    printf("%d", a);
    return 0;
}

int recursion(int x) {
    if (x > 0) {
        return x + recursion(x - 1) + recursion(x - 2);
    }
    else {
        return 0;
    }
}
This is what I've gotten up to so far:
.text
main:
li $v0, 5 #Read in an int
syscall
move $a0, $v0 #Move the int to argument
jal Rec #Call Recursion function
move $a0, $v0 #Print the value
li $v0, 1
syscall
li $v0, 10
syscall
Rec: subu $sp, $sp, 8
sw $ra, 0($sp)
sw $s0, 4($sp)
sw $s1, 8($sp)
Done: lw $ra, 4($sp)
lw $s0, ($sp)
addu $sp, $sp, 8
jr $ra
I haven't written the recursive part of the function because I'm so lost on how to do it. Can anyone write out the recursive part so I have an understanding of how to solve it? I also want to clarify, this isn't a school project or anything like that. I am just trying to understand how to do recursion in MIPS Assembly so I made my own function.
You know you can transform it into this, right?
int tmp1 = recursion(x-1);
int tmp2 = recursion(x-2);
return x + tmp1 + tmp2;
If it was literally having multiple recursive calls "in the return statement" that was confusing for you, does that help?
The first temporary you invent has to be saved somewhere (e.g. a call-preserved register) across the 2nd function call as part of evaluating that expression. Just like any time you need some data to survive across a function call.
The way a compiler would do it is to save/restore a couple call-preserved registers like $s0 and $s1 at the start/end of the function, and use them within the function for x and that temporary.
Or optimize that and only save x + recursion(x-1) in a single register, so you only need that and the return value after the 2nd function call returns.
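If you want a concrete starting point, here is a sketch of that single-saved-register version, building on the Rec label from your code (RecZero is just a name I picked for the base case):
Rec:
blez $a0, RecZero #base case: x <= 0 returns 0, nothing to save
addi $sp, $sp, -8
sw $ra, 0($sp)
sw $s0, 4($sp) #one call-preserved register is enough
move $s0, $a0 #keep x live across the first call
addi $a0, $a0, -1
jal Rec #v0 = recursion(x-1)
add $v0, $v0, $s0 #v0 = x + recursion(x-1)
addi $a0, $s0, -2 #second argument is x-2 (x is still in $s0)
move $s0, $v0 #keep the partial sum live across the second call
jal Rec #v0 = recursion(x-2)
add $v0, $v0, $s0 #v0 = x + recursion(x-1) + recursion(x-2)
lw $ra, 0($sp)
lw $s0, 4($sp)
addi $sp, $sp, 8
jr $ra
RecZero:
li $v0, 0
jr $ra
Single-stepping this for a small x like 3 is a good way to watch $s0 being saved and restored at each depth.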
Of course an optimizing compiler would turn some of this recursion into a loop, and not actually generate assembly that recursed as much. By hand you could even simplify it down to a modified Fibonacci loop with O(n) runtime instead of O(Fib(n)), just keeping the last two sequence values in registers. That's how to implement this function efficiently, but wouldn't teach you about recursion. Unfortunately this function is an example of a case where recursion is inconvenient and the worst way to implement this calculation.
(I mostly mention this because you asked how I would write this code in asm. I'd write asm that had the same observable results as the C, applying the as-if rule like the C standard allows compilers to do. Being recursive doesn't count as an observable result in ISO C. Obviously that's not what you should actually do for an assignment or to learn about recursion.)
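For what it's worth, here is a sketch of that loop version for the same definition f(x) = x + f(x-1) + f(x-2), with f(x) = 0 for x <= 0 (RecLoop, loop and done are names I made up; bgt is a pseudo-instruction accepted by MARS/SPIM; it's a leaf function, so nothing needs saving on the stack):
RecLoop:
li $v0, 0 #prev1 = f(0) = 0, also the result if x <= 0
li $t0, 0 #prev2 = f(-1) = 0
li $t1, 1 #i = 1
loop:
bgt $t1, $a0, done #while i <= x
add $t2, $t1, $v0 #cur = i + prev1
add $t2, $t2, $t0 #      + prev2
move $t0, $v0 #prev2 = prev1
move $v0, $t2 #prev1 = cur
addi $t1, $t1, 1 #i++
j loop
done:
jr $ra #result in $v0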
We run the above code in a 4-stage IN-ORDER pipeline with F, D, X, W stages, where X takes 4 pipelined cycles for ADD and 6 pipelined cycles for MUL. Assume NO forwarding (bypassing), i.e., we need to stall on every data dependency. How many cycles will the code take to execute?
The code and my answer are attached in the following picture. I think I should use Excel because it looks more organized.
The answer choices are 27, 28, 29, and 30. I got 27. Is that right? What do you get?
It should take 28 cycles to execute. ADD uses 4 pipelined cycles and MUL uses 6 pipelined cycles; there are 4 ADD instructions and 2 MUL instructions, which gives 4*4 + 2*6 = 28 cycles.
I am working my way though exercises relating to superscalar architecture. I need some help conceptualizing the answer to this question:
“If you ever get confused about what a register renamer has to do, go back to the assembly code you're executing, and ask yourself what has to happen for the right result to be obtained. For example, consider a three-way superscalar machine renaming these three instructions concurrently:
ADDI R1, R1, R1
ADDI R1, R1, R1
ADDI R1, R1, R1
If the value of R1 starts out as 5, what should its value be when this sequence has executed?”
I can look at that and see that, OK, the final value of R1 should be 40. How would a three-way superscalar machine reach this answer, though? If I understand them correctly, in this three-way superscalar pipeline these three instructions would be fetched in parallel. Meaning, you would have a hazard right from the start, right? How should I conceptualize the answer to this problem?
EDIT 1: When decoding these instructions, the three-way superscalar machine would, by necessity, have to perform register renaming to get something like the following instruction sequence, correct?
ADDI R1, R2, R3
ADDI R4, R5, R6
ADDI R1, R2, R3
Simply put: you won't be able to perform these instructions together. However, the point of this example doesn't seem to be hazards (namely, detecting that these instructions are interdependent and must be performed serially with sufficient stalls); it's about renaming. It serves to show that a single logical register (R1) will have multiple physical "versions" in flight simultaneously in the pipeline. The original one holds the value 5 (let's call it "p1"), but you'll also need to allocate one for the result of the first ADD ("p2"), to be used as the source for the second, and again for the results of the second and third ADD instructions ("p3" and "p4").
Since this processor decodes and attempts to issue these 3 instructions simultaneously, you can see that you can't just keep R1 as the source for all of them; that would prevent each of them from using the correct mid-calculation value, so you need to rename them. The important part is that p1..p4, as we've dubbed them, can be allocated simultaneously, and the dependencies are known at the time of issue, long before each of them is populated with its result. This essentially decouples the front end from the execution back end, which is important for performance flexibility in modern CPUs, as you may have bottlenecks anywhere.
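To spell it out with the p1..p4 names from above (the physical register numbers are just labels for illustration), after renaming the machine effectively tracks:
p2 = p1 + p1   # R1: 5 -> 10
p3 = p2 + p2   # 10 -> 20
p4 = p3 + p3   # 20 -> 40
Each instruction sources the physical register written by the one before it, so even though all three are renamed in the same cycle, they still execute one after another, and R1 (now mapped to p4) ends up as 40.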
Me again with another innocuous Z80 question :-) The way my emulator core is currently structured, I am incrementing the lower 7 bits of the memory refresh register every time an opcode byte is fetched from memory. This means that for multi-byte instructions, such as those that begin with DD or FD, I am incrementing the register twice, or in the case of an instruction such as RLC (IX+d), three times (as it is laid out opcode1-opcode2-d-opcode3).
Is this correct? I am unsure; the Z80 manual is a little unclear on this. It says that CPDR (a two-byte instruction) increments it twice, yet the 'Memory Refresh Register' section merely says it is incremented after each instruction fetch. I have noticed that J80 (an emulator I checked, as I'm not sure about this) only increments it after the first opcode byte of an instruction.
Which is correct? I guess it is not hugely important in any case, but it would be nice to know :-) Many thanks.
Regards,
Phil Potter
The Zilog timing diagrams hold the answer to your question.
A refresh occurs during T3 and T4 of all M1 (opcode fetch) cycles.
In the case of single-opcode instructions, that's one refresh per instruction. For single-prefix instructions (prefixes are read using M1 cycles) that's two refreshes per instruction.
For those weird DD-CB-disp-opcode and FD-CB-disp-opcode type instructions (weird because the displacement byte comes before the final opcode rather than after it), the number of refreshes is at least 3 (for the two prefixes and final opcode), but I'm not sure if the displacement byte is read as part of an M1 cycle (which would trigger another refresh) or a normal memory read cycle (no refresh). I'm inclined to believe the displacement byte is read in an M1 cycle for these instructions, but I'm not sure. I asked Sean Young about this; he wasn't sure either. Does anyone know for certain?
UPDATE:
I answered my own question re those weird DD-CB-disp-opcode and FD-CB-disp-opcode instructions. If you check Zilog's documentation for this type of instruction, such as RLC (IX+d), you'll note that the instruction requires 6 M-cycles and 23 T-cycles, broken down as (4, 4, 3, 5, 4, 3).
We know the first two M-cycles are M1 cycles to fetch the DD and CB prefixes (4 T-cycles each). The next M-cycle reads the displacement byte d. But that M-cycle uses only 3 T-cycles, not 4, so it can't be an M1 cycle; instead it's a normal Memory Read cycle.
Here's the breakdown of the RLC (IX+d) instruction's six M-cycles:
M1 cycle to read the 0xDD prefix (4 T-cycles)
M1 cycle to read the 0xCB prefix (4 T-cycles)
Memory Read cycle to read the displacement byte (3 T-cycles)
M1 cycle to fetch the 0x06 opcode and load IX into the ALU (5 T-cycles)
Memory Read cycle to calculate and read from address IX+d (4 T-cycles)
Memory Write cycle to calculate RLC and write the result to address IX+d (3 T-cycles)
(The RLC calculation overlaps M-cycles 5 and 6.)
These instructions are unique in that they're the only Z80 instructions with non-contiguous M1 cycles (M-cycles 1, 2 and 4 above). They're also the slowest!
Paul
Sean Young's Z80 Undocumented Features tells a different story: R is incremented once for unprefixed instructions, twice for a single prefix, also twice for a double prefix (DD CB only), and once for a no-op prefix.
Block instructions of course affect R every time they run (and they run BC times).
I've seen a couple of comments now saying that these weird DDCB and FDCB instructions only increment the R register twice.
It's always been my assumption (and the way I implemented my Z80 emulator) that the R register is incremented at the end of every M1 cycle.
To recap, these weird DDCB and FDCB instructions are four bytes long:
DD CB disp opcode
FD CB disp opcode
It's clear that the two prefix opcodes are read using M1 cycles, causing the R register to be incremented at the end of each of those cycles.
It's also clear that the displacement byte that follows the CB prefix is read by a normal Read cycle, so the R register is not incremented at the end of that cycle.
That leaves the final opcode. If it's read by an M1 cycle, then either the R register is incremented at the end of the cycle, resulting in a total of 3 increments, or the Z80 special cases this M1 cycle and doesn't increment the R register.
There's another possibility. What if the final opcode is read by a normal Read cycle, like the displacement byte that preceded it, and not by an M1 cycle? That of course would also cause the R register to be incremented only twice for these instructions, and wouldn't require the Z80 to make an exception of not incrementing the R register at the end of every M1 cycle.
This might also make better sense in terms of the Z80's internal state. Once it switches to normal Read cycles to read an instruction's additional bytes (in this case the displacement byte following the CB prefix), it never switches back to M1 cycles until it starts the next instruction.
Can anyone test this on real Z80 hardware, to confirm the value of R register following one of these DDCB or FDCB instructions?
All references I can find online say that R is incremented once per instruction irrespective of its length.