Hazards in a three-way superscalar pipeline

I am working my way through exercises relating to superscalar architecture. I need some help conceptualizing the answer to this question:
“If you ever get confused about what a register renamer has to do, go back to the assembly code you're executing, and ask yourself what has to happen for the right result to be obtained. For example, consider a three-way superscalar machine renaming these three instructions concurrently:
ADDI R1, R1, R1
ADDI R1, R1, R1
ADDI R1, R1, R1
If the value of R1 starts out as 5, what should its value be when this sequence has executed?”
I can look at that and see that, ok, the final value of R1 should be 40. How would a three-way superscalar machine reach this answer, though? If I understand them correctly, in this three-way superscalar pipeline these three instructions would be fetched in parallel. Meaning, you would have a hazard right from the start, right? How should I conceptualize the answer to this problem?
EDIT 1: When decoding these instructions, the three-way superscalar machine would, by necessity, have to perform register renaming to get the following instruction sequence, correct?
ADDI R1, R2, R3
ADDI R4, R5, R6
ADDI R1, R2, R3

Simply put, you won't be able to execute these instructions in parallel. However, the point of this example isn't really hazards (that is, detecting that these instructions are interdependent and must be executed serially with sufficient stalls); it's renaming. It shows that a single logical register (R1) will have multiple physical "versions" in flight simultaneously in the pipeline. The original one holds the value 5 (let's call it "p1"), but you'll also need to allocate one for the result of the first ADD ("p2"), to be used as the source for the second, and likewise for the results of the second and third ADD instructions ("p3" and "p4").
Since this processor decodes and attempts to issue these 3 instructions simultaneously, you can't just use R1 as the source for all of them: that would prevent each from reading the correct intermediate value, so you need to rename them. The important part is that p1..p4, as we've dubbed them, can be allocated simultaneously, and the dependencies are known at issue time, long before each register is populated with its result. This essentially decouples the front end from the execution back end, which matters for performance in modern CPUs, as bottlenecks may occur anywhere.
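To make the p1..p4 mapping concrete, here is a minimal Python sketch of what the renamer has to produce for this group of three instructions. The function names (rename_group, execute) and the tuple layout are assumptions made for illustration. Real hardware does the intra-group dependency comparison in parallel; the sequential loop below only models the result it must arrive at.

```python
def rename_group(instructions, map_table, next_free):
    """Rename a group of (op, dst, src1, src2) tuples.

    map_table maps logical registers to physical ones, e.g. {"R1": "p1"}.
    next_free is the number of the first unallocated physical register.
    """
    table = dict(map_table)  # work on a copy of the current mapping
    renamed = []
    for op, dst, src1, src2 in instructions:
        # Sources read the newest mapping, including ones created by
        # older instructions in the same group.
        s1, s2 = table[src1], table[src2]
        # Every destination gets a fresh physical register.
        d = f"p{next_free}"
        next_free += 1
        table[dst] = d
        renamed.append((op, d, s1, s2))
    return renamed, table


def execute(renamed, phys_regs):
    """Execute renamed ADD-style instructions on a physical register file."""
    regs = dict(phys_regs)
    for _op, d, s1, s2 in renamed:
        regs[d] = regs[s1] + regs[s2]
    return regs


group = [("ADDI", "R1", "R1", "R1")] * 3
renamed, final_map = rename_group(group, {"R1": "p1"}, next_free=2)
regs = execute(renamed, {"p1": 5})
final_r1 = regs[final_map["R1"]]
```

After renaming, the group reads p2 = p1 + p1, p3 = p2 + p2, p4 = p3 + p3, and the logical R1 ends up mapped to p4.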

Related

Flipping coin simulation with gain/loss

Suppose you successively toss a fair coin, and each time the result is
heads you win $1, while if you get tails you lose $1. Your initial capital is
$3. The throws stop if your capital drops to zero or reaches $10. Let X_n be the
process that describes your capital at the nth throw.
1. Simulate the X_n process 1000 times and present the graph
of its evolution using R.
2. Estimate the average number of consecutive throws until you stop. Is the result expected?
Can someone help me solve this or at least understand the steps I am supposed to take?
Someone already posted a link to a solution of your homework in the comments. I fear, however, that this uncommented code is incomprehensible to you, given that you asked the question in the first place.
I would therefore suggest that you first write your own implementation with an outer for loop and an inner while loop conditioned on the running capital; call rbinom in each iteration and recompute the running capital. Store the number of throws of each run in a numeric vector and call mean on this vector.
It will get interesting when you measure the runtime of your solution, which will be surprisingly slow. To speed it up, you must use "vectorization", which the linked solution does, but that is a completely different topic, to be left for a different lesson...
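The loop structure suggested above can be sketched as follows. This is Python rather than R, so random.Random stands in for rbinom, and the function names and the seed are assumptions chosen for illustration; plotting is left out. For reference, the expected duration of a fair game started at k with barriers at 0 and N is k(N - k), which is 3 * 7 = 21 here, so the estimate should land close to 21.

```python
import random


def run_once(rng, start=3, target=10):
    """Play one game; return the number of throws until capital hits 0 or target."""
    capital = start
    throws = 0
    while 0 < capital < target:          # inner loop, conditioned on the capital
        capital += 1 if rng.random() < 0.5 else -1
        throws += 1
    return throws, capital


def average_throws(n_runs=1000, seed=1):
    """Outer loop: repeat the game n_runs times and average the run lengths."""
    rng = random.Random(seed)
    lengths = []
    for _ in range(n_runs):
        throws, _final = run_once(rng)
        lengths.append(throws)
    return sum(lengths) / len(lengths)


if __name__ == "__main__":
    print(average_throws())
```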

Pipelining with bypassing

I am trying to understand the concept of bypassing by reading the following slide
Bypassing is reading a value from an intermediate source. What does the arrow stand for? Does it mean that X is executed after M in the sequence? How does it work?
Bypassing means that the data available at one stage is passed directly to the stage that needs it. For example, in the first case (MX bypass), the output of the operation ADD r2, r3 is available at the M stage but has not yet been written back to its destination, r1. The SUB instruction expects one of its operands to be in r1. Since that r1 value is produced by the ADD, and "we" know that this same r1 is needed by the SUB, we don't need to wait until the ADD's writeback stage (W) is complete. "We" can simply bypass the data to the SUB instruction. The same goes for the WX bypass.
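The selection logic described above can be sketched in a few lines of Python. This is only an illustration under assumed names (select_operand, and the (dest_reg, value) tuples for the M and W stages), not a model of any particular pipeline. The instruction in X takes its operand from the youngest older instruction that writes the register, and falls back to the register file otherwise.

```python
def select_operand(src_reg, regfile, m_stage=None, w_stage=None):
    """Pick the value for src_reg needed by the instruction in the X stage.

    m_stage / w_stage are (dest_reg, value) for the instructions currently
    in the M and W stages, or None if that stage does not write a register.
    Returns (value, source) where source is "MX", "WX" or "RF".
    """
    # The M stage holds the younger of the two older instructions,
    # so its result takes priority over the one in W.
    if m_stage is not None and m_stage[0] == src_reg:
        return m_stage[1], "MX"
    if w_stage is not None and w_stage[0] == src_reg:
        return w_stage[1], "WX"
    # No in-flight producer: the register file value is up to date.
    return regfile[src_reg], "RF"


regfile = {"r1": 0, "r2": 7, "r3": 3}       # r1 is stale in the register file
value, source = select_operand("r1", regfile, m_stage=("r1", 10))
```

Here the ADD result (10) sits in the M stage, so the SUB receives it through the MX path instead of the stale r1 in the register file.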

Immediate addressing mode used in instructions containing memory locations

Suppose we have:
MOV #NUM, R0
I understand that the hash sign (#) denotes immediate addressing mode. However, what I don't understand is what exactly gets stored in R0 in this case. Is it the actual address of NUM?

ARMv8 Foundation Model: switches and leds

I am trying to boot my small ARMv7 kernel (which runs just fine using qemu vexpress model) in ARMv8 Foundation Model v2.1. The model boots at level EL3 / 64 bits, and I managed to go down to level EL1 / 32 bits, but I encounter some issues (in a few words, the timer doesn't tick and some kprintf are missing, but that's not the issue here).
To debug my UART issue, I wanted to use the LEDs / switches provided by the model. I can read their value from software quite easily, but I can't write a new value to either of them. The kernel seems to hang. Here is minimal asm code that writes to the switches register:
.global Start
Start:
# we are in EL3 / 64 bits mode
# create the 0x1C010000 + 0x4 address of switches
mov x0, #4
movk x0, #0x1c01, lsl #16
# value to write
mov w1, #0xaa
# actual writing
strb w1, [x0]
It seems I am stuck at the strb instruction. For the record, if I replace strb with ldrb, I can correctly read and display the value of this register (I played with the --switches flag to be sure it worked).
Does anyone know what I am doing wrong here?
EDIT: thanks to unixsmurf's suggestions, I now know that I got a synchronous Data Abort exception with no level change, and that the reason is "Synchronous External Abort". I don't know how to investigate further; I guess I'll try ARM's forum.
Best,
V.
The ARM community finally solved the problem. The complete discussion can be found here.

Z80 memory refresh register

Me again with another innocuous Z80 question :-) The way my emulator core is currently structured, I am incrementing the lower 7 bits of the memory refresh register every time an opcode byte is fetched from memory - this means for multi-byte instructions, such as those that begin DD or FD, I am incrementing the register twice - or in the instance of an instruction such as RLC (IX+d) three times (as it is laid out opcode1-opcode2-d-opcode3).
Is this correct? I am unsure - the Z80 manual is a little unclear on this, as it says that CPDR (a two byte instruction) increments it twice, however the 'Memory Refresh Register' section merely says it increments after each instruction fetch. I have noticed that J80 (an emulator I checked as I'm not sure about this) only increments after the first opcode byte of an instruction.
Which is correct? I guess it is not hugely important in any case, but it would be nice to know :-) Many thanks.
Regards,
Phil Potter
The Zilog timing diagrams hold the answer to your question.
A refresh occurs during T3 and T4 of all M1 (opcode fetch) cycles.
In the case of single-opcode instructions, that's one refresh per instruction. For single-prefix instructions (prefixes are read using M1 cycles) that's two refreshes per instruction.
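As a sketch of the counting rule above (one increment per M1 cycle), the following Python models the R register update. The function names are assumptions for illustration. Only the low 7 bits count, as noted in the question; bit 7 is left unchanged.

```python
def increment_r(r):
    """Advance the low 7 bits of R by one, leaving bit 7 untouched."""
    return (r & 0x80) | ((r + 1) & 0x7F)


def r_after_m1_cycles(r, m1_cycles):
    """Apply one increment per M1 (opcode fetch) cycle."""
    for _ in range(m1_cycles):
        r = increment_r(r)
    return r


# Unprefixed instruction: one M1 cycle. Single-prefix instruction: two.
r_unprefixed = r_after_m1_cycles(0x00, 1)
r_single_prefix = r_after_m1_cycles(0x00, 2)
```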
For those weird DD-CB-disp-opcode and FD-CB-disp-opcode type instructions (weird because the displacement byte comes before the final opcode rather than after it), the number of refreshes is at least 3 (for the two prefixes and final opcode), but I'm not sure if the displacement byte is read as part of an M1 cycle (which would trigger another refresh) or a normal memory read cycle (no refresh). I'm inclined to believe the displacement byte is read in an M1 cycle for these instructions, but I'm not sure. I asked Sean Young about this; he wasn't sure either. Does anyone know for certain?
UPDATE:
I answered my own question re those weird DD-CB-disp-opcode and FD-CB-disp-opcode instructions. If you check Zilog's documentation for this type of instruction, such as RLC (IX+d), you'll note that the instruction requires 6 M-cycles and 23 T-cycles, broken down as: (4,4,3,5,4,3).
We know the first two M-cycles are M1 cycles to fetch the DD and CB prefixes (4 T-cycles each). The next M-cycle reads the displacement byte d. But that M-cycle uses only 3 T-cycles, not 4, so it can't be an M1 cycle; instead it's a normal Memory Read cycle.
Here's the breakdown of the RLC (IX+d) instruction's six M-cycles:
M1 cycle to read the 0xDD prefix (4 T-cycles)
M1 cycle to read the 0xCB prefix (4 T-cycles)
Memory Read cycle to read the displacement byte (3 T-cycles)
M1 cycle to fetch the 0x06 opcode and load IX into the ALU (5 T-cycles)
Memory Read cycle to calculate and read from address IX+d (4 T-cycles)
Memory Write cycle to calculate RLC and write the result to address IX+d (3 T-cycles)
(The RLC calculation overlaps M-cycles 5 and 6.)
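The breakdown above can be written down as data and checked against the documented totals. This is a small Python sketch; the cycle labels and names are assumptions for illustration, and treating the fourth M-cycle as an M1 cycle is this answer's reading rather than an established fact.

```python
# (kind, T-cycles) for each M-cycle of RLC (IX+d), as listed above.
RLC_IX_D_CYCLES = [
    ("M1", 4),     # fetch 0xDD prefix
    ("M1", 4),     # fetch 0xCB prefix
    ("READ", 3),   # read displacement byte
    ("M1", 5),     # fetch 0x06 opcode (assumed M1 per this answer)
    ("READ", 4),   # read from IX+d
    ("WRITE", 3),  # write result to IX+d
]


def total_t_cycles(cycles):
    """Sum the T-cycles over all M-cycles."""
    return sum(t for _kind, t in cycles)


def count_m1(cycles):
    """Count M1 cycles, i.e. the cycles that would trigger a refresh."""
    return sum(1 for kind, _t in cycles if kind == "M1")


m_cycles = len(RLC_IX_D_CYCLES)
t_cycles = total_t_cycles(RLC_IX_D_CYCLES)
m1_cycles = count_m1(RLC_IX_D_CYCLES)
```

The totals come out to 6 M-cycles and 23 T-cycles, matching the documentation, with three M1 cycles under this reading.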
Instructions of this type are unique in that they're the only Z80 instructions that have non-contiguous M1 cycles (M-cycles 1, 2 and 4 above). They're also the slowest!
Paul
Sean Young's Z80 Undocumented Features has a different story. Once for unprefixed, twice for a single prefix, also twice for a double prefix (DDCB only), and once for no-op prefix.
Block instructions of course affect R every time they run (and they run BC times).
I've seen a couple of comments now that these weird DDCB and FDCB instructions only increment the R register twice.
It's always been my assumption (and the way I implemented my Z80 emulator) that the R register is incremented at the end of every M1 cycle.
To recap, these weird DDCB and FDCB instructions are four bytes long:
DD CB disp opcode
FD CB disp opcode
It's clear that the two prefix opcodes are read using M1 cycles, causing the R register to be incremented at the end of each of those cycles.
It's also clear that the displacement byte that follows the CB prefix is read by a normal Read cycle, so the R register is not incremented at the end of that cycle.
That leaves the final opcode. If it's read by an M1 cycle, then either the R register is incremented at the end of the cycle, resulting in a total of 3 increments, or the Z80 special cases this M1 cycle and doesn't increment the R register.
There's another possibility. What if the final opcode is read by a normal Read cycle, like the displacement byte that preceded it, and not by an M1 cycle? That of course would also cause the R register to be incremented only twice for these instructions, and wouldn't require the Z80 to make an exception of not incrementing the R register at the end of every M1 cycle.
This might also make better sense in terms of the Z80's internal state. Once it switches to normal Read cycles to read an instruction's additional bytes (in this case the displacement byte following the CB prefix), it never switches back to M1 cycles until it starts the next instruction.
Can anyone test this on real Z80 hardware, to confirm the value of R register following one of these DDCB or FDCB instructions?
All references I can find online say that R is incremented once per instruction irrespective of its length.
