Pipelining exercise - pipeline

We run the above code in a 4-stage IN-ORDER pipeline with F, D, X, W stages, where X takes 4 pipelined cycles for ADD and 6 pipelined cycles for MUL. Assume NO forwarding (bypassing), i.e., we need to stall on every data dependency. How many cycles will the code take to execute?
The code and my answer are attached in the following picture. I think I should use Excel because it looks more organized.
The answer choices are 27, 28, 29, 30. I got 27. Is that right? What do you get?

It should take 28 cycles to execute. ADD uses 4 pipelined cycles and MUL uses 6 pipelined cycles in X, and there are 4 ADD instructions and 2 MUL instructions, which gives 4*4 + 2*6 = 28 cycles.

Related

What is an efficient method for transposing N timing arrays into one master array?

This is an extremely niche problem, so I'll do my best to explain it in words:
Say you have an operation that requires negligible time (in my case, stepping a stepper motor once by pulsing the pins). I want to coordinate the movement of 6 individual motors, each with its own acceleration curve, but actuate them with the same microcontroller. I also want the individual acceleration curves of the motors to be modifiable.
I'm using a Teensy 4.1, so this program will be written in Arduino language (near identical to C++).
My current approach to this problem is to generate six individual "delay" arrays, one for each motor. Essentially, each motor's speed is controlled by the delay in between each pulse, and the angular distance traveled by the number of steps, i.e., the number of elements in its delay array. Something like this:
1 (P20) 1 (P20) 1 (P20) 1
2 (P30) 2 (P30) 2 (P30) 2
Where a 1 or 2 is the respective motor's step and (PX) is a delay of X microseconds.
I would like to write some master transposition function that turns the above into this:
1,2 (P20) 1 (P10) 2 (P10) 1,2 (P20) 1 (P10) 2
This array, when read by my actuation code, would step motors 1 and 2 at the same time, wait 20 microseconds, step motor 1, 10 microseconds, step motor 2... etc.
It seems pretty simple when you do it for two motors, but for some reason I just can't wrap my head around making a completely modular version of this. In my case, it would need to merge 6 arrays into one.
I am also just wondering if anyone could think of a more elegant solution to this problem, as I am pretty new to programming and don't know all the features / capabilities of C++.
I have tried an iterative method where you keep track of the total delay elapsed so far, subtract it from the lowest upcoming total delay among the stepper arrays, append that difference to the master array, and update all the totals accordingly, but this approach always ends up too convoluted for me to follow.
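For what it's worth, here is a minimal sketch of that kind of merge (hypothetical names and types, not the poster's code): convert each motor's delay list into absolute pulse times, then repeatedly pull out the earliest pending time, grouping every motor due at that instant into one event that records the wait since the previous event.

#include <algorithm>
#include <cstdint>
#include <vector>

struct Event {
    uint8_t  motorMask;      // bit i set -> pulse motor i on this event (up to 8 motors)
    uint32_t delayAfterUs;   // wait this long before the next event
};

// delays[i] holds motor i's delays between successive pulses (microseconds).
std::vector<Event> mergeSchedules(const std::vector<std::vector<uint32_t>>& delays) {
    const size_t n = delays.size();
    std::vector<uint64_t> nextTime(n, 0);   // every motor pulses at t = 0
    std::vector<size_t>   index(n, 0);      // position within each delay array
    std::vector<bool>     done(n, false);

    std::vector<Event> out;
    uint64_t now = 0;
    while (true) {
        // Earliest pending pulse time across all motors.
        uint64_t t = UINT64_MAX;
        for (size_t i = 0; i < n; ++i)
            if (!done[i]) t = std::min(t, nextTime[i]);
        if (t == UINT64_MAX) break;          // all motors finished

        // Group every motor due at time t into a single event.
        Event ev{0, 0};
        for (size_t i = 0; i < n; ++i) {
            if (!done[i] && nextTime[i] == t) {
                ev.motorMask |= uint8_t(1u << i);
                if (index[i] < delays[i].size())
                    nextTime[i] += delays[i][index[i]++];
                else
                    done[i] = true;          // no more pulses for this motor
            }
        }
        // The previous event waits (t - now) before this one fires.
        if (!out.empty()) out.back().delayAfterUs = uint32_t(t - now);
        now = t;
        out.push_back(ev);
    }
    return out;
}

The actuation loop would then walk the returned vector, pulse every motor whose bit is set in motorMask, and wait delayAfterUs microseconds before moving to the next event.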
Might I offer a different solution:
Use something like FreeRTOS and make individual threads (tasks) to control each motor. That way you can define the delays for each motor individually without the extra complication of building a timing array.
The ESP32 has FreeRTOS available in the Arduino core and it is pretty easy to use. You should be able to find a port for the Teensy 4.1.
Here is a really nice tutorial for blinking LEDs at different rates. Your code should have a similar approach.
freeRTOS ESP32 tutorial
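As a rough illustration of that idea (a sketch only - it assumes a board where the FreeRTOS API is available from an Arduino sketch, e.g. the ESP32 core or a Teensy FreeRTOS port; the pin numbers, stack sizes, and delay tables are made-up placeholders):

#include <Arduino.h>

struct MotorProfile {
    int stepPin;
    const uint32_t* delaysUs;  // delay after each pulse, in microseconds
    size_t steps;
};

void motorTask(void* arg) {
    const MotorProfile* m = static_cast<const MotorProfile*>(arg);
    for (size_t i = 0; i < m->steps; ++i) {
        digitalWrite(m->stepPin, HIGH);      // one step pulse
        delayMicroseconds(2);
        digitalWrite(m->stepPin, LOW);
        delayMicroseconds(m->delaysUs[i]);   // this motor's own acceleration delay
    }
    vTaskDelete(nullptr);                    // this motor is done
}

const uint32_t curve1[] = {2000, 1500, 1200, 1000};
const uint32_t curve2[] = {3000, 2500, 2000, 1800};
MotorProfile m1 = {2, curve1, 4};
MotorProfile m2 = {3, curve2, 4};

void setup() {
    pinMode(m1.stepPin, OUTPUT);
    pinMode(m2.stepPin, OUTPUT);
    // One independent task per motor; FreeRTOS interleaves them for you.
    xTaskCreate(motorTask, "motor1", 2048, &m1, 1, nullptr);
    xTaskCreate(motorTask, "motor2", 2048, &m2, 1, nullptr);
}

void loop() {}

Note that delayMicroseconds busy-waits, so for microsecond-level precision you may still prefer a hardware timer or the merged-schedule approach sketched above; for millisecond-scale delays, vTaskDelay would be the more idiomatic call.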

C++ functions and MPI programming

From what I have learned in my supercomputing class, I know that MPI is a communication (and data-passing) interface.
I'm confused about what happens when you run a function in a C++ program and want each processor to perform a specific task.
For example, take a prime number search (very popular for supercomputers). Say I have a range of values (531-564, some arbitrary range) and 50 processes on which I could run a series of evaluations for each number. If root (process 0) wants to examine 531, I can use 8 processes (1-8) to evaluate its prime status: if the number is divisible by any number from 2 to 9 with a remainder of 0, then it is not prime.
Is it possible in MPI, which passes data to each process, to have the processes perform these actions?
The hardest part for me is understanding that if I perform an action in the original C++ program, the work could be allocated across several different processes - so how do I structure this in MPI? Or is my understanding completely wrong? If so, how am I supposed to think about this correctly?
The big idea is passing data to a process versus sending a function to a process. I'm fairly certain I'm wrong, but I'm trying to back-track to fix my thinking.
Each MPI process is running the same program, but that doesn't mean that they are doing the same thing. Different processes can be running different branches of the code, depending on the id (or "rank") of the process, and in effect be completely independent. Like any distributed computation, the actors do need to agree on how they will communicate.
The most basic strategy in MPI is scatter-gather, where the "master" process (usually the one with rank 0) will split an array of work equally amongst the peers (including the master process itself) by having them all call scatter, the peers will do the work, then all peers will call gather to send the results back to master.
In your prime algorithm example: build an array of integers, "scatter" it to all the peers, have each peer run through its slice saving 1 if a number is prime and 0 if it is not, then "gather" the results back to the master. [In this particular example, since the input data is completely predictable from the process rank, the scatter step is unnecessary, but we will do it anyway.]
As pseudo-code:
main():
    int n = 100, k = number of processes
    int x[n], local[n/k], result[n/k]
    MPI_init()
    // prepare data on master
    if rank == 0:
        for i in 1 ... n: x[i] = i
    // send n/k elements of x on root to local on each process in world
    MPI_scatter(x, n/k, int, local, n/k, int, root, world)
    for i in 1 ... n/k:
        result[i] = 1                        // assume prime
        if 2 divides local[i]: result[i] = 0
        if 3 divides local[i]: result[i] = 0
        if 5 divides local[i]: result[i] = 0
        if 7 divides local[i]: result[i] = 0
    // gather n/k results from local on each process in world back into x on root
    MPI_gather(result, n/k, int, x, n/k, int, root, world)
    // print results
    if rank == 0:
        for i in 1 ... n: print i if x[i] == 1
    MPI_finalize()
There are lots of details to fill in, such as proper declarations, dealing with the fact that some ranks will have fewer elements than others, using proper C syntax, etc., but getting them right doesn't help explain the overall picture.
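For reference, here is one way the same pattern looks as real code (a hedged sketch, not the only way to write it: it assumes the process count divides n evenly, and - like the pseudo-code - trial division by 2, 3, 5, and 7 is only a valid primality test for numbers below 121):

#include <mpi.h>
#include <cstdio>
#include <initializer_list>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int n = 100;               // test the numbers 1..n
    const int chunk = n / nprocs;    // numbers handled by each rank

    std::vector<int> x(n), results(n);
    if (rank == 0)
        for (int i = 0; i < n; ++i) x[i] = i + 1;   // prepare data on master

    // Each rank receives 'chunk' consecutive numbers from rank 0.
    std::vector<int> local(chunk), local_result(chunk);
    MPI_Scatter(x.data(), chunk, MPI_INT, local.data(), chunk, MPI_INT,
                0, MPI_COMM_WORLD);

    for (int i = 0; i < chunk; ++i) {
        int v = local[i];
        int prime = (v > 1);                        // 1 is not prime
        for (int d : {2, 3, 5, 7})
            if (v != d && v % d == 0) prime = 0;    // divisible -> composite
        local_result[i] = prime;
    }

    // Collect every rank's chunk of results back on rank 0.
    MPI_Gather(local_result.data(), chunk, MPI_INT, results.data(), chunk,
               MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0)
        for (int i = 0; i < n; ++i)
            if (results[i]) std::printf("%d\n", i + 1);

    MPI_Finalize();
    return 0;
}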
More fine-grained synchronization and communication is possible using direct send/recv between processes. Such programs are harder to write because the different processes may be in different states at any moment. In particular, if process a calls MPI_Send to process b, then process b had better be calling MPI_Recv from a.
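As a minimal illustration of that pairing (a sketch; the payload and tag are arbitrary):

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = 42;
    if (rank == 0) {
        // The send on rank 0 is matched by the receive on rank 1.
        MPI_Send(&value, 1, MPI_INT, /*dest=*/1, /*tag=*/0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, /*source=*/0, /*tag=*/0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        std::printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}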

Error in MPI broadcast

Sorry for the long post. I did read some other MPI-broadcast-related errors, but I couldn't find out why my program is failing.
I am new to MPI and I am facing this problem. First I will explain what I am trying to do:
My declarations:
ROWTAG 400
COLUMNTAG 800
Create a 2 x 2 Cartesian topology.
Rank 0 has the whole matrix. It wants to distribute parts of the matrix to all the processes in the 2 x 2 Cartesian topology. For now, instead of a matrix I am just dealing with integers. So for process P(i,j) in the 2 x 2 Cartesian topology (i - row, j - column), I want it to receive (ROWTAG + i) in one message and (COLUMNTAG + j) in another message.
My strategy to do so is:
Processes: P(0,0) , P(0,1), P(1,0), P(1,1)
P(0,0) has all the initial data.
P(0,0) sends (ROWTAG+1) (in this case 401) to P(1,0) - in essence P(1,0) is responsible for disseminating the information related to row 1 to all the processes in row 1 - I just used a blocking send
P(0,0) sends (COLUMNTAG+1) (in this case 801) to P(0,1) - in essence P(0,1) is responsible for disseminating the information related to column 1 to all the processes in column 1 - again a blocking send
For each process, I made a row_group containing all the processes in that row and out of this created a row_comm (communicator object)
For each process, I made a col_group containing all the processes in that column and out of this created a col_comm (communicator object)
At this point, P(0,0) has given the information related to row i to process P(i,0), and the information related to column j to P(0,j). I call P(i,0) and P(0,j) the row_head and col_head respectively.
For process P(i,j), P(i,0) provides the information related to row i, and P(0,j) provides the information related to column j.
I used a broadcast call:
MPI_Bcast(&row_data,1,MPI_INT,row_head,row_comm)
MPI_Bcast(&col_data,1,MPI_INT,col_head,col_comm)
Please find my code here: http://pastebin.com/NpqRWaWN
Here is the error I see:
* An error occurred in MPI_Bcast
on communicator MPI COMMUNICATOR 5 CREATE FROM 3
MPI_ERR_ROOT: invalid root
* MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
Also please let me know if there is any better way to distribute the matrix data.
There are several errors in your program. First, row_Ranks is declared with one element too few, so when writing to it you possibly overwrite other stack variables:
int col_Ranks[SIZE], row_Ranks[SIZE-1];
// ^^^^^^
On my test system the program just hangs because of that.
Second, you create new subcommunicators out of matrixComm but you use rank numbers from the latter to address processes in the former when performing the broadcast. That doesn't work. For example, in a 2x2 Cartesian communicator ranks range from 0 to 3. In any column- or row-wise subgroup there are only two processes with ranks 0 and 1 - there is neither rank 2 nor rank 3. If you take a look at the value of row_head across the ranks, it is 2 in two of them, hence the error.
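One way out (a sketch of the general idea, not a patch of the pastebin code): build the row and column communicators with MPI_Comm_split, using the other coordinate as the split key, so that the row head (the process in column 0) is always rank 0 of its row communicator and the column head (the process in row 0) is always rank 0 of its column communicator; the root argument of both broadcasts is then simply 0. MPI_Cart_sub would work equally well. Run with exactly 4 processes for the 2 x 2 topology.

#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    // 2 x 2 Cartesian topology, as in the question.
    int dims[2] = {2, 2}, periods[2] = {0, 0}, coords[2];
    MPI_Comm cartComm;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cartComm);

    int cartRank;
    MPI_Comm_rank(cartComm, &cartRank);
    MPI_Cart_coords(cartComm, cartRank, 2, coords);  // coords[0]=row, coords[1]=col

    // Same row together, ordered by column; same column together, ordered by row.
    MPI_Comm rowComm, colComm;
    MPI_Comm_split(cartComm, /*color=*/coords[0], /*key=*/coords[1], &rowComm);
    MPI_Comm_split(cartComm, /*color=*/coords[1], /*key=*/coords[0], &colComm);

    int row_data = 400 + coords[0];   // ROWTAG + i; only the head's value matters
    int col_data = 800 + coords[1];   // COLUMNTAG + j

    // The heads are rank 0 inside their subcommunicators, so root is just 0.
    MPI_Bcast(&row_data, 1, MPI_INT, 0, rowComm);
    MPI_Bcast(&col_data, 1, MPI_INT, 0, colComm);

    MPI_Finalize();
    return 0;
}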
For a much better way to distribute the data, you should refer to this extremely informative answer.

Z80 memory refresh register

Me again with another innocuous Z80 question :-) The way my emulator core is currently structured, I increment the lower 7 bits of the memory refresh register every time an opcode byte is fetched from memory. This means that for multi-byte instructions, such as those that begin with DD or FD, I increment the register twice - or, for an instruction such as RLC (IX+d), three times (as it is laid out opcode1-opcode2-d-opcode3).
Is this correct? I am unsure - the Z80 manual is a little unclear on this: it says that CPDR (a two-byte instruction) increments it twice, yet the 'Memory Refresh Register' section merely says it is incremented after each instruction fetch. I have noticed that J80 (an emulator I checked, as I'm not sure about this) only increments after the first opcode byte of an instruction.
Which is correct? I guess it is not hugely important in any case, but it would be nice to know :-) Many thanks.
Regards,
Phil Potter
The Zilog timing diagrams hold the answer to your question.
A refresh occurs during T3 and T4 of all M1 (opcode fetch) cycles.
In the case of single-opcode instructions, that's one refresh per instruction. For single-prefix instructions (prefixes are read using M1 cycles) that's two refreshes per instruction.
For those weird DD-CB-disp-opcode and FD-CB-disp-opcode type instructions (weird because the displacement byte comes before the final opcode rather than after it), the number of refreshes is at least 3 (for the two prefixes and final opcode), but I'm not sure if the displacement byte is read as part of an M1 cycle (which would trigger another refresh) or a normal memory read cycle (no refresh). I'm inclined to believe the displacement byte is read in an M1 cycle for these instructions, but I'm not sure. I asked Sean Young about this; he wasn't sure either. Does anyone know for certain?
UPDATE:
I answered my own question regarding those weird DD-CB-disp-opcode and FD-CB-disp-opcode instructions. If you check Zilog's documentation for this type of instruction, such as RLC (IX+d), you'll note that the instruction requires 6 M-cycles and 23 T-cycles, broken down as (4,4,3,5,4,3).
We know the first two M-cycles are M1 cycles to fetch the DD and CB prefixes (4 T-cycles each). The next M-cycle reads the displacement byte d. But that M-cycle uses only 3 T-cycles, not 4, so it can't be an M1 cycle; instead it's a normal Memory Read cycle.
Here's the breakdown of the RLC (IX+d) instruction's six M-cycles:
M1 cycle to read the 0xDD prefix (4 T-cycles)
M1 cycle to read the 0xCB prefix (4 T-cycles)
Memory Read cycle to read the displacement byte (3 T-cycles)
M1 cycle to fetch the 0x06 opcode and load IX into the ALU (5 T-cycles)
Memory Read cycle to calculate and read from address IX+d (4 T-cycles)
Memory Write cycle to calculate RLC and write the result to address IX+d (3 T-cycles)
(The RLC calculation overlaps M-cycles 5 and 6.)
Instructions of this type are unique in that they're the only Z80 instructions with non-contiguous M1 cycles (M-cycles 1, 2, and 4 above). They're also the slowest!
Paul
Sean Young's Z80 Undocumented Features tells a different story: once for an unprefixed instruction, twice for a single prefix, also twice for a double prefix (DDCB only), and once for a no-op prefix.
Block instructions of course affect R every time they run (and they run BC times).
I've seen a couple of comments now that these weird DDCB and FDCB instructions only increment the R register twice.
It's always been my assumption (and the way I implemented my Z80 emulator) that the R register is incremented at the end of every M1 cycle.
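For what it's worth, the usual emulator idiom for that (a small sketch; onM1Cycle is a made-up hook name) keeps the low 7 bits counting while preserving bit 7, which is only ever written by LD R,A:

#include <cstdint>

uint8_t R = 0;   // memory refresh register

// Called at the end of every M1 (opcode fetch) cycle: only the low 7 bits
// count, and bit 7 is left alone because only LD R,A can change it.
void onM1Cycle() {
    R = (R & 0x80) | ((R + 1) & 0x7F);
}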
To recap, these weird DDCB and FDCB instructions are four bytes long:
DD CB disp opcode
FD CB disp opcode
It's clear that the two prefix opcodes are read using M1 cycles, causing the R register to be incremented at the end of each of those cycles.
It's also clear that the displacement byte that follows the CB prefix is read by a normal Read cycle, so the R register is not incremented at the end of that cycle.
That leaves the final opcode. If it's read by an M1 cycle, then either the R register is incremented at the end of the cycle, resulting in a total of 3 increments, or the Z80 special cases this M1 cycle and doesn't increment the R register.
There's another possibility. What if the final opcode is read by a normal Read cycle, like the displacement byte that preceded it, and not by an M1 cycle? That of course would also cause the R register to be incremented only twice for these instructions, and wouldn't require the Z80 to make an exception of not incrementing the R register at the end of every M1 cycle.
This might also make better sense in terms of the Z80's internal state. Once it switches to normal Read cycles to read an instruction's additional bytes (in this case the displacement byte following the CB prefix), it never switches back to M1 cycles until it starts the next instruction.
Can anyone test this on real Z80 hardware, to confirm the value of R register following one of these DDCB or FDCB instructions?
All references I can find online say that R is incremented once per instruction irrespective of its length.

Mysterious combination

I decided to learn concurrency and wanted to find out in how many ways the instructions from two different processes could overlap. The code for both processes is just a 10-iteration loop with 3 instructions performed in each iteration. I figured the problem consists of leaving X instructions fixed at a point and then fitting the other X instructions from the other process into the spaces, taking into account that they must stay ordered (instruction 4 of process B must always come before instruction 20 of process B).
I wrote a program to count this number, and looking at the results I found that the solution is n choose k, where k is the number of instructions executed throughout the whole loop of one process (so for 10 iterations it would be 30) and n is 2k (2 processes). In other words, n objects with n/2 of them fixed, and the other n/2 to be fitted into the spaces without losing their order.
OK, problem solved. No, not really: I have no idea why this is. I understand that the definition of a combination is: in how many ways can you take k elements from a group of n such that all the groups are different but the order in which you take the elements doesn't matter. In this case we have n elements and we are actually taking them all, because all the instructions are executed (n C n).
If one explains it by saying that there are 2k objects in a bag, k blue (A) and k red (B), and you take k objects from the bag, you are still only taking k instructions when 2k instructions are actually executed. Can you please shed some light on this?
Thanks in advance.
FWIW it can be viewed like this: you have a bag with k blue and k red balls. Balls of the same color are indistinguishable (in analogy with the restriction that the order of instructions within the same process/thread is fixed - which is not true in modern processors, btw, but let's keep it simple for now). How many different ways can you pull all the balls from the bag?
My combinatorial skills are quite rusty, but my first guess is (2k)!/(k!k!), which, according to Wikipedia, indeed equals the binomial coefficient (2k choose k).
For n processes, this can be generalized by having balls of n different colors in the bag.
Update: Note that in the strict sense, this models only the situation where the different processes are executed on a single processor, so all instructions from all processes must be ordered linearly at the processor level. In a multiprocessor environment, several instructions can literally be executed at the same time.
Generally, I agree with Péter's answer, but since it does not seem to have fully clicked for the OP, here's my shot at it (purely from a mathematical/combinatorial standpoint).
You have 2 sets of 30 (k) instructions that you're putting together, for a total of 60 (n) instructions. Since each set of 30 must be kept in order, we don't need to track which instruction within each set, just which set an instruction is from. So, we have 60 "slots" in which to place 30 instructions from one set (say, red) and 30 instructions from the other set (say, blue).
Let's start by placing the 30 red instructions into the 60 slots. There are (60 choose 30) = 60!/(30!30!) ways to do this (we're choosing which 30 slots of the 60 are filled by red instructions). Now, we still have the 30 blue instructions, but we only have 30 open slots left. There is (30 choose 30) = 30!/(30!0!) = 1 way to place the blue instructions in the remaining slots. So, in total, there are (60 choose 30) * (30 choose 30) = (60 choose 30) * 1 = (60 choose 30) ways to do it.
Now, let's suppose that instead of 2 sets of 30, you have 3 sets (red, green, blue) of k instructions. You have a total of 3k slots to fill. First, place the red ones: (3k choose k) = (3k)!/(k!(3k-k)!) = (3k)!/(k!(2k)!). Now, place the green ones into the remaining 2k slots: (2k choose k) = (2k)!/(k!k!). Finally, place the blue ones into the last k slots: (k choose k) = k!/(k!0!) = 1. In total: (3k choose k) * (2k choose k) * (k choose k) = ( (3k)! * (2k)! * k! ) / ( k!(2k)! * k!k! * k!0! ) = (3k)!/(k!k!k!).
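If it helps to see the count come out of a direct enumeration rather than the slot-choosing argument, here is a small sketch (not the OP's original counting program): the number of interleavings of two ordered sequences of lengths a and b satisfies f(a,b) = f(a-1,b) + f(a,b-1), because the first remaining slot is filled by either the next instruction of the first process or the next instruction of the second, and this reproduces (a+b choose a).

#include <cstdint>
#include <cstdio>
#include <vector>

// f(a, b) = f(a-1, b) + f(a, b-1), with f(a, 0) = f(0, b) = 1.
uint64_t interleavings(int a, int b) {
    std::vector<std::vector<uint64_t>> f(a + 1, std::vector<uint64_t>(b + 1, 1));
    for (int i = 1; i <= a; ++i)
        for (int j = 1; j <= b; ++j)
            f[i][j] = f[i - 1][j] + f[i][j - 1];
    return f[a][b];
}

int main() {
    // For two 30-instruction processes this prints the same value as (60 choose 30).
    std::printf("%llu\n", (unsigned long long)interleavings(30, 30));
    return 0;
}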
As further extensions (though I'm not going to provide a full explanation):
if you have 3 sets of instructions with length a, b, and c, the number of possibilities is (a+b+c)!/(a!b!c!).
if you have n sets of instructions where the ith set has ki instructions, the number of possibilities is (k1+k2+...+kn)!/(k1!k2!...kn!).
Péter's answer is fine enough, but it doesn't explain just why concurrency is difficult. That's because, more and more often nowadays, you've got multiple execution units available (be they cores, CPUs, nodes, computers, whatever), which in turn means that the possibilities for overlap between instructions increase still further; there's no guarantee that what happens can be modeled correctly by any conventional interleaving.
This is why it is important to use semaphores/mutexes correctly, and why memory barriers matter: all of these things turn the truly nasty picture into something that is far easier to understand. But because mutexes reduce the number of possible executions, they also reduce the overall performance and potential efficiency. It's definitely tricky, and that in turn is why it is far better if you can work in terms of message passing between threads of activity that do not otherwise interact; it's easier to understand, and having fewer synchronizations is better.
