I am trying to understand the concept of bypassing by reading the following slide.
Bypassing is reading a value from an intermediate source. What does the arrow stand for? Does it mean that X is executed after M in the sequence? How does it work?
Bypassing means the data at that stage is passed to the stage that requires it. For example, in the first case (MX bypass),
the result of the ADD is available at the M stage but has not yet been written back to its destination register r1. The SUB instruction expects one of its operands to be in r1. Since this r1 value is produced by the ADD, and "we" know that this same r1 is needed by the SUB, we don't need to wait until the ADD's writeback stage W completes. "We" can simply bypass the data to the SUB instruction. The same goes for the WX bypass.
I am new to both Fortran 90 and MPI. I have a loop that iterates differently for each individual process. Inside of that, I have a nested loop, and it is here that I make the computations I want from the elements of the respective loops. However, I want to send all of this data (the x, the y, and the values computed from x and y) to my root process, 0. From there, I want to write all of the data to the same file in the format 'x y computation'.
program fortranMPI
    use mpi
    implicit none
    ! GLOBAL VARIABLE DECLARATION
    real :: step = 0.5, x, y, comput
    integer :: count = 0, finalCount = 5, outFile = 20, i
    ! MPI
    integer :: ierr, myrank, mysize, status(MPI_STATUS_SIZE)

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, mysize, ierr)

    if (myrank == 0) then
        ! I want to gather my data here?
    end if

    do i = 1, mysize, 1
        if (myrank == i) then
            x = -2. + (myrank - 1.)*step
            do while (x <= 2.)
                y = -2.
                do while (y <= 2.)
                    ! Here is where I am trying to send my data!
                    y = y + step
                end do
                x = x + (mysize - 1)*step
            end do
        end if
    end do

    call MPI_FINALIZE(ierr)
end program fortranMPI
I keep getting stuck trying to pass the data! If someone could help me out, that would be great! Sorry if this is simpler than I am making it; I am still trying to figure out Fortran/MPI. Thanks in advance!
First of all, the program as written doesn't seem to make sense. If you can be more specific about what you want to do, I can help further.
Now, usually, this if (myrank == 0) statement before the calculations is where you send your data to the rest of the processes. Since process 0 will be sending data, you have to add code right after that so the other processes receive the data. You may also need to add an MPI_BARRIER (call MPI_BARRIER) right before the start of the calculations, just to make sure the data has reached every process.
As for the calculation part, you also have to decide not only where you send data, but also where the data is received and whether you need any synchronization on the communication. This has to do with the design of your program, so you are the one who knows exactly what you want to do.
The most common commands for sending and receiving data are MPI_SEND and MPI_RECV.
Those are blocking commands, which means the communication is synchronized: one Send command must be matched with one Receive command before both processes can continue.
There are also non blocking commands, you can find them all here:
http://www.mpich.org/static/docs/v3.1/www3/
As for the MPI_GATHER command, this is used to gather data from a group of processes. It will only help you when you use more than 2 processes to further accelerate your program. Apart from that, MPI_GATHER is used when you want to gather data and store it in array fashion, and it's only worth using when you are going to receive lots of data, which is definitely not the case here.
Finally, about printing out the results: I'm not sure what you are asking is possible. Trying to open the same file handle from 2 processes is probably going to lead to OS errors. Usually, for printing out the results, you have rank 0 do that, right after every other process has finished.
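This is not MPI, but the shape of the pattern (every worker computes its triples, a single collector receives them all and alone writes the file) can be sketched with Python's stdlib multiprocessing as an analogy; the partitioning of x and the computation x*x + y*y are invented placeholders:

```python
# Analogy in Python's multiprocessing (not MPI): each worker computes its
# (x, y, f(x, y)) triples and sends them to a single collector, which alone
# writes the file -- the same role rank 0 would play with MPI_SEND/MPI_RECV.
from multiprocessing import Process, Queue

def worker(rank, size, queue):
    step = 0.5
    results = []
    x = -2.0 + rank * step              # interleave x values across workers
    while x <= 2.0:
        y = -2.0
        while y <= 2.0:
            results.append((x, y, x * x + y * y))  # placeholder computation
            y += step
        x += size * step
    queue.put(results)                  # one "send" per worker

if __name__ == "__main__":
    size = 4
    queue = Queue()
    procs = [Process(target=worker, args=(r, size, queue)) for r in range(size)]
    for p in procs:
        p.start()
    collected = [t for _ in procs for t in queue.get()]   # "rank 0" receives all
    for p in procs:
        p.join()
    with open("results.txt", "w") as f:                   # only one writer
        for x, y, c in sorted(collected):
            f.write(f"{x} {y} {c}\n")
```

In MPI the queue.put/queue.get pair would become matched MPI_SEND/MPI_RECV calls (or a single MPI_GATHERV), but the single-writer structure is the same.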
Currently I'm trying to implement the Deep Belief Network. But I've met a very strange problem. My source code can be found here: https://github.com/mistree/GoDeep/blob/master/GoDeep/
I first implemented the RBM using CD, and it works perfectly (using Go's concurrency features, it's quite fast). Then I started to implement a normal feed-forward network with back propagation, and that's where the strange thing happens. It seems very unstable. When I run it with an XOR-gate test it sometimes fails; only when I set the hidden layer to 10 or more nodes does it never fail. Below is how I calculate it:
Step 1 : calculate all the activation with bias
Step 2 : calculate the output error
Step 3 : back propagate the error to each node
Step 4 : calculate the delta weight and bias for each node with momentum
For Steps 1 to 4 I do a full batch and sum up these delta weights and biases
Step 5 : apply the averaged delta weight and bias
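The five steps above can be sketched in pure Python as a minimal full-batch trainer on XOR (the 2-2-1 layer sizes, learning rate, and momentum value here are illustrative choices, not taken from the question's code):

```python
# Minimal sketch of steps 1-5: a 2-2-1 network trained on XOR with full-batch
# back propagation and momentum; all hyperparameters are illustrative.
import math, random

random.seed(1)
NI, NH, NO = 2, 2, 1                        # input, hidden, output sizes
w1 = [[random.uniform(-1, 1) for _ in range(NI + 1)] for _ in range(NH)]
w2 = [[random.uniform(-1, 1) for _ in range(NH + 1)] for _ in range(NO)]
v1 = [[0.0] * (NI + 1) for _ in range(NH)]  # momentum terms
v2 = [[0.0] * (NH + 1) for _ in range(NO)]
sig = lambda z: 1.0 / (1.0 + math.exp(-z))
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]

def forward(x):
    h = [sig(sum(w * xi for w, xi in zip(row, x + [1.0]))) for row in w1]
    o = [sig(sum(w * hi for w, hi in zip(row, h + [1.0]))) for row in w2]
    return h, o

def epoch(lr=0.5, mom=0.9):
    g1 = [[0.0] * (NI + 1) for _ in range(NH)]   # summed weight/bias deltas
    g2 = [[0.0] * (NH + 1) for _ in range(NO)]
    loss = 0.0
    for x, t in data:
        h, o = forward(x)                         # step 1: activations w/ bias
        d_out = [(o[k] - t) * o[k] * (1 - o[k]) for k in range(NO)]  # step 2
        d_hid = [h[j] * (1 - h[j]) * sum(d_out[k] * w2[k][j] for k in range(NO))
                 for j in range(NH)]              # step 3: back propagate error
        for k in range(NO):                       # step 4: accumulate deltas
            for j, hj in enumerate(h + [1.0]):
                g2[k][j] += d_out[k] * hj
        for j in range(NH):
            for i, xi in enumerate(x + [1.0]):
                g1[j][i] += d_hid[j] * xi
        loss += 0.5 * (o[0] - t) ** 2
    n = len(data)
    for w, g, v in ((w1, g1, v1), (w2, g2, v2)):  # step 5: averaged update
        for r in range(len(w)):
            for c in range(len(w[r])):
                v[r][c] = mom * v[r][c] - lr * g[r][c] / n
                w[r][c] += v[r][c]
    return loss / n

losses = [epoch() for _ in range(5000)]
print(losses[0], losses[-1])   # the loss should drop over training
```

Note that with only 2 hidden nodes this can still land in a local minimum depending on the random seed, which matches the instability described in the question; more hidden nodes make a bad initialization much less likely.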
I followed the tutorial here http://ufldl.stanford.edu/wiki/index.php/Backpropagation_Algorithm
And normally it works if I give it more hidden layer nodes. My test code is here https://github.com/mistree/GoDeep/blob/master/Test.go
So I thought it should work, and started to implement the DBN by combining the RBM and the normal NN. However, the result then becomes really bad. It can't even learn an XOR gate in 1000 iterations, and sometimes it goes totally wrong. I tried to debug this, so after the PreTrain of the DBN I do a reconstruction. Most times the reconstruction looks good, but the back propagation fails even when the PreTrain result is perfect.
I really don't know what's wrong with the back propagation. I must have misunderstood the algorithm or made some big mistake in the implementation.
If possible, please run the test code and you'll see how weird it is. The code itself is quite readable. Any hint would be a great help. Thanks in advance.
I remember Hinton saying you can't train RBMs on XOR; something about the vector space doesn't allow a two-layer network to work. Deeper networks have less linear properties that allow it to work.
I am working my way though exercises relating to superscalar architecture. I need some help conceptualizing the answer to this question:
“If you ever get confused about what a register renamer has to do, go back to the assembly code you're executing, and ask yourself what has to happen for the right result to be obtained. For example, consider a three-way superscalar machine renaming these three instructions concurrently:
ADDI R1, R1, R1
ADDI R1, R1, R1
ADDI R1, R1, R1
If the value of R1 starts out as 5, what should its value be when this sequence has executed?”
I can look at that and see that, OK, the final value of R1 should be 40. How would a three-way superscalar machine reach this answer, though? If I understand them correctly, in this three-way superscalar pipeline, these three instructions would be fetched in parallel. Meaning, you would have a hazard right from the start, right? How should I conceptualize the answer to this problem?
EDIT 1: When decoding these instructions, the three-way superscalar machine would, by necessity, have to perform register renaming to get the following instruction sequence, correct?
ADDI R1, R2, R3
ADDI R4, R5, R6
ADDI R1, R2, R3
Simply put - you won't be able to perform these instructions together. However, the goal of this example doesn't seem to be about hazards (namely, detecting that these instructions are interdependent and must be performed serially with sufficient stalls); it's about renaming - it serves to show that a single logical register (R1) will have multiple physical "versions" in flight simultaneously in the pipeline. The original one would have the value 5 (let's call it "p1"), but you'll also need to allocate one for the result of the first ADD ("p2"), to be used as a source for the second, and again for the results of the second and third ADD instructions ("p3" and "p4").
Since this processor decodes and attempts to issue these 3 instructions simultaneously, you can see that you can't just have R1 as the source for all - that would prevent each of them from using the correct mid-calculation value, so you need to rename them. The important part is that p1..p4 as we've dubbed them, can be allocated simultaneously, and the dependencies would be known at the time of issue - long before each of them is populated with the result. This essentially decouples the front-end from the execution back-end, which is important for the performance flexibility in modern CPUs as you may have bottlenecks anywhere.
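The renaming described above can be sketched as a tiny rename table (the physical-register names p1..p4 follow the answer's labeling; everything else is illustrative):

```python
# Sketch of register renaming: each write to logical R1 gets a fresh physical
# register, so all three ADDIs can be renamed in the same cycle and the
# dependency chain is explicit before any result has been computed.
phys_regs = {"p1": 5}            # p1 holds R1's initial value of 5
rename_table = {"R1": "p1"}      # logical -> physical mapping
next_phys = 2

renamed = []
for _ in range(3):               # ADDI R1, R1, R1, three times
    src = rename_table["R1"]     # both sources read the current mapping
    dst = f"p{next_phys}"        # allocate a new physical register for the dest
    next_phys += 1
    rename_table["R1"] = dst     # later instructions now see the new version
    renamed.append((dst, src, src))

for dst, s1, s2 in renamed:      # "execute" once operands become ready
    phys_regs[dst] = phys_regs[s1] + phys_regs[s2]

print(renamed)                            # [('p2','p1','p1'), ('p3','p2','p2'), ('p4','p3','p3')]
print(phys_regs[rename_table["R1"]])      # 40
```

The key point is that the rename loop needs only the table, not the values: all three instructions are renamed before any addition happens, and the p2 -> p3 -> p4 chain forces them to execute serially while the front end keeps going.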
Me again with another innocuous Z80 question :-) The way my emulator core is currently structured, I am incrementing the lower 7 bits of the memory refresh register every time an opcode byte is fetched from memory. This means that for multi-byte instructions, such as those that begin with DD or FD, I am incrementing the register twice - or, in the case of an instruction such as RLC (IX+d), three times (as it is laid out opcode1-opcode2-d-opcode3).
Is this correct? I am unsure - the Z80 manual is a little unclear on this, as it says that CPDR (a two-byte instruction) increments it twice, yet the 'Memory Refresh Register' section merely says it is incremented after each instruction fetch. I have noticed that J80 (an emulator I checked, as I'm not sure about this) only increments after the first opcode byte of an instruction.
Which is correct? I guess it is not hugely important in any case, but it would be nice to know :-) Many thanks.
Regards,
Phil Potter
The Zilog timing diagrams hold the answer to your question.
A refresh occurs during T3 and T4 of all M1 (opcode fetch) cycles.
In the case of single-opcode instructions, that's one refresh per instruction. For single-prefix instructions (prefixes are read using M1 cycles) that's two refreshes per instruction.
For those weird DD-CB-disp-opcode and FD-CB-disp-opcode type instructions (weird because the displacement byte comes before the final opcode rather than after it), the number of refreshes is at least 3 (for the two prefixes and final opcode), but I'm not sure if the displacement byte is read as part of an M1 cycle (which would trigger another refresh) or a normal memory read cycle (no refresh). I'm inclined to believe the displacement byte is read in an M1 cycle for these instructions, but I'm not sure. I asked Sean Young about this; he wasn't sure either. Does anyone know for certain?
UPDATE:
I answered my own question re those weird DD-CB-disp-opcode and FD-CB-disp-opcode instructions. If you check Zilog's documentation for this type of instruction, such as
RLC (IX+d), you'll note that the instruction requires 6 M-cycles and 23 T-cycles, broken down as (4, 4, 3, 5, 4, 3).
We know the first two M-cycles are M1 cycles to fetch the DD and CB prefixes (4 T-cycles each). The next M-cycle reads the displacement byte d. But that M-cycle uses only 3 T-cycles, not 4, so it can't be an M1 cycle; instead it's a normal Memory Read cycle.
Here's the breakdown of the RLC (IX+d) instruction's six M-cycles:
M1 cycle to read the 0xDD prefix (4 T-cycles)
M1 cycle to read the 0xCB prefix (4 T-cycles)
Memory Read cycle to read the displacement byte (3 T-cycles)
M1 cycle to fetch the 0x06 opcode and load IX into the ALU (5 T-cycles)
Memory Read cycle to calculate and read from address IX+d (4 T-cycles)
Memory Write cycle to calculate RLC and write the result to address IX+d (3 T-cycles)
(The RLC calculation overlaps M-cycles 5 and 6.)
These instructions are unique in that they're the only Z80 instructions with non-contiguous M1 cycles (M-cycles 1, 2 and 4 above). They're also the slowest!
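The arithmetic in that breakdown can be checked directly (this just encodes the six M-cycles listed above):

```python
# Check of the RLC (IX+d) breakdown above: six M-cycles, 23 T-cycles in total,
# three of which are M1 (opcode fetch) cycles -- hence three R increments
# under the "increment at the end of every M1 cycle" model.
m_cycles = [
    ("M1: fetch 0xDD prefix", 4),
    ("M1: fetch 0xCB prefix", 4),
    ("read displacement byte", 3),
    ("M1: fetch 0x06 opcode", 5),
    ("read from IX+d", 4),
    ("write result to IX+d", 3),
]
t_total = sum(t for _, t in m_cycles)
m1_count = sum(1 for name, _ in m_cycles if name.startswith("M1"))
print(t_total, m1_count)  # 23 3
```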
Paul
Sean Young's Z80 Undocumented Features tells a different story: once for unprefixed instructions, twice for a single prefix, twice as well for a double prefix (DDCB only), and once for a no-op prefix.
Block instructions of course affect R every time they run (and they run BC times).
I've seen a couple of comments now that these weird DDCB and FDCB instructions only increment the R register twice.
It's always been my assumption (and the way I implemented my Z80 emulator) that the R register is incremented at the end of every M1 cycle.
To recap, these weird DDCB and FDCB instructions are four bytes long:
DD CB disp opcode
FD CB disp opcode
It's clear that the two prefix opcodes are read using M1 cycles, causing the R register to be incremented at the end of each of those cycles.
It's also clear that the displacement byte that follows the CB prefix is read by a normal Read cycle, so the R register is not incremented at the end of that cycle.
That leaves the final opcode. If it's read by an M1 cycle, then either the R register is incremented at the end of the cycle, resulting in a total of 3 increments, or the Z80 special cases this M1 cycle and doesn't increment the R register.
There's another possibility. What if the final opcode is read by a normal Read cycle, like the displacement byte that preceded it, and not by an M1 cycle? That of course would also cause the R register to be incremented only twice for these instructions, and wouldn't require the Z80 to make an exception of not incrementing the R register at the end of every M1 cycle.
This might also make better sense in terms of the Z80's internal state. Once it switches to normal Read cycles to read an instruction's additional bytes (in this case the displacement byte following the CB prefix), it never switches back to M1 cycles until it starts the next instruction.
Can anyone test this on real Z80 hardware, to confirm the value of R register following one of these DDCB or FDCB instructions?
All references I can find online say that R is incremented once per instruction irrespective of its length.
I've written an experimental function evaluator that allows me to bind simple functions together such that when the variables change, all functions that rely on those variables (and the functions that rely on those functions, etc.) are updated simultaneously. The way I do this is that instead of evaluating a function immediately as it's entered, I store the function. Only when an output value is requested do I evaluate the function, and I evaluate it each and every time an output value is requested.
For example:
pi = 3.14159
rad = 5
area = pi * rad * rad
perim = 2 * pi * rad
I define 'pi' and 'rad' as variables (well, functions that return a constant), and 'area' and 'perim' as functions. Any time either 'pi' or 'rad' change, I expect the results of 'area' and 'perim' to change in kind. Likewise, if there were any functions depending on 'area' or 'perim', the results of those would change as well.
This is all working as expected. The problem here is when the user introduces recursion - either accidental or intentional. There is no logic in my grammar - it's simply an evaluator - so I can't provide the user with a way to 'break out' of recursion. I'd like to prevent it from happening at all, which means I need a way to detect it and declare the offending input as invalid.
For example:
a = b
b = c
c = a
Right now, evaluating the last line results in a StackOverflowException (while the first two lines evaluate to '0'; an undeclared variable/function is equal to 0). What I would like to do is detect the circular-logic situation and forbid the user from inputting such a statement. I want to do this regardless of how deeply the circular logic is hidden, but I have no idea how to go about doing so.
Behind the scenes, by the way, input strings are converted to tokens via a simple scanner, then to an abstract syntax tree via a hand-written recursive descent parser, then the AST is evaluated. The language is C#, but I'm not looking for a code solution - logic alone will be fine.
Note: this is a personal project I'm using to learn about how parsers and compilers work, so it's not mission critical - however the knowledge I take away from this I do plan to put to work in real life at some point. Any help you guys can provide would be appreciated greatly. =)
Edit: In case anyone's curious, this post on my blog describes why I'm trying to learn this, and what I'm getting out of it.
I've had a similar problem to this in the past.
My solution was to push variable names onto a stack as I recursed through the expressions to check syntax, and pop them as I exited a recursion level.
Before I pushed each variable name onto the stack, I would check if it was already there.
If it was, then this was a circular reference.
I was even able to display the names of the variables in the circular reference chain (as they would be on the stack and could be popped off in sequence until I reached the offending name).
EDIT: Of course, this was for single formulae... For your problem, a cyclic graph of variable assignments would be the better way to go.
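The stack-based check described above can be sketched like this (the dependency-map representation is an assumption; the question's evaluator works on an AST, but the walk is the same):

```python
# Sketch of the stack-based check: walk each definition's dependencies,
# pushing names as we descend and popping as we exit a recursion level.
# Hitting a name already on the stack means a circular reference, and the
# stack contents give the offending chain for the error message.
def find_cycle(defs):
    """defs maps a name to the list of names its expression references."""
    stack = []

    def visit(name):
        if name in stack:                  # already on the stack: cycle found
            return stack[stack.index(name):] + [name]
        stack.append(name)
        for dep in defs.get(name, []):     # undeclared names resolve to 0
            cycle = visit(dep)
            if cycle:
                return cycle
        stack.pop()                        # done with this recursion level
        return None

    for name in defs:
        cycle = visit(name)
        if cycle:
            return cycle
    return None

print(find_cycle({"a": ["b"], "b": ["c"], "c": ["a"]}))  # ['a', 'b', 'c', 'a']
print(find_cycle({"area": ["pi", "rad"], "perim": ["pi", "rad"]}))  # None
```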
A solution (probably not the best) is to create a dependency graph.
Each time a function is added or changed, the dependency graph is checked for cycles.
This can be cut short. Each time a function is added, or changed, flag it. If the evaluation results in a call to the function that is flagged, you have a cycle.
Example:
a = b
flag a
eval b (not found)
unflag a
b = c
flag b
eval c (not found)
unflag b
c = a
flag c
eval a
eval b
eval c (flagged) -> Cycle, discard change to c!
unflag c
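The flagged evaluation traced above might look like this as code (representing each definition by its list of referenced names is an assumption for the sketch):

```python
# The flag/unflag scheme from the trace above: before accepting a new
# definition, flag the name being evaluated; reaching an already-flagged
# name during evaluation means a cycle, and the change is discarded.
defs = {}                                  # name -> names its expression uses
flags = set()

def eval_name(name):
    if name in flags:
        raise ValueError(f"cycle through {name!r}")
    if name not in defs:                   # undeclared names evaluate to 0
        return
    flags.add(name)                        # flag on entry
    try:
        for dep in defs[name]:
            eval_name(dep)
    finally:
        flags.discard(name)                # unflag on exit

def define(name, deps):
    old = defs.get(name)
    defs[name] = deps
    try:
        eval_name(name)                    # trial evaluation of the new def
    except ValueError:
        if old is None:                    # cycle: discard the change
            del defs[name]
        else:
            defs[name] = old
        return False
    return True

print(define("a", ["b"]))   # True  (b not found, evaluates to 0)
print(define("b", ["c"]))   # True
print(define("c", ["a"]))   # False (eval a -> b -> c, and c is flagged)
```

This is essentially the same walk as the stack approach in the previous answer; the flag set is just a faster membership test, at the cost of not recording the order of the chain.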
In reply to the comment on answer two:
(Sorry, just messed up my openid creation so I'll have to get the old stuff linked later...)
If you switch "flag" for "push" and "unflag" for "pop", it's pretty much the same thing :)
The only advantage of using the stack is the ease with which you can provide detailed information on the cycle, no matter what the depth. (Useful for error messages :) )
Andrew