Clock Cycles per add Instruction - pipeline

Let's say I have a simple add instruction which needs to add two numbers. I need one cycle too fetch the instruction, one cycle to decode it, two cycles to fetch the numbers, one cycle to add the two numbers and one cycle to write the outcome back.
So for such instruction I need 6 cycles. If you search the internet, many sources say that addition needs just one cycle. Do they refer just to the execution time, or to the whole pipeline?

Related

Pipelining affects the clock time or cycle-per-instruction(CPI)?

My book mentions " Depending on what you consider as the baseline, the reduction can be viewed as decreasing the number of clock cycles per instruction (CPI), as decreasing the clock cycle time, or as a combination.If the starting point is a processor that takes multiple clock cycles per instruction, then pipelining is usually viewed as reducing the CPI."
What I fail to understand is pipelining affects CPI or the clock period because in case of pipelining clock period is taken as max stage-delay + Latch-delay so pipelining does affect the clock time . Also it affects CPI because it becomes 1 in case of pipelining. Am I missing on some concept?
Executing an instruction requires a set of operations. For the sake of simplicity assume there are 5:
fetch-instruction decode-execute-memory access-write back.
This can be implemented with several schemes.
A/ Mono cycle processor
The scheme is the following:
The processor fetches an instruction, directs it to a decoder that controls a bank of multiplexers that will configure a large combinatorial datapath that will implement the instruction.
In this model, every instruction requires one cycle, and, assuming all the 5 "stages" require an equal time t, the period will be 5t.
Hence CPI=1, T=5
Actually, this was more or less the underlying model of the earlier computers in the late 40's. Besides that, no real processor has be done like that, but it is theorically quite doable.
B/ Multi cycle processor
Compared to the previous model, you introduce registers on the datapath. First one fetches the instruction and sends it to the inputs of an automaton that will sequentially apply the computation "stages".
In that case, instructions require 5 cycles (maybe slightly less as some instructions may be simpler and, for instance, skip the memory access). Period is 1t (or maybe slighly more to take into account the registers traversal time).
CPI=5, T=1
The first "true" computers were implemented like that and this was the main architectural model up to the early 80's. Nowadays several microcontrollers or, for instance, the simpler version of NIOS, are still relying on this scheme.
C/ pipeline processor
You add extra registers between the stages in order to keep track of the instruction and of all the partial results. In that case, the execution of every stage can be independent and you can execute several instructions simutaneously in different stages.
CPI becomes 1, as you can start a new instruction at every clock cycle (probably a bit more because of the hazards, but that is another story).
And T=1.
So CPI=1, T=1
(the CPI reflects the throughput increase but the execution time of a single instruction is not reduced)
So pipeline can be seen as either reducing the cycle time wrt scheme A, or reducing the CPI, wrt to scheme B. And you can also imagine an intermediate scheme (say 3 stages, with a period of 2) where pipeline will reduce both.

Is it OK for a DirectShow filter to seek the filters upstream from itself?

Normally seek commands are executed on a filter graph, get called on the renderers in the graph and calls are passed upstream by filters until a filter that can handle the seek does the actual seek operation.
Could an individual filter seek the upstream filters connected to one or more of its input pins in the same way without it affecting the downstream portion of the graph in unexpected ways? I wouldn't expect that there wouldn't be any graph state changes caused by calling IMediaSeeking.SetPositions upstream.
I'm assuming that all upstream filters are connected to the rest of the graph via this filter only.
Obviously the filter would need to be prepared to handle the resulting BeginFlush, EndFlush and NewSegment calls coming from upstream appropriately and distinguish samples that arrived before and after the seek operation. It would also need to set new sample times on its output samples so that the output samples had consistent sample presentation times. Any other issues?
It is perfectly feasible to do what you require. I used this approach to build video and audio mixer filters for a video editor. A full description of the code is available from the BBC White Papers 129 and 138 available from http://www.bbc.co.uk/rd
A rather ancient version of the code can be found on www.SourceForge.net if you search for AAFEditPack. The code is written in Delphi using DSPack to get access to the DirectShow headers. I did this because it makes it easier to handle com object lifetimes - by implementing smart pointers by default. It should be fairly straightforward to transfer the ideas to a C++ implementation if that is what you use.
The filters keep lists of the sub-graphs (a section of a graph but running in the same FilterGraph as the mixers). The filters implement a custom version of TBCPosPassThru which knows about the output pins of the sub-graph for each media clip. It handles passing on the seek commands to get each clip ready for replay when its point in the timeline is reached. The mixers handle the BeginFlush, EndFlush, NewSegment and EndOfStream calls for each sub-graph so they are kept happy. The editor uses only one FilterGraph that houses both video and audio graphs. Seeking commands are make by the graph on both the video and audio renderers and these commands are passed upstream to the mixers which implement them.
Sub-graphs that are not currently active are blocked by the mixer holding references to the samples they have delivered. This does not cause any problems for the FilterGraph because, as Roman R says, downstream filters only care about getting a consecutive stream of sample and do not know about what happens upstream.
Some key points you need to make sure of to avoid wasted debugging time are:
Your decoder filters need to be able to queue to the exact media frame or audio time. Not as easy to do as you might expect, especially with compressed formats such as mpeg2, which was designed for transmission and has no frame index in the files. If you do not do this, the filter may wait indefinitely to get a NewSegment call with the correct media times.
Your sub graphs need to present a NewSegment time equal to the value you asked for in your seek command before delivering samples. Some decoders may seek to the nearest key frame, which is a bit unhelpful and some are a bit arbitrary about the timings of their NewSegment and the following samples.
The start and stop times of each clip need to be within the duration of the file. Its probably not a good idea to police this in the DirectShow filter because you would probably want to construct a timeline without needing to run the filter first. I did this in the component that manages the FilterGraph.
If you want to add sections from the same source file consecutively in the timeline, and have effects that span the transition, you need to have two instances of the sub-graph for that file and if you have more than one transition for the same source file, your list needs to alternate the graphs for successive clips. This is because each sub graph should only play monotonically: calling lots of SetPosition calls would waste cpu cycles and would not work well with compressed files.
The filter's output pins define the entire seeking behaviour of the graph. The output sample time stamps (IMediaSample.SetTime) are implemented by the filter so you need to get them correct without any missing time stamps. and you can also set the MediaTime (IMediaSample.SetMediaTime) values if you like, although you have to be careful to get them correct or the graph may drop samples or stall.
Good luck with your development. If you need any more information please contact me through StackOverflow or DTSMedia.co.uk

Chicken or egg in functions/actors processing WorldState

I have read the series "Purely Functional Retrogames"
http://prog21.dadgum.com/23.html
It discusses some interesting techniques to build a (semi-)pure game world update loop.
However, I have the following side remark, that I cannot seem to get my head around:
Suppose you have a system, where every enemy, and every player are separate actors, or separate pure functions.
Suppose they all get a "WorldState" as input, and output a New WorldState (or, if you're thinking in actor terms, send the new WorldState to the next actor, ending with for instance a "Game Render" actor).
Then, there's two ways to go with this:
Either you start with one actor, (f.i. the player), and feed him the "current world".
Then, you feed the new world, to the next enemy, and so on, until all actors have converted the worlds. Then, the last world is the new world you can feed to the render loop. (or, if you followed the above article, you end up with a list of events that have occurred in the world, which can be processed).
A second way, is to just give all actors the current WorldState at the same time. They generate any changes, which may conflict (for instance, two enemies and the player can take a coin in the same animation frame) -> it is up to the game system to solve these conflicts by processing the events. By processing all events, the Game actor creates the new world, to be used in the next update frame.
I have a feeling I'm just confronted with the exact same "race condition" problem I wished to avoid by using pure functions with immutable data.
Any advice here?
I didn't read the article, but with the coin example you are creating a kind of global variable: you give a copy of the world state to all actors and you suppose that each actor will evaluate the game, take decision and expect that their action will succeed, regardless of the last phase which is the conflict solving. I would not call this a race condition but rather a "blind condition", and yes, this will not work.
I imagine that you came to this solution in order to allow parallelism, not available in solution 1. In my opinion, the problem is about responsibility.
The coin must belong to an actor (a server acting as resource manager) as anything in the application. This actor is the only responsible to decide what will happen to the coin.
All requests (is there something to grab, grab it, drop something...) should be sent to this actor (one actor per cell, or per map, level or any split that make sense for the game).
The way you will manage it is then up to you: serve all requests in the receive order, buffer them until a synchro message comes and make a random decision or priority decision... In any case the server will be able to reply to all actors with success or failure without any risk of race condition since the server process is run (at least in erlang) on a single core and proceed one message at a time.
In addition to Pascal answer, you can solve parallelization by splitting (i assume huge map) to smaller chunks which depend on last state (or part of it, like an edge) of its neighbours. This allows you to distribute this game among many nodes.

Single-cycle vs a pipelined approach

I understand that single-cycle programs are not very efficient. One reason is because not all instructions are equal in length, but in a single-cycle program, all instructions are completed in the same length of time.
In pipeline, throughput is increased, which means the time between one output and the next will be shorter than in a single-cycle implementation after you reach a certain point. But then can you say that instructions in a pipelined approach take the same amount of time (going from IF/Instruction Fetch to WB/Writeback)? Or is this the wrong conclusion?
See all instructions in a single cycle non pipelined structure do not necessarily take same amount of time rather the next instruction to be executed after an instruction can not start until the next clock cycle ,current instruction may complete before the current cycle because cycle length is determined by the longest instruction.e.g add register completes before load in a RISC.
Now in a pipelined structure processor is
multistage with register to store and propogate the state of processor.Now basically on pipelined processor we save time by overlapping two instructionss' substages.hence even though individually the length of instruction is increased but overall time has reduced.Now see every instruction may not go through all the stages eg load and add again
So overall latency for each instruction will consist of all the stages but its execution may have had taken less number of cycles
So you can say that latency of each instruction is same but not the execution time or cycles consumed

How to Implement embarrassingly parallel task (FOR loop) WITHOUT MPI-IO?

Preamble:
I have a very large array (one dim) and need to solve evolution equation (wave-like eq). I I need to calculate integral at each value of this array, to store the resulting array of integral and apply integration again to this array, and so on (in simple words, I apply integral on grid of values, store this new grid, apply integration again and so on).
I used MPI-IO to spread over all nodes: there is a shared .dat file on my disc, each MPI copy reads this file (as a source for integration), performs integration and writes again to this shared file. This procedure repeats again and again. It works fine. The most time consuming part was the integration and file reading-writing was negligible.
Current problem:
Now I moved to 1024 (16x64 CPU) HPC cluster and now I'm facing an opposite problem: a calculation time is NEGLIGIBLE to read-write process!!!
I tried to reduce a number of MPI processes: I use only 16 MPI process (to spread over the nodes) + 64 threads with OpenMP to parallelize my computation inside of each node.
Again, reading and writing processes is the most time consuming part now.
Question
How should I modify my program, in order to utilize the full power of 1024 CPUs with minimal loss?
The important point, is that I cannot move to the next step without completing the entire 1D array.
My thoughts:
Instead of reading-writing, I can ask my rank=0 (master rank) to send-receive the entire array to all nodes (MPI_Bcast). So, instead of each node will I/O, only one node will do it.
Thanks in advance!!!
I would look here and here. FORTRAN code for the second site is here and C code is here.
The idea is that you don't give the entire array to each processor. You give each processor only the piece it works on, with some overlap between processors so they can handle their mutual boundaries.
Also, you are right to save your computation to disk every so often. And I like MPI-IO for that. I think it is the way to go. But the codes in the links will allow you to run without reading every time. And, for my money, writing out the data every single time is overkill.

Resources