Pipelining affects the clock time or cycle-per-instruction(CPI)?

Pipelining affects the clock time or cycle-per-instruction(CPI)? - pipeline

My book mentions " Depending on what you consider as the baseline, the reduction can be viewed as decreasing the number of clock cycles per instruction (CPI), as decreasing the clock cycle time, or as a combination.If the starting point is a processor that takes multiple clock cycles per instruction, then pipelining is usually viewed as reducing the CPI."
What I fail to understand is pipelining affects CPI or the clock period because in case of pipelining clock period is taken as max stage-delay + Latch-delay so pipelining does affect the clock time . Also it affects CPI because it becomes 1 in case of pipelining. Am I missing on some concept?

Executing an instruction requires a set of operations. For the sake of simplicity assume there are 5:
fetch-instruction decode-execute-memory access-write back.
This can be implemented with several schemes.
A/ Mono cycle processor
The scheme is the following:
The processor fetches an instruction, directs it to a decoder that controls a bank of multiplexers that will configure a large combinatorial datapath that will implement the instruction.
In this model, every instruction requires one cycle, and, assuming all the 5 "stages" require an equal time t, the period will be 5t.
Hence CPI=1, T=5
Actually, this was more or less the underlying model of the earlier computers in the late 40's. Besides that, no real processor has be done like that, but it is theorically quite doable.
B/ Multi cycle processor
Compared to the previous model, you introduce registers on the datapath. First one fetches the instruction and sends it to the inputs of an automaton that will sequentially apply the computation "stages".
In that case, instructions require 5 cycles (maybe slightly less as some instructions may be simpler and, for instance, skip the memory access). Period is 1t (or maybe slighly more to take into account the registers traversal time).
CPI=5, T=1
The first "true" computers were implemented like that and this was the main architectural model up to the early 80's. Nowadays several microcontrollers or, for instance, the simpler version of NIOS, are still relying on this scheme.
C/ pipeline processor
You add extra registers between the stages in order to keep track of the instruction and of all the partial results. In that case, the execution of every stage can be independent and you can execute several instructions simutaneously in different stages.
CPI becomes 1, as you can start a new instruction at every clock cycle (probably a bit more because of the hazards, but that is another story).
And T=1.
So CPI=1, T=1
(the CPI reflects the throughput increase but the execution time of a single instruction is not reduced)
So pipeline can be seen as either reducing the cycle time wrt scheme A, or reducing the CPI, wrt to scheme B. And you can also imagine an intermediate scheme (say 3 stages, with a period of 2) where pipeline will reduce both.

Related

Completely simultaneous execution of two instructions (RISCV)

The question comes from a RISCV implementation, but I think it may also apply to many other architectures.
From a code with two completely independent instructions in sequence (generic ISA notation):
REG1 = REG2 + REG3
REG4 = REG5 + REG6
In a pipelined implementation, assuming there are no other hazards (simultaneous r/w access to the registers is possible and there are two independent adders), is it a violation of the ISA if the two instructions are executed completely in parallel?
In other words, at the same clock edge, can the 3 registers (REG1, REG4 and PC) be updated at once (PC+8 for the RISCV-32 example)?

No, clearly there's no problem, since real CPUs do this all the time. (e.g. Intel since Haswell can run 4 independent add instructions per clock: https://www.realworldtech.com/haswell-cpu/4/ https://uops.info/ https://agner.org/optimize/).
It only has to maintain the illusion of having run instructions one at a time, following the ISA's sequential execution model. The same concept as the C "as-if" rule applies.
If the ISA doesn't guarantee anything about timing, like that you can delay N clock cycles with N nop or other instructions, nothing stops a specific implementation from doing as much work as possible in a clock cycle. (Some microcontrollers do have specific timing guarantees or specifications, so code can delay for N cycles with delay loops. Or at least specific implementations of some ISAs have such guarantees.)
It's 100% normal for modern CPUs to average more than 1 instruction per clock, despite stalling sometimes on cache misses and branch mispredicts, so that clearly means fetching, decoding, and executing multiple instructions per clock cycle in other cycles. See also Modern Microprocessors
A 90-Minute Guide! for some basics of superscalar in-order and out-of-order pipelines.

Arguing whether a situation leads to data hazard or not

I was going through the section of pipelining from the text Computer Organization [5e] by Hamacher et. al.. There I came across a situation which the authors claim causes data hazard.
The situation is shown below:
For example, stage E in the four-stage pipeline of Figure 8.2b is responsible for arithmetic and logic operations, and one clock cycle is assigned for this task. Although this may be sufficient for most operations, some operations, such as divide, may require more time to complete. Figure 8.3 shows an example in which the operation specified in instruction I2 requires three cycles to complete, from cycle 4 through cycle 6. Thus, in cycles 5 and 6, the Write stage must be told to do nothing, because it has no data to work with. †: Meanwhile, the information in buffer B2 must remain intact until the Execute stage has completed its operation. This means that stage 2 and, in turn, stage 1 are blocked from accepting new instructions because the information in B1 cannot be overwritten. Thus, steps D4 and F5 must be postponed as shown.
... Any condition that causes the pipeline to stall is called a hazard. We have just seen an example of a data hazard. A data hazard is any condition in which either the source or the destination operands of an instruction are not available at the time expected in the pipeline.
In the example above, the authors assume that a data hazard has occurred, and two stall cycles are introduced into the pipeline. The main reason that they give for this data hazard is that, since the execute phase requires 2 more cycles than the usual need for instruction 2, so the data on which the write back stage should work has to wait for 2 cycles...
But I am having a little difficulty in accepting this analysis. Usually, the books give examples of data hazards in situations, where there is data dependency (the usual RAW, WAR, etc..). But here there is no such thing. And I thought this to be a structural hazard assuming that I2 cannot use the EX stage as I1 is using it.
Moreover, the text assumes that there is no queuing of the results of the stages in the buffer. Clear from the statement marked with †, Meanwhile, the information in the buffer..., (where there is a little flaw as well, because, if no queuing is there, then the output of D3 in cycle 4 shall overwrite the value in buffer B2 on which the EX stage is working, a contradiction to their own assumption).
I thought that the stalls are introduced due to this no queuing condition... and structural hazard, and if things are properly managed as shown below, no stalls shall be there.
This is what I assume:
I assume that the execute stage has more than one separate functional units (e.g. one where calculations of instruction 1 are performed. [basic ALU requiring 1 cycle duration], one for integer division, another for integer multiplication etc.) [So structural hazard is out of the way now.]
I also assume that the pipeline buffers can store the results produced in the stages in a queue. [So that the problem in statement marked with † is no longer there.]
This being said, the situation is now as follows:
However hard I tried with the assumptions, I could not remove the bubbles shown in blue. [Even if queuing is assumed in buffers, the buffers cannot give the result out of order, so those stalls remain].
With this exercise of mine, I feel that the example shown in the text is indeed a hazard and that too data hazard (even though there was no data dependencies ?), as in my exercise there was no chance of structural hazard...
Am I correct?

And I thought this to be a structural hazard assuming that I2 cannot use the EX stage as I1 is using it.
Yup, that's the terminology I'd use, based on wikipedia: https://en.wikipedia.org/wiki/Hazard_(computer_architecture).
That article restricts data hazards to only RAW, WAR, and WAW. As such, they're only visible when you consider which operands are being used.
e.g. an independent multiply (result not read by the next few insns) could be allowed to complete out of order, after executing in a separate multi-cycle or pipelined multiplier unit.
Write-back conflicts would be a problem if the slow ALU instruction needed to write a GPR in the same cycle as a later add or something. Also data hazards like WAW, since mul r3, r2, r1 / sw r3, (r4) / add r3, r2, r1 should leave r3 = r2+r1 not r2*r1.
MIPS solved all that with the special hi:lo reg pair for mult/div, allowing the mul and div units to be loosely coupled to the 5-stage pipeline. And the ISA had pretty relaxed rules about what was allowed to happen to those registers, e.g. writing one with mthi r3 destroys the previous value of the other, so mflo r2 would give unpredictable results after mthi. Raymond Chen's article.
An "in-order pipeline" means instructions start execution in program order, no necessarily that they complete in program order. It's very common for modern in-order pipelines to allow memory operations to complete out of order, allowing memory-level parallelism and allowing instruction scheduling to hide load-use latency of L1d cache hits. It's also possible to pipeline higher-latency ALU operations as long as hazards are detected and handled somehow.
Do these authors use the term "structural hazard" at all, or do they consider all (non-control?) hazards to be data hazards?
At this point it seems like primarily a terminology issue. IDK if they're on their own in using terminology this way, or if there is another convention with any popularity other than the one Wikipedia describes.
Separate from your main question, In clock cycles 4 and 5, you have two instructions in the E stage at the same time. If something stalls in the E stage, the stall bubbles need to come before the E stage in later instructions, like in the Fig 8.3 image you linked from the book.
And yeah, it's weird that they talk about the pipeline register between stages needing to stay constant. If a multi-cycle non-pipelined execution unit needs to keep values around, it could snapshot them.
Unless maybe the stall signal makes the Decode stage keep generating that output repeatedly until the stall signal is de-asserted and the pipeline register will finally latch the output of the previous stage instead of ignoring it. There are latches / flip-flops that have a control signal separate from the clock that makes them ignore their input and keep outputting what they were already outputting.

CPU memory access time

Does the average data and instruction access time of the CPU depends on the execution time of an instruction?
For example if miss ratio is 0.1, 50% instructions need memory access,L1 access time 3 clock cycles, mis penalty is 20 and instructions execute in 1 cycles what is the average memory access time?

I'm assume you're talking about a CISC architecture where compute instructions can have memory references. If you have a sequence of ADDs that access memory, then memory requests will come more often than a sequence of the same number of DIVs, because the DIVs take longer. This won't affect the time of the memory access -- only locality of reference will affect the average memory access time.
If you're talking about a RISC arch, then we have separate memory access instructions. If memory instructions have a miss rate of 10%, then the average access latency will be the L1 access time (3 cycles for hit or miss) plus the L1 miss penalty times the miss rate (0.1 * 20), totaling an average access time of 5 cycles.
If half of your instructions are memory instructions, then that would factor into clocks per instruction (CPI), which would depend on miss rate and also dependency stalls. CPI will also be affected by the extent to which memory access time can overlap computation, which would be the case in an out-of-order processor.
I can't answer your question a lot better because you're not being very specific. To do well in a computer architecture class, you will have to learn how to figure out how to compute average access times and CPI.

Well, I'll go ahead and answer your question, but then, please read my comments below to put things into a modern perspective:
Time = Cycles * (1/Clock_Speed) [ unit check: seconds = clocks * seconds/clocks ]
So, to get the exact time you'll need to know the clock speed of your machine, for now, my answer will be in terms of Cycles
Avg_mem_access_time_in_cycles = cache_hit_time + miss_rate*miss_penalty
= 3 + 0.1*20
= 5 cycles
Remember, here I'm assuming your miss rate of 0.1 means 10% of cache accesses miss the cache. If you're meaning 10% of instructions, then you need to halve that (because only 50% of instrs are memory ops).
Now, if you want the average CPI (cycles per instr)
CPI = instr% * Avg_mem_access_time + instr% * Avg_instr_access_time
= 0.5*5 + 0.5*1 = 3 cycles per instruction
Finally, if you want the average instr execution time, you need to multiply 3 by the reciprocal of the frequency (clock speed) of your machine.
Comments:
Comp. Arch classes basically teach you a very simplified way of what the hardware is doing. Current architectures are much much more complex and such a model (ie the equations above) is very unrealistic. For one thing, access time to various levels of cache can be variable (depending on where physically the responding cache is on the multi- or many-core CPU); also access time to memory (which typically 100s of cycles) is also variable depending on contention of resources (eg bandwidth)...etc. Finally, in modern CPUs, instructions typically execute in parallel (ILP) depending on the width of the processor pipeline. This means adding up instr execution latencies is basically wrong (unless your processor is a single-issue processor that only executes one instr at a time and blocks other instructions on miss events such as cache miss and br mispredicts...). However, for educational purpose and for "average" results, the equations are okay.
One more thing, if you have a multi-level cache hierarchy, then the miss_penalty of level 1 cache will be as follows:
L1$ miss penalty = L2 access time + L1_miss_rate*L2_miss_penalty
If you have an L3 cache, you do a similar thing to L2_miss_penalty and so on

Understanding CYCLE_ACTIVITY.* Haswell Performance-Monitoring Events

I'm trying to analyse an execution on an Intel Haswell CPU (Intel® Core™ i7-4900MQ) with the Top-down Microarchitecture Analysis Method (TMAM), described in Chapters B.1 and B.4 of the Intel® 64 and IA-32 Architectures
Optimization Reference Manual. (I adjust the Sandy Bridge formulas described in B.4 to the Haswell Microarchitecture if needed.)
Therefore I perform performance counter events measurements with Perf. There are some results I don’t understand:
CPU_CLK_UNHALTED.THREAD_P < CYCLE_ACTIVITY.CYCLES_LDM_PENDING
This holds only for a few measurements, but still is weird. Does the PMU count halted cycles for CYCLE_ACTIVITY.CYCLES_LDM_PENDING?
CYCLE_ACTIVITY.CYCLES_L2_PENDING > CYCLE_ACTIVITY.CYCLES_L1D_PENDING
and CYCLE_ACTIVITY.STALLS_L2_PENDING > CYCLE_ACTIVITY.STALLS_L1D_PENDING
This applies for all measurements. When there is a L1D cache miss, the load gets transferred to the L2 cache, right? So a load missed L2 earlier also missed L1. There is the L1 instruction cache not counted here, but with *_L2_PENDING being 100x or even 1000x greater than *_L1D_PENDING it is probably not that.. Are the stalls/cycles being measured somehow separately? But than there is this formula:
%L2_Bound =
(CYCLE_ACTIVITY.STALLS_L1D_PENDING - CYCLE_ACTIVITY.STALLS_L2_PENDING) / CLOCKS
Hence CYCLE_ACTIVITY.STALLS_L2_PENDING < CYCLE_ACTIVITY.STALLS_L1D_PENDING is assumed (the result of the formula must be positive). (The other thing with this formula is that it should probably be CYCLES instead of STALLS. However this wouldn't solve the problem described above.) So how can this be explained?
edit: My OS: Ubuntu 14.04.3 LTS, kernel: 3.13.0-65-generic x86_64, perf version: 3.13.11-ckt26

I'll start with the second part of the question, i.e., how CYCLE_ACTIVITY.CYCLES_L2_PENDING and CYCLE_ACTIVITY.STALLS_L2_PENDING can be larger than CYCLE_ACTIVITY.CYCLES_L1D_PENDING and CYCLE_ACTIVITY.STALLS_L1D_PENDING, respectively.
First, note that the formula for %L2_Bound is from Section B.5 of the Intel Optimization Manual. The first paragraph of that section says:
This section covers various performance tuning techniques using
performance monitoring events. Some techniques can be adapted in
general to other microarchitectures, most of the performance events
are specific to Intel microarchitecture code name Sandy Bridge.
My first hunch was that prefetching has something to do with it (see my comment). This paragraph pushed me further in the right direction; these events may represent different things in Sandy Bridge and in Haswell. Here is what they mean on Haswell:
CYCLE_ACTIVITY.CYCLES_L1D_PENDING: Cycles with pending L1 data cache
miss loads. CYCLE_ACTIVITY.CYCLES_L2_PENDING: Cycles with pending L2
miss loads. CYCLE_ACTIVITY.STALLS_L1D_PENDING: Execution stalls due to
L1 data cache miss loads. CYCLE_ACTIVITY.STALLS_L2_PENDING: Number of
loads missed L2.
The manual also says the counters for L2 should only be used when hyperthreading is disabled. Now here is what they mean on Sandy Bridge:
CYCLE_ACTIVITY.CYCLES_L1D_PENDING: Each cycle there was a miss-pending
demand load this thread, increment by 1.
CYCLE_ACTIVITY.CYCLES_L2_PENDING: Each cycle there was a MLC-miss
pending demand load this thread, increment by 1.
CYCLE_ACTIVITY.STALLS_L1D_PENDING: Each cycle there was a miss-pending
demand load this thread and no uops dispatched, increment by 1.
CYCLE_ACTIVITY.STALLS_L2_PENDING: Each cycle there was a MLC-miss
pending demand load and no uops dispatched on this thread, increment
by 1.
There are three important differences:
Some of the Haswell events can only valid when HT is disabled. All SNB events are valid even when HT is enabled.
CYCLE_ACTIVITY.STALLS_L2_PENDING on HSW counts the number of load misses at L2, but on SNB, it counts the number of cycles during which there was at least one demand load miss at L2.
The HSW events include all accesses, not just demand loads. In contrast, the SNB events only occur for demand loads.
On HSW, CYCLE_ACTIVITY.CYCLES_L2_PENDING can be larger than CYCLE_ACTIVITY.CYCLES_L1D_PENDING because of the miss-pending loads issued by the L1D prefetcher (and/or the L2 prefetcher(s) depending on whether the prefetcher increments the counter for the same level of cache). Similarly, while they count different things, CYCLE_ACTIVITY.STALLS_L2_PENDING can be larger than CYCLE_ACTIVITY.STALLS_L1D_PENDING due to prefetching. TLB prefetching and prefetching at other MMU caches may also impact these performance events on HSW. On the other hand, on SNB, it is guaranteed that CYCLE_ACTIVITY.STALLS_L2_PENDING < CYCLE_ACTIVITY.STALLS_L1D_PENDING, and that's why the %L2_Bound formula is valid on SNB.
Like I said in the comment, disabling HT and/or prefetching may "fix" your problem.
Actually, the Intel spec update document for the Mobile Haswell processors mentions two bugs that affect CYCLES_L2_PENDING:
HSM63: The intended behavior of CYCLES_L2_PENDING on Haswell is to count only for demand loads, but it may count inaccurately in SMT mode.
HSM80: CYCLES_L2_PENDING may overcount due to requests from the next page prefetcher.
I think you can minimize the error in CYCLES_L2_PENDING by disabling SMT (either in BIOS or putting the other logical core into sleep). In addition, try to not trigger the NPP. This can be achieved by avoiding locations towards the end of a virtual page where the translation of the next page is not already in the TLB hierarchy.
Related: When L1 misses are a lot different than L2 accesses… TLB related?
Regarding the first part of the question, i.e., how CPU_CLK_UNHALTED.THREAD_P can be smaller than CYCLE_ACTIVITY.CYCLES_LDM_PENDING. One explanation that I could think of is that the CYCLE_ACTIVITY.CYCLES_LDM_PENDING occurs for loads issued from (some) other threads (in particular, on the same physical core), not just the halted thread. Erratum HSM146 mentions that CYCLES_LDM_PENDING may count inaccurately when the logical core is not in C0, which explains how CPU_CLK_UNHALTED.THREAD_P can be smaller than CYCLES_LDM_PENDING. Disabling HT may eliminate this inaccuracy, although the spec update document doesn't provide any workaround.

Single-cycle vs a pipelined approach

I understand that single-cycle programs are not very efficient. One reason is because not all instructions are equal in length, but in a single-cycle program, all instructions are completed in the same length of time.
In pipeline, throughput is increased, which means the time between one output and the next will be shorter than in a single-cycle implementation after you reach a certain point. But then can you say that instructions in a pipelined approach take the same amount of time (going from IF/Instruction Fetch to WB/Writeback)? Or is this the wrong conclusion?

See all instructions in a single cycle non pipelined structure do not necessarily take same amount of time rather the next instruction to be executed after an instruction can not start until the next clock cycle ,current instruction may complete before the current cycle because cycle length is determined by the longest instruction.e.g add register completes before load in a RISC.
Now in a pipelined structure processor is
multistage with register to store and propogate the state of processor.Now basically on pipelined processor we save time by overlapping two instructionss' substages.hence even though individually the length of instruction is increased but overall time has reduced.Now see every instruction may not go through all the stages eg load and add again
So overall latency for each instruction will consist of all the stages but its execution may have had taken less number of cycles
So you can say that latency of each instruction is same but not the execution time or cycles consumed

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex