Completely simultaneous execution of two instructions (RISCV) - pipeline

The question comes from a RISCV implementation, but I think it may also apply to many other architectures.
Consider code with two completely independent instructions in sequence (generic ISA notation):
REG1 = REG2 + REG3
REG4 = REG5 + REG6
In a pipelined implementation, assuming there are no other hazards (simultaneous r/w access to the registers is possible and there are two independent adders), is it a violation of the ISA if the two instructions are executed completely in parallel?
In other words, at the same clock edge, can the 3 registers (REG1, REG4 and PC) be updated at once (PC+8 for the RISCV-32 example)?

No, clearly there's no problem, since real CPUs do this all the time. (e.g. Intel since Haswell can run 4 independent add instructions per clock: https://www.realworldtech.com/haswell-cpu/4/ https://uops.info/ https://agner.org/optimize/).
It only has to maintain the illusion of having run instructions one at a time, following the ISA's sequential execution model. The same concept as the C "as-if" rule applies.
If the ISA doesn't guarantee anything about timing, like that you can delay N clock cycles with N nop or other instructions, nothing stops a specific implementation from doing as much work as possible in a clock cycle. (Some microcontrollers do have specific timing guarantees or specifications, so code can delay for N cycles with delay loops. Or at least specific implementations of some ISAs have such guarantees.)
It's 100% normal for modern CPUs to average more than 1 instruction per clock, despite stalling sometimes on cache misses and branch mispredicts, so that clearly means fetching, decoding, and executing multiple instructions per clock cycle in other cycles. See also Modern Microprocessors: A 90-Minute Guide! for some basics of superscalar in-order and out-of-order pipelines.
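To make the "illusion of sequential execution" concrete, here is a minimal Python sketch. It is a toy model, not a description of any real RISC-V core: the tuple format, the function names and the register values are all made up for illustration. It checks whether two add instructions are independent, and shows that updating REG1, REG4 and PC on the same edge gives the same architectural state as running the instructions one at a time.

```python
# Toy model (hypothetical): an instruction is (dst, src1, src2), meaning dst = src1 + src2.

def independent(first, second):
    """True if `second` neither reads nor overwrites the result of `first`."""
    dst1, _, _ = first
    dst2, a2, b2 = second
    raw = dst1 in (a2, b2)   # read-after-write: second needs first's result
    waw = dst1 == dst2       # write-after-write: both target the same register
    return not (raw or waw)

def run_sequential(regs, program):
    """Architectural model: one instruction at a time, PC += 4 after each."""
    regs = dict(regs)
    for dst, a, b in program:
        regs[dst] = regs[a] + regs[b]
        regs["PC"] += 4
    return regs

def run_dual_issue(regs, first, second):
    """Both adds read the old register values; all writes land on the same edge."""
    old = dict(regs)
    new = dict(regs)
    for dst, a, b in (first, second):
        new[dst] = old[a] + old[b]
    new["PC"] = old["PC"] + 8
    return new

regs = {"PC": 0, "REG1": 0, "REG2": 2, "REG3": 3, "REG4": 0, "REG5": 5, "REG6": 6}
i1 = ("REG1", "REG2", "REG3")
i2 = ("REG4", "REG5", "REG6")
i3 = ("REG4", "REG1", "REG6")   # depends on i1's result

print(independent(i1, i2), run_sequential(regs, [i1, i2]) == run_dual_issue(regs, i1, i2))
print(independent(i1, i3), run_sequential(regs, [i1, i3]) == run_dual_issue(regs, i1, i3))
```

For the independent pair the two models agree, so software cannot tell the difference. For the dependent pair they disagree, which is exactly the case where a real dual-issue design would have to forward the result or issue the second instruction a cycle later.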

Related

Pipelining affects the clock time or cycle-per-instruction(CPI)?

My book mentions: "Depending on what you consider as the baseline, the reduction can be viewed as decreasing the number of clock cycles per instruction (CPI), as decreasing the clock cycle time, or as a combination. If the starting point is a processor that takes multiple clock cycles per instruction, then pipelining is usually viewed as reducing the CPI."
What I fail to understand is whether pipelining affects the CPI or the clock period. With pipelining, the clock period is taken as max stage delay + latch delay, so pipelining does affect the clock time. It also affects the CPI, because the CPI becomes 1 with pipelining. Am I missing some concept?
Executing an instruction requires a set of operations. For the sake of simplicity assume there are 5:
fetch, instruction decode, execute, memory access, write back.
This can be implemented with several schemes.
A/ Mono cycle processor
The scheme is the following:
The processor fetches an instruction, directs it to a decoder that controls a bank of multiplexers that will configure a large combinatorial datapath that will implement the instruction.
In this model, every instruction requires one cycle and, assuming all 5 "stages" require an equal time t, the period will be 5t.
Hence CPI=1, T=5t
Actually, this was more or less the underlying model of the earliest computers in the late 40's. Apart from that, no real processor has been built like that, but it is theoretically quite doable.
B/ Multi cycle processor
Compared to the previous model, you introduce registers on the datapath. First one fetches the instruction and sends it to the inputs of an automaton that will sequentially apply the computation "stages".
In that case, instructions require 5 cycles (maybe slightly fewer, as some instructions may be simpler and, for instance, skip the memory access). The period is 1t (or maybe slightly more to take into account the register traversal time).
CPI=5, T=t
The first "true" computers were implemented like that, and this was the main architectural model up to the early 80's. Nowadays several microcontrollers or, for instance, the simpler version of NIOS, still rely on this scheme.
C/ pipeline processor
You add extra registers between the stages in order to keep track of the instruction and of all the partial results. In that case, the execution of every stage can be independent and you can execute several instructions simultaneously in different stages.
CPI becomes 1, as you can start a new instruction at every clock cycle (probably a bit more because of the hazards, but that is another story).
And T=t.
So CPI=1, T=t
(the CPI reflects the throughput increase, but the execution time of a single instruction is not reduced)
So pipelining can be seen as either reducing the cycle time with respect to scheme A, or reducing the CPI with respect to scheme B. And you can also imagine an intermediate scheme (say 3 stages, with a period of 2t) where pipelining would reduce both.
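The three schemes can be compared with a small Python sketch. It is an idealized model under the same assumptions as above (5 equal stages of duration t, no hazards, no register overhead); the function name and the scheme labels are only illustrative.

```python
def time_for(n_instructions, scheme, stages=5, t=1.0):
    """Total time to run n instructions, ignoring hazards and register overhead."""
    if scheme == "single_cycle":   # scheme A: CPI = 1, T = stages * t
        return n_instructions * 1 * (stages * t)
    if scheme == "multi_cycle":    # scheme B: CPI = stages, T = t
        return n_instructions * stages * t
    if scheme == "pipelined":      # scheme C: CPI -> 1, T = t, plus pipeline fill time
        return (stages + n_instructions - 1) * t
    raise ValueError(f"unknown scheme: {scheme}")

for scheme in ("single_cycle", "multi_cycle", "pipelined"):
    print(scheme, time_for(1000, scheme))
```

In this model schemes A and B take the same total time (CPI x T = 5t per instruction either way), while the pipeline approaches t per instruction for long instruction streams. A single instruction still takes 5t in all three.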

Why Motorola 68k's 32-bit general-purpose registers are divided into data registers and address registers?

The 68k registers are divided into two groups of eight: eight data registers (D0 to D7) and eight address registers (A0 to A7). What is the purpose of this separation? Would it not be better if they were unified?
The short answer is, this separation comes from the architecture limitations and design decisions made at the time.
The long answer:
The M68K implements quite a lot of addressing modes (especially when compared with RISC processors), with many of its instructions supporting most (if not all) of them. This gives a large variety of addressing-mode combinations within every instruction.
This also adds complexity in terms of opcode execution. Take the following example:
move.l $10(pc), -$20(a0,d0.l)
The instruction just copies a long-word from one location to another, simple enough. But in order to actually perform the operation, the processor needs to figure out the actual (raw) memory addresses to work with for both the source and destination operands. This process, in which the operands' addressing modes are decoded (resolved), is called the effective address calculation.
For this example:
In order to calculate the source effective address, $10(pc), the processor loads the value of the PC (program counter) register and adds $10 to it.
In order to calculate the destination effective address, -$20(a0,d0.l), the processor loads the value of the A0 register, adds the value of the D0 register to it, then subtracts $20.
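The two calculations above can be written out as a short Python sketch. The register contents are hypothetical values chosen for illustration, and `pc` stands for whatever base value the processor uses for PC-relative addressing; the sketch does not model instruction fetch or extension words.

```python
MASK32 = 0xFFFFFFFF  # keep results within 32 bits, like the hardware adder

def ea_pc_displacement(pc, disp):
    """d(PC): PC base value plus a signed displacement."""
    return (pc + disp) & MASK32

def ea_indexed(base, index, disp):
    """d(An,Dn.l): address register + full 32-bit index register + signed displacement."""
    return (base + index + disp) & MASK32

# Hypothetical register contents
pc, a0, d0 = 0x1000, 0x20000, 0x100

src = ea_pc_displacement(pc, 0x10)    # $10(pc)
dst = ea_indexed(a0, d0, -0x20)       # -$20(a0,d0.l)
print(hex(src), hex(dst))
```

With these values the source address is $1010 and the destination address is $200E0, so a single move.l needs three additions before any data is transferred.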
This is quite a lot of calculation for a single opcode, isn't it?
But the M68K is quite fast in performing these calculations. In order to calculate effective addresses quickly, it implements a dedicated Address Unit (AU).
As a general rule, operations on data registers are handled by the ALU (Arithmetic Logical Unit) and operations involving address calculations are handled by the AU (Address Unit).
The AU is well optimized for 32-bit address operations: it performs a 32-bit addition or subtraction within one bus cycle (4 CPU ticks), which the ALU cannot do (it takes 2 bus cycles for 32-bit operations).
However, the AU is limited to loads and basic addition/subtraction operations (as dictated by the addressing modes), and it is not connected to the CCR (Condition Code Register), which is why operations on address registers never update flags.
In short, the AU was there to speed up the calculation of complex addressing modes, but it could not replace the ALU completely (after all, there were only about 68K transistors in the M68K), hence there are two register sets (data and address registers), each with its own dedicated unit.
This is just based on a quick lookup, but a single set of 16 registers would obviously be easier to program. The problem is that every instruction would then need to be able to select any of the 16 registers, which takes one more bit per register operand and so doubles the number of opcodes needed for each one. Splitting the registers in half by purpose is not ideal, but it gives access to more registers overall.

what are the advantages of implementing register in microcontroller architectures i.e. load store architecture

The major difference between RISC and CISC is that in RISC we must use registers to do any arithmetic or logic operation, whereas in CISC we can do such operations directly on memory locations. So what is the advantage of implementing register banking in microcontroller architectures? The question is not about the advantages of RISC in general, but about why registers are needed in a RISC architecture. In a CISC architecture the operation can be done directly on a memory location; we don't need to load it into a register and then move the result back to memory. Below is an example:
CISC: MUL A,B
RISC:
LDA R0,A
LDA R1,B
MUL R0,R1
STR A,R0
So in the above example, what is the advantage of using R0 and R1, i.e. registers? What is the advantage of a load-store architecture?
Register banking is something else; I assume you are simply asking about operating on registers versus operating directly on memory. Memory access takes an eternity, even if cached: several to hundreds of clock cycles for each of the operands. (This assumes a pure register-based RISC scheme, which not all are; the lines are getting fuzzy.) With CISC, if microcoded, the operands go to registers anyway and then the operation happens; if not microcoded, they still get latched into internal temporary storage (registers) before the operation can begin. With RISC you have two or three extra, simpler instructions, and the latching to registers takes the same amount of time as it does in CISC. Now if the algorithm never uses that result, or does not use it for a while, it might be a win for CISC (if not microcoded), but if the value is an intermediate value in an algorithm then it is a clear win for RISC. Even if everything is cached, it takes half a dozen to a dozen clock cycles to get each parameter and write it back; any cache miss and it is an eternity. The same is true for RISC, but with more registers and significantly faster access to those registers: zero or one clock for each value and to store back, for some percentage of the algorithm if not the whole thing.
As with any benchmarking it is trivial to show a RISC winning case and to show a CISC winning case.
The major difference between RISC and CISC is that CISC instructions are complicated and time-consuming, whereas RISC instructions are much simpler: you arrange the tasks you need to do and have tighter control over those tasks, so you don't have a lot of waste per step. One could argue caches were created to deal with the inefficiencies of CISC, or at least of one popular one. Both benefit, sure, but one relies on caches and the other doesn't as much. It is trivial to show CISC-winning code and trivial to show RISC-winning code. The same goes for VLIW and others.
RISC designs are simpler, smaller, pipes can be shorter, the compiler has more control over the performance, etc. So with microcontrollers you can have a very nice processor core with a 3-stage pipeline that is really low power and still quite efficient. The 6502, Z80, 8051, etc. have really died off for the most part. You still see a lot of 8051s if you look (the desktop/laptop you might be reading this with probably has one), but that is due to royalties and not because of its size or performance. You probably have several to dozens of ARM cores for every x86, within the same box or certainly around the house. A CISC is going to be relatively massive and inefficient. It might be possible to get the power consumption down to RISC levels, and that may just be a matter of design and not CISC vs RISC, but the RISC implementations are doing a much better job at watts per MHz than the CISC implementations.
Using registers can simplify the operand-fetching logic of functional units. With CISC, functional units must be able to fetch data from memory. With RISC, all the functional units operate on registers, as it is guaranteed that the data will be there, so they are less complicated.
Also, think of a case where you have multiple MUL operations that share an operand: below, both use the data at location B.
'MUL A, B'
'MUL C, B'
When you perform these operations in CISC, you read B twice. But in RISC, you load it into a register once and can use it multiple times, so there are fewer memory (cache) accesses.
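The access counts can be sketched in Python. This is a deliberately simplified model with hypothetical function names: it assumes a memory-to-memory `MUL X, Y` reads both operands and writes the result back to X, and that the load-store machine has enough registers to keep every operand loaded.

```python
def cisc_accesses(ops):
    """Memory-to-memory 'MUL X, Y': read X, read Y, write X, for every instruction."""
    return 3 * len(ops)

def risc_accesses(ops):
    """Load each distinct operand once, store each distinct destination once."""
    loads = {name for op in ops for name in op}
    stores = {dst for dst, _ in ops}
    return len(loads) + len(stores)

ops = [("A", "B"), ("C", "B")]   # MUL A, B ; MUL C, B
print(cisc_accesses(ops), risc_accesses(ops))
```

For the two instructions above the model gives 6 memory accesses for the memory-to-memory version and 5 for the load-store version, and the gap widens the more often an operand such as B is reused.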
Also think of the number of bits needed to encode that MUL in CISC. As A, B and C can be memory locations, they could be anywhere within your address space. With registers in RISC, on the other hand, fewer bits are needed to encode the operands, hence a less complicated instruction set.
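As a rough illustration of the encoding argument, the following Python sketch counts the bits needed to name one operand. The sizes used (16 registers, a 64 KiB address space) are assumptions picked for the example, not properties of any particular ISA.

```python
def operand_bits(n_locations):
    """Bits needed to select one of n_locations distinct operands."""
    return (n_locations - 1).bit_length()

register_bits = operand_bits(16)        # assumed: 16 general-purpose registers
address_bits = operand_bits(2 ** 16)    # assumed: 64 KiB of byte-addressable memory

# A two-operand instruction such as MUL needs two operand fields.
print(2 * register_bits, 2 * address_bits)
```

Under these assumptions a two-operand register instruction spends 8 bits on operands, while a memory-to-memory one spends 32 bits, before counting the opcode itself.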
From the above responses, we can conclude that using registers instead of direct memory locations gives a benefit in efficiency in terms of clock cycles, and so in power consumption. It also reduces the complexity of the instructions.

What does CPU frequency represent in an Arduino board?

I'm new to Arduino and microcontrollers.
I was studying the specs and found that even the same board may have different frequencies with different input voltages (3.3V vs 5V). So the question is, what does frequency represent? Does it represent how many lines of assembly code it's able to run? Or the maximum PWM frequencies it's able to output?
A further question would be, if I'm looking for a board for a specific project, how do I decide which frequency I will need a priori, instead of trying everything out and see which one works?
What makes me more confused is that when it comes to computer CPUs, it seems that lower frequency CPUs can actually run faster than higher frequency ones (e.g. Intel). So how do I actually know how fast a microcontroller can run?
By frequency we mean the frequency of the CPU clock. Say your Arduino Uno runs on 16 MHz, which is 16,000,000 Hertz.
That means there are 16 million clock cycles per second. The CPU executes the machine code of the program. One assembler instruction can take a varying number of CPU cycles to execute, usually between 1 and 4 cycles for simple stuff, and a little more for heavier arithmetic and writing to memory. So the clock frequency is a rough estimate of how many "lines of assembler" (that is, machine-code instructions) it can run per second. A slightly better measurement is the "MIPS" value, the "millions of instructions per second". There are other benchmark types for CPUs which are more accurate.
If you take a look at the instruction set manual for the AVR architecture, you can see the number of cycles that each instruction needs: (link: http://www.atmel.com/images/Atmel-0856-AVR-Instruction-Set-Manual.pdf)
So for an ADD Rd, Rr instruction, an AVR CPU needs 1 clock cycle.
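The relation between clock frequency and instruction rate is simple arithmetic, sketched here in Python. The function names are made up for the example, and the 2-cycle case is only a generic multi-cycle instruction, not a specific AVR opcode.

```python
def cycle_time_ns(clock_hz):
    """Duration of one clock cycle in nanoseconds."""
    return 1e9 / clock_hz

def instructions_per_second(clock_hz, cycles_per_instruction):
    """Upper bound on the instruction rate if every instruction costs the same."""
    return clock_hz / cycles_per_instruction

F_CPU = 16_000_000  # 16 MHz, as on the Arduino Uno mentioned above

print(cycle_time_ns(F_CPU))                 # one cycle lasts 62.5 ns
print(instructions_per_second(F_CPU, 1))    # 1-cycle instructions such as ADD
print(instructions_per_second(F_CPU, 2))    # a hypothetical 2-cycle instruction
```

Real code mixes instructions with different cycle counts, so the actual rate lies somewhere between such bounds.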
Take a desktop Intel CPU for example. It's common these days that they have a clock frequency of 2 GHz or more, which is 2 billion cycles per second compared to the 16 million cycles per second on the Arduino's AVR CPU. So the Intel CPU beats the Arduino by far. Then again, the Arduino is designed for completely different stuff: it's a small microcontroller with low overhead, runs no OS, etc. The use case for such a CPU (and the architecture) is just different, which makes comparing them unfair. There are many other factors in play, like multi-core CPUs (4 Intel CPU cores vs. 1 AVR), instruction pipelining, the speed of your memory / RAM, etc. It's really hard to compare one CPU to another in every possible use case, but for "general purpose computing", desktop CPUs (AMD, Intel, x64 architecture) far outrun the processing power of a mere Arduino AVR CPU.
I hope this clears up some confusion.
I think one confusion you have may be chip specific. I am not going to look it up right now, but I do remember seeing this: the chip spec may say that for this input voltage range it can handle this frequency and for that voltage range it cannot. I think on the SparkFun boards the 3.3 V ones are 8 MHz and the 5.0 V ones are 16 MHz, or something like that. Anyway, that is not generally the case; it is a chip-by-chip, vendor-by-vendor thing, and that is why you have to read the datasheet. It has nothing to do with Arduinos or AVRs specifically; it is just a general chip design thing.
How do you know how fast your microcontroller can run? That is a very loaded question, depends on your definition of fast. If it is simply what clock frequencies can I use, well "just read the datasheet" for that part, and then depending on your board design choose from what is available, if you do not have any external clocks then your choices may or may not be more limited, you may or may not have a pll that you can use to multiply the clock source.
If your definition of fast is how fast you can perform a task (how many whatevers per second, or how much wall-clock time it takes to complete some specific task), well, that is a benchmark problem, and there are so many variables that there is actually no real answer. Yes, it is very true that an x86 can have a lower clock and run faster than some other x86. Historically the newer ones can do less stuff per clock than older ones for the same binaries; you then have to tune the compile to the newer chip, and then you might get back some of your MIPS per MHz. But that is in part because you are using a different chip design that just speaks the same language (machine code). You can have a tall person who can recite a poem faster than a short person, both using English and the same poem; it has nothing to do with them being short or tall, just that they are different humans.
There are different avr core variations but not remotely on par with the different x86 architectures. so while comparing a tiny vs an xmega you can probably have the xmega run "faster" at the same clock rate simply because it has more registers or a bigger address space, etc. But instructions per second is probably not really different, could be, but my guess is not so much.
Then there is the compiler, the compiler plays a huge role in how "fast" your code runs, change compilers or compiler versions or compiler settings and the machine code produced from the same high level source code (C for example) can vary greatly and as a result can have dramatic effects on the "speed" of the code. Take the dhrystone for example, very easy to demonstrate that the same exact source code on the same exact chip/board, same clock rates, etc can execute at vastly different speeds based on either using different compilers, versions or command line settings, kinda proving that the godfather of benchmarks is basically useless in providing any meaningful information.
Microcontrollers make the problem much worse, as you are often running the program out of flash, and many (not all, but many) have the ability to divide or multiply the clock, or both, but the flash is not always designed for the full range. You might have a chip that boots on an internal clock at 8 MHz, but you can use the PLL to multiply that up to, say, 80 MHz. It is not uncommon for the flash on a chip like that to be limited to, say, 16 MHz. So at 8 MHz the flash can deliver an item, say an instruction, every CPU clock, but at 20 MHz you have to add a wait state, and although the CPU is running much faster you can feed it at no more than 16 MHz (with one wait state, once every two CPU clocks). So it waits around more, and then acts fast when it gets something. Is it really "faster", or is clocking up making you slower? Certainly at just under 16 MHz in this fantasy chip I am describing you can keep it at zero wait states, so it really is faster than 8 MHz; not necessarily twice as fast, as there are other factors, but definitely faster. Just above 16 MHz, though, you take a huge performance hit compared to just under 16 MHz. At just under 32 MHz it is pretty fast again compared to just over 16 MHz; then at just over 32 MHz you need another wait state and it is much slower again, even though the clock is basically the same, and so on.
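The sawtooth behaviour of this fantasy chip can be modelled in a few lines of Python. This is a sketch under simplifying assumptions: one fetch delivers one instruction, wait states are whole CPU clocks, and the 16 MHz flash limit is the made-up figure from the paragraph above, not a value from any datasheet.

```python
import math

FLASH_MAX_HZ = 16_000_000  # assumed flash limit of the fantasy chip

def wait_states(cpu_hz, flash_max_hz=FLASH_MAX_HZ):
    """Smallest number of whole CPU clocks to add so each flash access is slow enough."""
    return max(0, math.ceil(cpu_hz / flash_max_hz) - 1)

def fetches_per_second(cpu_hz, flash_max_hz=FLASH_MAX_HZ):
    """Each fetch costs 1 CPU clock plus the configured wait states."""
    return cpu_hz / (1 + wait_states(cpu_hz, flash_max_hz))

for mhz in (8.0, 15.9, 16.1, 31.9, 32.1):
    hz = round(mhz * 1_000_000)
    print(mhz, wait_states(hz), fetches_per_second(hz))
```

In this model the fetch rate at 16.1 MHz is roughly half of what it is at 15.9 MHz, recovers by 31.9 MHz, and drops again at 32.1 MHz, which is the pattern described above.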
Then there is the fetching: how does the CPU actually fetch? Take an ARM, where it fetches a bunch of data per fetch transaction even if it is not going to execute all of it. If you branch to 0x1004, and at that address there is a branch to 0x2008, the core might fetch 0x10 bytes from 0x1000 to 0x100F, THEN extract the 0x1004 word/instruction, decode it to find it is a branch, then read 0x10 bytes from 0x2000: basically reading 0x20 bytes to find 2 instructions. Take two instructions: if both are in the same 0x10 bytes then good; if one is at 0x100C and the other at 0x2000, that is a performance hit. Take this internal information and apply it to an application and all of its jumping around. Changing one line of code, or adding or removing a single nop in the bootstrap (causing the alignment of the program to change in the address space), can cause anywhere from a tiny to a large change in performance. Swapping two helper functions in the source code of your program, causing them to land at different addresses once compiled, can have little to major performance effects without actually changing the functions themselves.
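The alignment effect can be illustrated with a small Python sketch. It is a hypothetical model, not how any particular ARM core works: it assumes the core fetches aligned 0x10-byte lines and starts a new fetch whenever execution moves to a different line.

```python
LINE = 0x10  # assumed fetch width in bytes

def fetch_transactions(addresses, line=LINE):
    """Count line fetches for instructions executed in the given order."""
    count = 0
    current = None
    for addr in addresses:
        base = addr & ~(line - 1)   # start address of the aligned line
        if base != current:
            count += 1
            current = base
    return count

print(fetch_transactions([0x1004, 0x2008]))   # branch at 0x1004 to 0x2008
print(fetch_transactions([0x1004, 0x1008]))   # two instructions in one line
print(fetch_transactions([0x100C, 0x1010]))   # same two, shifted across a line boundary
```

The last two cases run the same pair of instructions; only their placement differs, yet the number of fetch transactions changes from one to two. This is the kind of effect a single added nop can have.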
So performance is, first off, a foolish thing to go after in one respect. In the other respect, all that matters is your program as written, with the compiler you are using, on the hardware you are using: it is as fast as it is, and there are things you can do to make that code faster on that compiler on that target on that day, by changing compiler settings or the code or both. Ideally you build your final firmware, performance test that, and never build again, because if you build a year or two later it may be on a different host with a different compiler or compiler version, and all bets are off on performance.
How do you pick a board, how much flash, ram, features, clock rate. A lot of it is experience by just trial and error, you fortunately live in a time where you can literally try hundreds of boards all of which cost anywhere from a few bucks to like 10 or 20 each, different cpu architectures different chip vendors, etc. there are many compiler choices and even languages available, basically there are too many easy to acquire choices, unlike back in the day when the parts were pretty cheap but you may have had to build your own board, write your code in asm, maybe even create your own assembler, etc. Have a rom programmer that cost hundreds to thousands of dollars. So go with the AVR you have and play with its features, play with the compiler and/or write or both. Do experiments to see if there are fetch effects or not. If you have clocking choices mess with those see what happens.
Of course all of this starts with reading the chip documentation from the vendor.

Understanding CYCLE_ACTIVITY.* Haswell Performance-Monitoring Events

I'm trying to analyse an execution on an Intel Haswell CPU (Intel® Core™ i7-4900MQ) with the Top-down Microarchitecture Analysis Method (TMAM), described in Chapters B.1 and B.4 of the Intel® 64 and IA-32 Architectures Optimization Reference Manual. (I adjust the Sandy Bridge formulas described in B.4 to the Haswell Microarchitecture if needed.)
Therefore I perform performance counter events measurements with Perf. There are some results I don’t understand:
CPU_CLK_UNHALTED.THREAD_P < CYCLE_ACTIVITY.CYCLES_LDM_PENDING
This holds only for a few measurements, but still is weird. Does the PMU count halted cycles for CYCLE_ACTIVITY.CYCLES_LDM_PENDING?
CYCLE_ACTIVITY.CYCLES_L2_PENDING > CYCLE_ACTIVITY.CYCLES_L1D_PENDING
and CYCLE_ACTIVITY.STALLS_L2_PENDING > CYCLE_ACTIVITY.STALLS_L1D_PENDING
This applies to all measurements. When there is an L1D cache miss, the load gets transferred to the L2 cache, right? So a load that missed L2 also missed L1 earlier. The L1 instruction cache is not counted here, but with *_L2_PENDING being 100x or even 1000x greater than *_L1D_PENDING it is probably not that. Are the stalls/cycles being measured somehow separately? But then there is this formula:
%L2_Bound = (CYCLE_ACTIVITY.STALLS_L1D_PENDING - CYCLE_ACTIVITY.STALLS_L2_PENDING) / CLOCKS
Hence CYCLE_ACTIVITY.STALLS_L2_PENDING < CYCLE_ACTIVITY.STALLS_L1D_PENDING is assumed (the result of the formula must be positive). (The other thing with this formula is that it should probably be CYCLES instead of STALLS. However this wouldn't solve the problem described above.) So how can this be explained?
edit: My OS: Ubuntu 14.04.3 LTS, kernel: 3.13.0-65-generic x86_64, perf version: 3.13.11-ckt26
I'll start with the second part of the question, i.e., how CYCLE_ACTIVITY.CYCLES_L2_PENDING and CYCLE_ACTIVITY.STALLS_L2_PENDING can be larger than CYCLE_ACTIVITY.CYCLES_L1D_PENDING and CYCLE_ACTIVITY.STALLS_L1D_PENDING, respectively.
First, note that the formula for %L2_Bound is from Section B.5 of the Intel Optimization Manual. The first paragraph of that section says:
This section covers various performance tuning techniques using
performance monitoring events. Some techniques can be adapted in
general to other microarchitectures, most of the performance events
are specific to Intel microarchitecture code name Sandy Bridge.
My first hunch was that prefetching has something to do with it (see my comment). This paragraph pushed me further in the right direction; these events may represent different things in Sandy Bridge and in Haswell. Here is what they mean on Haswell:
CYCLE_ACTIVITY.CYCLES_L1D_PENDING: Cycles with pending L1 data cache miss loads.
CYCLE_ACTIVITY.CYCLES_L2_PENDING: Cycles with pending L2 miss loads.
CYCLE_ACTIVITY.STALLS_L1D_PENDING: Execution stalls due to L1 data cache miss loads.
CYCLE_ACTIVITY.STALLS_L2_PENDING: Number of loads missed L2.
The manual also says the counters for L2 should only be used when hyperthreading is disabled. Now here is what they mean on Sandy Bridge:
CYCLE_ACTIVITY.CYCLES_L1D_PENDING: Each cycle there was a miss-pending
demand load this thread, increment by 1.
CYCLE_ACTIVITY.CYCLES_L2_PENDING: Each cycle there was a MLC-miss
pending demand load this thread, increment by 1.
CYCLE_ACTIVITY.STALLS_L1D_PENDING: Each cycle there was a miss-pending
demand load this thread and no uops dispatched, increment by 1.
CYCLE_ACTIVITY.STALLS_L2_PENDING: Each cycle there was a MLC-miss
pending demand load and no uops dispatched on this thread, increment
by 1.
There are three important differences:
Some of the Haswell events are only valid when HT is disabled. All SNB events are valid even when HT is enabled.
CYCLE_ACTIVITY.STALLS_L2_PENDING on HSW counts the number of load misses at L2, but on SNB, it counts the number of cycles during which there was at least one demand load miss at L2.
The HSW events include all accesses, not just demand loads. In contrast, the SNB events only occur for demand loads.
On HSW, CYCLE_ACTIVITY.CYCLES_L2_PENDING can be larger than CYCLE_ACTIVITY.CYCLES_L1D_PENDING because of the miss-pending loads issued by the L1D prefetcher (and/or the L2 prefetcher(s) depending on whether the prefetcher increments the counter for the same level of cache). Similarly, while they count different things, CYCLE_ACTIVITY.STALLS_L2_PENDING can be larger than CYCLE_ACTIVITY.STALLS_L1D_PENDING due to prefetching. TLB prefetching and prefetching at other MMU caches may also impact these performance events on HSW. On the other hand, on SNB, it is guaranteed that CYCLE_ACTIVITY.STALLS_L2_PENDING < CYCLE_ACTIVITY.STALLS_L1D_PENDING, and that's why the %L2_Bound formula is valid on SNB.
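When post-processing counter values, it may help to guard the %L2_Bound formula against the situation described in the question. The following Python sketch is only an illustration: the function name is hypothetical, the inputs are raw counts you have already collected (for example with perf), and returning None for an inconsistent pair is a design choice of this sketch, not something the Intel manual prescribes.

```python
def l2_bound(stalls_l1d_pending, stalls_l2_pending, clocks):
    """%L2_Bound as a fraction of clocks, or None if the counters are inconsistent.

    The formula assumes STALLS_L2_PENDING <= STALLS_L1D_PENDING, which holds for
    the Sandy Bridge event definitions but not necessarily on Haswell.
    """
    diff = stalls_l1d_pending - stalls_l2_pending
    if diff < 0:
        return None  # formula not applicable to these measurements
    return diff / clocks

# Made-up counter values for illustration only
print(l2_bound(300, 100, 1000))
print(l2_bound(100, 100_000, 1000))
```

A None result signals that the two events are not counting nested quantities on this machine, so the Sandy Bridge formula should not be applied as is.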
Like I said in the comment, disabling HT and/or prefetching may "fix" your problem.
Actually, the Intel spec update document for the Mobile Haswell processors mentions two bugs that affect CYCLES_L2_PENDING:
HSM63: The intended behavior of CYCLES_L2_PENDING on Haswell is to count only for demand loads, but it may count inaccurately in SMT mode.
HSM80: CYCLES_L2_PENDING may overcount due to requests from the next page prefetcher.
I think you can minimize the error in CYCLES_L2_PENDING by disabling SMT (either in BIOS or putting the other logical core into sleep). In addition, try to not trigger the NPP. This can be achieved by avoiding locations towards the end of a virtual page where the translation of the next page is not already in the TLB hierarchy.
Related: When L1 misses are a lot different than L2 accesses… TLB related?
Regarding the first part of the question, i.e., how CPU_CLK_UNHALTED.THREAD_P can be smaller than CYCLE_ACTIVITY.CYCLES_LDM_PENDING. One explanation that I could think of is that the CYCLE_ACTIVITY.CYCLES_LDM_PENDING occurs for loads issued from (some) other threads (in particular, on the same physical core), not just the halted thread. Erratum HSM146 mentions that CYCLES_LDM_PENDING may count inaccurately when the logical core is not in C0, which explains how CPU_CLK_UNHALTED.THREAD_P can be smaller than CYCLES_LDM_PENDING. Disabling HT may eliminate this inaccuracy, although the spec update document doesn't provide any workaround.
