How does the program read 32 bits from memory in a single clock cycle?

So, I have this assignment where I need to design a 32-bit RISC 5-stage pipeline. I must support at least 32 (32-bit) instructions and 32 (32-bit) data values. The memory should be read in 1 clock cycle. For this, I have used a word-addressable memory (each address holds 32 bits). But I want to make it byte-addressable.
One way of doing this is to make the external clock four times slower and feed that to the other stages of the pipeline, while feeding the original clock to the memory. But this will make the simulation a bit hectic, since I would have to run the clock 20 times (instead of 5).
Another way would be to run a clock (attached to the memory) that is four times faster than the external clock. By the time a single external clock cycle passes, the memory will have been accessed four times, so the complete 32 bits will have been fetched. But circuits for doubling/quadrupling the frequency of a clock seem too complicated.
Are there simpler frequency doubler circuits that can be implemented, or is there any other way to do this?
I am using logisim-evolution to simulate this, and for the memory part, I have used the in-built RAM.
This is the RAM: [image of the Logisim RAM component]

The normal way to make a 32-bit byte-addressable memory is to have four 8-bit memory subsystems that are all fed the top N-2 bits of the byte address. When doing a 32-bit load or store, all four memory subsystems will be active. When doing a 16-bit load or store, the second-from-the-bottom address bit will be used to control whether to activate the first and second subsystems or the third and fourth. When doing an 8-bit load or store, that same bit picks the pair, and the bottom address bit then selects between the first and second, or between the third and fourth, subsystem.
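To make the lane selection concrete, here is a minimal Python sketch of the four-bank arrangement described above. The class name `ByteLaneMemory`, the little-endian lane order, and the restriction to aligned accesses are assumptions made for illustration; they are not part of the assignment or of Logisim.

```python
class ByteLaneMemory:
    """Hypothetical model: four 8-bit banks that share the upper address bits."""

    def __init__(self, words):
        # banks[i] holds byte lane i of every 32-bit word (little-endian lanes assumed).
        self.banks = [bytearray(words) for _ in range(4)]

    @staticmethod
    def lane_enables(addr, size):
        """Return the lanes activated by an aligned 1-, 2-, or 4-byte access."""
        if size not in (1, 2, 4) or addr % size != 0:
            raise ValueError("only aligned 1-, 2-, or 4-byte accesses are modelled")
        first = addr & 0b11  # the two low address bits pick the starting lane
        return list(range(first, first + size))

    def store(self, addr, value, size):
        row = addr >> 2  # the top N-2 address bits go to every bank
        for i, lane in enumerate(self.lane_enables(addr, size)):
            self.banks[lane][row] = (value >> (8 * i)) & 0xFF

    def load(self, addr, size):
        row = addr >> 2
        value = 0
        for i, lane in enumerate(self.lane_enables(addr, size)):
            value |= self.banks[lane][row] << (8 * i)
        return value


mem = ByteLaneMemory(32)          # 32 words of 32 bits
mem.store(8, 0x11223344, 4)       # a word store activates all four banks
upper_half = mem.load(10, 2)      # a halfword load activates banks 2 and 3 only
```

In hardware all enabled banks respond in the same cycle, which is why no faster memory clock is needed; the loops here only stand in for that parallelism.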


Why Motorola 68k's 32-bit general-purpose registers are divided into data registers and address registers?

The 68k registers are divided into two groups of eight: eight data registers (D0 to D7) and eight address registers (A0 to A7). What is the purpose of this separation? Wouldn't it be better if they were unified?
The short answer is, this separation comes from the architecture limitations and design decisions made at the time.
The long answer:
The M68K implements quite a lot of addressing modes (especially when compared with RISC-based processors), with many of its instructions supporting most (if not all) of them. This gives a large variety of addressing-mode combinations within every instruction.
This also adds complexity in terms of opcode execution. Take the following example:
move.l $10(pc), -$20(a0,d0.l)
The instruction just copies a long-word from one location to another, simple enough. But in order to actually perform the operation, the processor needs to figure out the actual (raw) memory addresses to work with for both the source and destination operands. This process, in which the operands' addressing modes are decoded (resolved), is called the effective address calculation.
For this example:
In order to calculate the source effective address, $10(pc), the processor loads the value of the PC (program counter) register and adds $10 to it.
In order to calculate the destination effective address, -$20(a0,d0.l), the processor loads the value of the A0 register, adds the value of the D0 register to it, then subtracts $20.
That is quite a lot of calculation for a single opcode, isn't it?
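The two effective-address calculations above can be sketched in a few lines of Python. The register values are made up for illustration, the function names are hypothetical, and the sketch glosses over exactly which value of PC the real chip uses at that point in the instruction.

```python
MASK32 = 0xFFFFFFFF  # addresses wrap around at 32 bits


def ea_pc_displacement(pc, disp):
    """Effective address for d(pc): program counter plus displacement."""
    return (pc + disp) & MASK32


def ea_indexed(base, index, disp):
    """Effective address for d(an,dn.l): base register + index register + displacement."""
    return (base + index + disp) & MASK32


# Made-up register contents, purely for illustration.
regs = {"pc": 0x00001000, "a0": 0x00020000, "d0": 0x00000100}

src = ea_pc_displacement(regs["pc"], 0x10)          # $10(pc)
dst = ea_indexed(regs["a0"], regs["d0"], -0x20)     # -$20(a0,d0.l)
```

With these values the source resolves to $1010 and the destination to $200E0, and only additions and subtractions were needed to get there.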
But the M68K is quite fast in performing these calculations. In order to calculate effective addresses quickly, it implements a dedicated Address Unit (AU).
As a general rule, operations on data registers are handled by the ALU (Arithmetic Logical Unit) and operations involving address calculations are handled by the AU (Address Unit).
The AU is well optimized for 32-bit address operations: it performs a 32-bit addition/subtraction within one bus cycle (4 CPU ticks), which the ALU cannot do (it takes 2 bus cycles for 32-bit operations).
However, the AU is limited to just load and basic addition/subtraction operations (as dictated by the addressing modes), and it's not connected to the CCR (Condition Code Register), which is why operations on address registers never update flags.
That said, the AU had to be there to optimize the calculation of complex addressing modes, but it just couldn't replace the ALU completely (after all, there were only about 68K transistors in the M68K). Hence there are two register sets (data and address registers), each with its own dedicated unit.
So this is just based on a quick lookup, but having 16 general registers would obviously be easier to program. The problem could be that you would then have to make instructions for each of the 16 registers, which would double the number of opcodes needed. Using half for each purpose is not ideal but gives access to more registers in general.

What does CPU frequency represent in an Arduino board?

I'm new to Arduino and microcontrollers.
I was studying the specs and found that even the same board may have different frequencies with different input voltages (3.3V vs 5V). So the question is, what does frequency represent? Does it represent how many lines of assembly code it's able to run? Or the maximum PWM frequencies it's able to output?
A further question would be, if I'm looking for a board for a specific project, how do I decide which frequency I will need a priori, instead of trying everything out and see which one works?
What makes me more confused is that when it comes to computer CPUs, it seems that lower frequency CPUs can actually run faster than higher frequency ones (e.g. Intel). So how do I actually know how fast a microcontroller can run?
By frequency we mean the frequency of the CPU clock. Say your Arduino Uno runs on 16 MHz, which is 16,000,000 Hertz.
That means there are 16 million clock cycles per second. The CPU executes the machine code of the program. One assembler instruction can take a varying number of CPU cycles to execute, usually between 1 and 4 cycles for simple stuff, and a little more for heavier arithmetic and writing to memory. So the clock frequency is a rough estimate of how many "lines of assembler" (that is, machine-code instructions) it can run per second. A slightly better measurement is the "MIPS" value, the "millions of instructions per second". There are other benchmark types for CPUs which are more accurate.
If you take a look at the instruction set manual for the AVR microprocessor architecture, you can see the cycles that each instruction needs: (link: http://www.atmel.com/images/Atmel-0856-AVR-Instruction-Set-Manual.pdf)
So for an ADD Rd, Rr instruction, an AVR CPU needs 1 clock cycle.
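As a rough back-of-the-envelope sketch, cycle counts can be turned into time like this. The 16 MHz clock is the Uno figure from above; the average of 2 cycles per instruction is a made-up number, since the real average depends on the instruction mix of your program.

```python
def instruction_time_ns(cycles, clock_hz):
    """Time taken by an instruction that needs `cycles` clock cycles."""
    return cycles * 1e9 / clock_hz


def instructions_per_second(clock_hz, avg_cycles_per_instruction):
    """Rough throughput estimate for a given average cycle count."""
    return clock_hz / avg_cycles_per_instruction


CLOCK_HZ = 16_000_000  # 16 MHz, as on the Arduino Uno

add_time = instruction_time_ns(1, CLOCK_HZ)      # a 1-cycle ADD Rd, Rr
rough_ips = instructions_per_second(CLOCK_HZ, 2) # assuming 2 cycles on average
```

So a single-cycle ADD takes 62.5 ns at 16 MHz, and halving the clock to 8 MHz doubles that.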
Take a desktop Intel CPU for example. It's common these days that they have a clock frequency of 2 GHz or more, which is 2 billion cycles per second compared to the 16 million cycles per second on the Arduino's AVR CPU. So the Intel CPU beats the Arduino by far. Then again, the Arduino is designed for completely different stuff: it's a small microcontroller with low overhead, runs no OS, etc. The use-case for such a CPU (and the architecture) is just different, which makes comparing them unjustified. There are many other factors in play, like multi-core CPUs (4 Intel CPU cores vs. 1 AVR), instruction pipelining, the speed of your memory / RAM, etc. It's really hard to compare one CPU to another in every possible use-case, but for "general purpose computing", desktop CPUs (AMD, Intel, x64 architecture) far outrun the processing power of a mere Arduino AVR CPU.
I hope this clears up some confusion.
I think one confusion you have may be chip specific. I am not going to look it up right now, but I do remember seeing this: the chip spec may say that for this input voltage range it can handle this frequency and for that voltage range it cannot. I think the sparkfun 3.3V boards are 8 MHz and the 5V ones are 16 MHz or something like that. Anyway, that is not generally the case, but it is a chip-by-chip, vendor-by-vendor thing, and that is why you have to read the datasheet. It has nothing to do with arduinos or avrs specifically; it is just a general chip design thing.
How do you know how fast your microcontroller can run? That is a very loaded question, depends on your definition of fast. If it is simply what clock frequencies can I use, well "just read the datasheet" for that part, and then depending on your board design choose from what is available, if you do not have any external clocks then your choices may or may not be more limited, you may or may not have a pll that you can use to multiply the clock source.
If your definition of fast is how fast can I perform this task, how many whatevers per second, or how much wall clock time does it take to complete some specific task, well, that is a benchmark problem and there are so many variables that there is actually no real answer. Yes, it is very true that an x86 can have a lower clock and run faster than some other x86. Historically the newer ones can do less stuff per clock than older ones for the same binaries; you then have to tune the compile to the newer chip and then you might get back some of your MIPS to MHz. But that is in part because you are using a different chip design that just speaks the same language (machine code). You can have a tall person that can recite a poem faster than a short person, both using English and the same poem; it has nothing to do with them being short or tall, just that they are different humans.
There are different avr core variations, but not remotely on par with the different x86 architectures. So while comparing a tiny vs an xmega, you can probably have the xmega run "faster" at the same clock rate simply because it has more registers or a bigger address space, etc. But instructions per second is probably not really different; it could be, but my guess is not so much.
Then there is the compiler, the compiler plays a huge role in how "fast" your code runs, change compilers or compiler versions or compiler settings and the machine code produced from the same high level source code (C for example) can vary greatly and as a result can have dramatic effects on the "speed" of the code. Take the dhrystone for example, very easy to demonstrate that the same exact source code on the same exact chip/board, same clock rates, etc can execute at vastly different speeds based on either using different compilers, versions or command line settings, kinda proving that the godfather of benchmarks is basically useless in providing any meaningful information.
Microcontrollers make the problem much worse, as you are often running the program out of flash, and many (not all, but many) have the ability to divide or multiply the clock, or both, but the flash is not always designed for the full range. You might have a chip that boots on an internal clock at 8 MHz, but you can use the pll to multiply that up to, say, 80 MHz. But it is not uncommon that the flash is limited to, say, 16 MHz on a chip like that. So at 8 MHz the flash can deliver an item, say an instruction, every cpu clock, but at 20 MHz you have to add a wait state, and although the cpu is running much faster you can only feed it at 16 MHz at most, so it is waiting around more and then acts fast when it gets something. Is it really "faster", or is clocking up making you slower? Certainly at just under 16 MHz in this fantasy chip I am describing you can keep it to zero wait states, so it is really faster, not necessarily twice as fast as there are other factors, but definitely faster than 8 MHz. Just at or above 16 MHz, though, you take a huge performance hit compared to just under 16 MHz. At just under 32 MHz it is pretty fast compared to just over 16 MHz; then at just over 32 MHz there is another wait state setting and it is much slower again, even though the clock is basically the same, and so on.
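Here is a small Python sketch of that fantasy chip, to show the sawtooth. It assumes a very simple model: the flash can be read at most 16 million times per second, each wait state adds one cpu clock to every fetch, and exactly 16 MHz still runs with zero wait states (on a real part the datasheet tells you where the boundary is). The function names are hypothetical.

```python
FLASH_MAX_HZ = 16_000_000  # assumed flash limit of the fantasy chip


def wait_states(cpu_hz, flash_max_hz=FLASH_MAX_HZ):
    """Fewest wait states that keep the flash access rate within its limit."""
    return -(-cpu_hz // flash_max_hz) - 1  # integer ceiling division, minus one


def effective_fetch_hz(cpu_hz, flash_max_hz=FLASH_MAX_HZ):
    """Fetches per second when every fetch costs 1 + wait_states cpu clocks."""
    return cpu_hz / (1 + wait_states(cpu_hz, flash_max_hz))


for mhz in (8, 16, 17, 32, 33):
    hz = mhz * 1_000_000
    print(mhz, "MHz:", wait_states(hz), "wait states,",
          effective_fetch_hz(hz) / 1e6, "M fetches/s")
```

In this model 17 MHz fetches at 8.5 M/s, barely better than 8 MHz and far worse than 16 MHz, which is the performance hit described above.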
Then there is the fetching: how does the cpu actually fetch? Take an arm, where it fetches a bunch of data per fetch transaction even if it is not going to execute all of it. If you branch to 0x1004 and at that address there is a branch to 0x2008, the core might fetch 0x10 bytes from 0x1000 to 0x100F, THEN extract the 0x1004 word/instruction, decode it to find it is a branch, then read 0x10 bytes from 0x2000, basically reading 0x20 bytes to find 2 instructions. Take two instructions: if both are in the same 0x10 bytes then good; if one is at 0x100C and the other at 0x2000, that is a performance hit. Take this internal information and apply it to an application and all of its jumping around: changing one line of code, or adding or removing a single nop in the bootstrap (causing the alignment of the program to change in the address space), can cause anywhere from a tiny to a large change in performance. Swapping two helper functions in the source code of your program, causing them to land at different addresses once compiled, can have little to major performance effects without actually changing the functions themselves.
So performance is, first off, a foolish task to go after in one respect. In the other respect, all that matters is your program as written, with the compiler you are using, on the hardware you are using. It is as fast as it is, and there are things you can do to make that code faster on that compiler on that target on that day, by changing compiler settings or the code or both. And ideally you build your final firmware, performance test that, and never build again, as if you build a year or two later it may be on a different host with a different compiler or compiler version and all bets are off on performance.
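The alignment effect can be played with in a toy model. This Python sketch assumes the core fetches aligned 0x10-byte blocks and only keeps the most recently fetched block around; real cores have prefetchers and caches that make the picture messier. The function names are hypothetical.

```python
LINE_BYTES = 0x10  # assumed size of one fetch transaction


def fetch_line(addr):
    """Start address of the aligned block that contains addr."""
    return addr & ~(LINE_BYTES - 1)


def count_fetches(instruction_addrs):
    """Count block fetches for a sequence of executed instruction addresses."""
    fetches = 0
    current = None
    for addr in instruction_addrs:
        line = fetch_line(addr)
        if line != current:  # instruction is not in the block we already hold
            fetches += 1
            current = line
    return fetches


same_block = count_fetches([0x1004, 0x1008])    # both in 0x1000..0x100F
branch_away = count_fetches([0x1004, 0x2008])   # the example from above
```

Shifting the first pair by one word, to 0x100C and 0x1010, already costs a second fetch in this model, which is the kind of thing a single added nop can do.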
How do you pick a board, how much flash, ram, features, clock rate. A lot of it is experience by just trial and error, you fortunately live in a time where you can literally try hundreds of boards all of which cost anywhere from a few bucks to like 10 or 20 each, different cpu architectures different chip vendors, etc. there are many compiler choices and even languages available, basically there are too many easy to acquire choices, unlike back in the day when the parts were pretty cheap but you may have had to build your own board, write your code in asm, maybe even create your own assembler, etc. Have a rom programmer that cost hundreds to thousands of dollars. So go with the AVR you have and play with its features, play with the compiler and/or write or both. Do experiments to see if there are fetch effects or not. If you have clocking choices mess with those see what happens.
Of course all of this starts with reading the chip documentation from the vendor.

Arduino measuring power to run code

I have C code running on bare metal (no OS). The code takes in some sensor data, performs a computation, forms a packet and transmits. The board is battery powered.
I'm interested in knowing the energy consumed for each operation in joules. Is this possible? How would one go about doing it?
The number of joules used per instruction depends upon the processor you are using and which instruction you are looking at. I believe the ARM and the Atmel AVR processors have no real hardware power management which makes things simpler.
How much energy an instruction uses has to do with how much and what type of on-silicon circuitry it uses. This means that trying to theoretically compute the number of joules will be complicated since it is not simply related to the number of cycles the instruction uses.
So you’ll have to do it experimentally. Here’s what I’d do.
1. Remove all compiler optimizations.
2. Do a frequency analysis to find your hot operations and pick out the most used. (I’m assuming we aren’t talking ASM instructions but ‘C’ instructions.)
3. Write a loop that repeats the instruction, say, 20 times, and have the loop run for several seconds at a minimum.
4. Replace your battery with a power supply.
5. Use the series resistor to measure power as mentioned in the comment but (of course) scale the voltage appropriately.
6. Run the looping program and get a statistical sampling of the power.
7. Do this for all of your hot operations.
8. Compute power usage for your program.
9. Validate the power usage against real power measurements of the execution of your program.
10. Adjust (i.e. normalize) your computations as appropriate.
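The arithmetic behind the series-resistor measurement can be sketched as follows. This assumes the resistor sits between the supply and the board (so the board sees the supply voltage minus the drop across the resistor) and that power is sampled at a fixed period; the function names and all numbers are made up for illustration.

```python
def shunt_power_w(supply_v, shunt_v, shunt_ohms):
    """Power drawn by the board, from the voltage drop across the series resistor."""
    current_a = shunt_v / shunt_ohms  # Ohm's law gives the current
    load_v = supply_v - shunt_v       # voltage actually reaching the board
    return load_v * current_a


def energy_j(power_samples_w, sample_period_s):
    """Energy in joules: sum of power samples times the sampling period."""
    return sum(power_samples_w) * sample_period_s


def energy_per_iteration_j(total_energy_j, iterations):
    """Average energy of one pass through the measured loop."""
    return total_energy_j / iterations


# Made-up numbers: 5 V supply, 0.5 V across a 10 ohm resistor.
power = shunt_power_w(5.0, 0.5, 10.0)
total = energy_j([power] * 4, 0.5)               # four samples, 0.5 s apart
per_iter = energy_per_iteration_j(total, 1000)   # loop ran 1000 times
```

To isolate the operation itself you would also measure an empty loop the same way and subtract it, since the loop overhead uses energy too.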
You’ll also have to take into account the memory hierarchy. Accessing off chip memory takes energy. When operations or data are cached, it’s going to change your energy equation.
I figure this should work but don’t know. Good luck.

Why are 32-bit registers divided into 4 parts?

I'm learning about registers. It looks like 32-bit registers are divided up so that they can be accessed as 8-bit registers. This looks very inefficient. Performance would be improved if they didn't do this. So why do they do it?
Also, it costs extra money to design them like this. Why not make the CPU cheaper by not doing it?
Because if you're only dealing with 8-bit values, it'd be inefficient to have to issue all the bitmasks to limit those 32/64-bit registers to just the 8 bits you're working on.
So, x86 registers have
AH/AL = high/low 8bits of a 16bit register
AX = whole 16bit register
EAX = whole 32bit register
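As a rough illustration of that aliasing, here is a small Python model in which AX, AH and AL are just views onto the bits of EAX. The class name `RegisterA` is hypothetical, and the model only covers the 32-bit register (it ignores RAX and the 64-bit rules).

```python
class RegisterA:
    """Hypothetical model of EAX with its AX, AH and AL sub-register views."""

    def __init__(self, eax=0):
        self.eax = eax & 0xFFFFFFFF

    @property
    def ax(self):
        return self.eax & 0xFFFF

    @ax.setter
    def ax(self, value):
        # Writing AX replaces the low 16 bits and leaves the upper 16 alone.
        self.eax = (self.eax & 0xFFFF0000) | (value & 0xFFFF)

    @property
    def al(self):
        return self.eax & 0xFF

    @al.setter
    def al(self, value):
        self.eax = (self.eax & 0xFFFFFF00) | (value & 0xFF)

    @property
    def ah(self):
        return (self.eax >> 8) & 0xFF

    @ah.setter
    def ah(self, value):
        self.eax = (self.eax & 0xFFFF00FF) | ((value & 0xFF) << 8)


r = RegisterA(0x12345678)
r.ah = 0xAB  # only bits 8..15 change
```

After the write, EAX holds 0x1234AB78. The masking and shifting in the setters is exactly the work the programmer would have to spell out in instructions if the 8-bit names did not exist.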
It's far more efficient, in terms of instruction size, to have
mov ah, 0xXX (2 bytes)
rather than forcing
mov ax, 0x00XX (3 bytes in 16-bit code, 4 bytes with the operand-size prefix in 32-bit code)
mov eax, 0x000000XX (5 bytes)
As for "designing the cpu to make it cheaper" - it's for backwards compatibility. All modern x86 processors are actually internally a RISC design, with a major chunk of silicon dedicated to taking the x86 instructions coming in and converting them into the CPU's own internal micro-ops (which is basically a RISC instruction set).
The Intel 8080, which was the first "mainstream" microprocessor, had seven main 8-bit registers (A, B, C, D, E, H, and L). Because memory addresses were 16 bits, instructions that needed to use a non-constant memory operand would use a pair of registers (most commonly H and L, but sometimes B and C, or D and E) to form the address. Because the registers in the aforementioned pairs were often used together to represent 16-bit values, there were a few instructions which could operate upon the register pairs as 16-bit quantities. An instruction to add BC to HL would perform the addition by adding C to L, and then by adding B to H (plus a carry if needed). I'm not familiar enough with the 4004 or 8008 (the two predecessors of the 8080) to know if either of them did anything similar in its architecture.
When Intel produced the 8088, they included a full 16-bit arithmetic unit, but they wanted code which was written for the 8080 to be easily convertible to their new architecture. On the 8080, a lot of code had been written to "manually" form addresses out of the 8-bit parts, since doing so was often much faster than using the 16-bit instructions to do the math. For example, if one needed to access some specified table of 256 entries with an index stored in A, one could have done something like (Zilog notation shown, but the 8080 had the same instructions):
ld hl,(baseOfTable) ; 16-bit address
ld c,a
ld b,#0
add hl,bc
ld a,(hl)
but if one could make certain the table was aligned on a 256-byte boundary, one could simplify the code considerably:
ld l,a
ld a,(tableBaseMSB) ; Just load the MSB--assume the LSB is zero
ld h,a
ld a,(hl)
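The reason the shortcut works is plain arithmetic: if the low byte of the table base is zero, adding an 8-bit index never carries into the high byte, so the address can be assembled from two bytes without an add. A quick Python check, with made-up table addresses and hypothetical function names:

```python
def address_general(base, index):
    """16-bit add, as in the first sequence (add hl,bc)."""
    return (base + index) & 0xFFFF


def address_aligned(base_msb, index):
    """Address built from two bytes, as in the second sequence (H = MSB, L = index)."""
    return ((base_msb & 0xFF) << 8) | (index & 0xFF)


ALIGNED_BASE = 0x4200    # made-up table address on a 256-byte boundary
UNALIGNED_BASE = 0x4280  # made-up table address that is not aligned

# For an aligned table the two methods agree for every possible index.
agree = all(
    address_general(ALIGNED_BASE, i) == address_aligned(ALIGNED_BASE >> 8, i)
    for i in range(256)
)
```

For the unaligned base the byte-assembly trick gives the wrong address, which is why the alignment requirement matters.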
With the 8088 instruction set, it wouldn't terribly often be useful for code written "from scratch" to access the upper and lower parts of registers separately, but there was a lot of code written for the 8080 which used such techniques, and Intel wanted to make it easy for people to convert such code for use on the 8088. Allowing registers to be built from 8-bit pieces was helpful in that regard.
Incidentally, there was another advantage to Intel's architecture: since it included four 16-bit-only registers and four registers which could each be used as either one 16-bit or two 8-bit parts, it was possible for code to hold 12 values in registers if eight of them were 255 or less, or eleven values if six of them were 255 or less, etc. When using architectures with more registers, eking out an extra register here and there isn't quite so important, but on the 8088 it was often very helpful.
The ability to address portions of the registers has no effect on their performance when used as 32-bit registers. In that case, this capability just isn't used.
CPUs, regardless of their native bit size, need to manipulate 8-bit values very, very often. Strings of text, for example, are frequently manipulated as consecutive 8-bit values. International character sets are often manipulated as sets of consecutive 16-bit values. So being able to operate rapidly on 8-bit and 16-bit values is of tremendous importance.
If you're asking as a practical matter for x86 CPUs, it's too late. The very first PC CPUs didn't even have 32-bit registers, and compatibility has been retained all the way through.
Backwards compatibility. Processor manufacturers did not want to break compatibility with old software. This is the main reason why x86_64 processors still support 16-bit software (virtual mode). If you look closely, you'll see that the majority of the features in the x86 architecture are shaped by compatibility concerns. I'm not hating.

32 bit operation vs 64 bit operation on a 64bit machine/OS

Which operation, i.e. a 32-bit operation or a 64-bit operation (like masking a 32-bit flag or a 64-bit flag), would be cheaper on a 64-bit machine?
As you don't specify an architecture, I can suggest only a general answer, as it depends on the operation and on the processor architecture in question. Once you have the data in a CPU register, then most operations will usually take the same amount of time regardless of whether the value was originally 32 or 64 bit.
However, there can be some differences on some architectures in how the data gets into a register. Here are some situations where a "native" value may be faster than a smaller value on some hardware:
Fetching data
Fetching a "native sized" value may be faster than fetching a smaller value. That is, the processor may need to fetch 64 bits regardless, and then mask/shift off 32 bits of it to "load" a 32-bit value. This masking/shifting is not required when working on a 64 bit value, so it can possibly be loaded faster. (This goes against the intuitive idea that something twice as big might take twice as long to load).
Alternatively, if the bus can handle half-width fetches, then 32 bits may be loaded in the same time as a 64 bit value.
To confuse matters more, the CPU caches can change results as well. Usually when you read one value from memory, a "line" of several memory locations are read into the cache, so that subsequent reads can be supplied from fast cache memory instead of requiring a full fetch from RAM. In which case using 32 bit values will work out faster if you are accessing many values in sequence, as twice as many of them will be cached, resulting in fewer cache misses.
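The cache-line point is easy to put numbers on. This Python sketch assumes a 64-byte cache line and an array that starts on a line boundary; both are assumptions, since the line size depends on the CPU, and the function name is hypothetical.

```python
def lines_touched(count, value_bytes, line_bytes=64):
    """Cache lines covered by `count` consecutive values in a line-aligned array."""
    total_bytes = count * value_bytes
    return -(-total_bytes // line_bytes)  # integer ceiling division


lines_32 = lines_touched(1024, 4)  # 1024 values of 32 bits each
lines_64 = lines_touched(1024, 8)  # 1024 values of 64 bits each
```

Scanning 1024 values touches 64 lines with 32-bit values and 128 lines with 64-bit values, so in this model the 64-bit version has twice as many lines to bring in from memory.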
Computation
The processor hardware may be optimised for dealing with 64-bit values, so calculating values using 32 bits may cause it more trouble, and thus could slow things down. E.g. it might be able to process a double (64-bit) value "natively" but have to convert a float (32-bit) value into a double before it can process it, then convert the result back to a float afterwards.
Alternatively, there may be 32-bit and 64-bit paths through the CPU, or the CPU may be able to do any conversions required in a way that does not affect the overall execution time of the instruction, in which case they may be calculated at the same speed.
This may affect complex operations (floating point) but is unlikely to be a problem with simple ops (AND, OR, etc.).
Generally speaking a 64 bit operation or a 32 bit operation would have the same cost. The 32-bit operation might end up taking an extra instruction depending on if the compiler needed to ensure that the upper 32-bits of a 64-bit register was cleared (or sign-extended), but that operation generally has little cost.
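For reference, this is what the "extra instruction" computes. The Python sketch below models a 64-bit register as an integer and shows clearing versus sign-extending the upper 32 bits; the function names are hypothetical, and on real hardware each of these is typically a single cheap instruction.

```python
MASK32 = 0xFFFFFFFF
UPPER32 = 0xFFFFFFFF00000000


def zero_extend32(reg64):
    """Keep the low 32 bits and clear the upper 32 bits."""
    return reg64 & MASK32


def sign_extend32(reg64):
    """Keep the low 32 bits and copy bit 31 into the upper 32 bits."""
    low = reg64 & MASK32
    if low & 0x80000000:
        return low | UPPER32
    return low


stale = 0xDEADBEEFFFFFFFFE         # low half holds the 32-bit value, upper half is leftover
as_unsigned = zero_extend32(stale)  # 0x00000000FFFFFFFE
as_signed = sign_extend32(stale)    # 0xFFFFFFFFFFFFFFFE, i.e. -2 as a signed 64-bit value
```

Masking a flag with AND gives the same answer either way as long as the flag bit lies in the low 32 bits, which is why the width rarely matters for that operation.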
There might be some difference in instruction encoding that might make one take more space than the other, but that (and which way the advantage would lie) would depend on a number of factors.
It depends -- masking a flag will normally use an AND instruction, which will execute quickly (~1 cycle) once the data is in a register. Loading 64 bits of data from memory will generally be slower than loading 32 bits of data -- but if you're using more than 32 flags, you'll have to load more than 32 bits of data anyway, and handling the masking in one cycle will improve speed over doing it in two or three instructions. Whether any of this makes a difference to overall speed will generally depend on surrounding instructions -- for example, if the data is already in the cache anyway, you may not need to load it from memory.
In other words, it's difficult to make generalizations -- you just about have to look at a specific code sequence (not just one instruction, but a whole sequence) to say anything -- and the result for that sequence may not mean much about another sequence that initially looks almost identical.
