Why 24-bit registers? - microcontroller

In my work I deal with various microcontrollers, microprocessors and DSPs. Many of them have 24-bit registers and counters.
I know how to use them; that is not my question.
My question is: why do they have 24-bit registers? Why not make them 32-bit?
As far as I can tell it is not a matter of size, because the registers are already 32 bits wide but max out at 0xFFFFFF.
Does this allow an easier hardware implementation? Faster calculations?
Or is it just "hmm, let's use 24-bit registers to make the programmers' job harder"?

My guess is that most DSP applications simply don't need 32 bits. Digital audio, for example, is most commonly processed at 24-bit fidelity. Implementing 32 bits would require more transistors and would therefore cost more.
Why would 32 bits be easier for the programmer?
Also, you state that the registers have a maximum of 0xFFFFFF, which makes them 24-bit by definition, not 32-bit as you suggest.

There is no particular reason for 8/16/32/64 bits. There are 24-bit DSPs, 18-bit PICs, 36-bit PDPs... Each bit costs time, money and power, so having enough bits is good enough; there is no need to overdo it. Just look at the original PCs with 20 address lines, even though memory pointers could be up to 32 bits.

Tagging onto Tomas' answer, some DSPs have a register mode where overflow saturates, locking the value at the highest state. If the data is 24-bit and it overflows into the 25th bit, it should saturate there, not at the 32-bit rollover.
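
To make the idea concrete, here is a minimal C sketch of 24-bit saturation, assuming signed 24-bit samples held in 32-bit variables (the limit names and the helper are mine, not taken from any particular DSP):

    #include <stdint.h>

    #define INT24_MAX  ((int32_t)0x007FFFFF)   /* +8388607 */
    #define INT24_MIN  (-INT24_MAX - 1)        /* -8388608 */

    /* Add two samples and clamp at the 24-bit limits instead of letting
       the result wrap at the 32-bit boundary. */
    static int32_t sat24_add(int32_t a, int32_t b)
    {
        int32_t sum = a + b;            /* inputs assumed within 24-bit range */
        if (sum > INT24_MAX) return INT24_MAX;
        if (sum < INT24_MIN) return INT24_MIN;
        return sum;
    }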

For audio you typically want 16-bit output. Since you lose some precision during processing, designers pick a working size somewhat larger than 16 bits, which happens to be 24 bits.
The reason not to go to a full 32 bits is that it would need substantially more hardware, especially for multiplication.
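
As a rough illustration of that "process wide, output narrow" idea (purely a sketch; real converters usually dither before truncating, and an arithmetic right shift of negative values is assumed):

    #include <stdint.h>

    /* Take the final 16-bit output sample from the top of a 24-bit
       intermediate result by dropping the low 8 bits of headroom. */
    static int16_t to_output16(int32_t sample24)   /* assumed within 24-bit range */
    {
        return (int16_t)(sample24 >> 8);
    }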

Related

Do NaN-boxing and tagged pointers have a future on 64-bit platforms?

On common 64-bit architectures like x86-64 and arm64, usually only 48 bits are used for memory addressing, while the remaining bits are copies of bit 47 (which is usually zero for user-space programs). Thus, the top 16 bits can be used to store additional data such as type tags, as long as those bits are masked off before dereferencing. Alternatively, the 48 bits fit into the NaN representation of a 64-bit float. Both techniques are often used by dynamic/interpreted languages.
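
For illustration, a minimal C sketch of the pointer-tagging variant; the mask and shift assume the 48-bit layout described above, and the helper names are mine:

    #include <stdint.h>

    #define ADDR_MASK 0x0000FFFFFFFFFFFFULL   /* low 48 bits: the address */
    #define TAG_SHIFT 48                      /* high 16 bits: the tag    */

    /* Pack a user-space pointer (bit 47 assumed zero) together with a tag. */
    static uint64_t pack(void *p, uint16_t tag)
    {
        return ((uint64_t)tag << TAG_SHIFT) | ((uintptr_t)p & ADDR_MASK);
    }

    static uint16_t tag_of(uint64_t boxed) { return (uint16_t)(boxed >> TAG_SHIFT); }

    /* Mask the tag bits off before dereferencing. */
    static void *ptr_of(uint64_t boxed) { return (void *)(uintptr_t)(boxed & ADDR_MASK); }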
I've read about Intel 5-level paging, which would extend the address range from 48 to 57 bits, significantly reducing the leftover bits and rendering NaN-boxing impossible. The Linux kernel has already added support for this paging scheme.
Given that 48 bits correspond to 262,144 GiB of addressable memory, we can assume we won't need the 57-bit range anytime soon on consumer devices like PCs, laptops and phones. One might therefore assume that those devices will stay in 48-bit mode for a long time to come, with the above-mentioned techniques remaining viable, while 57-bit mode will be used only on servers and supercomputers.
Am I correct to make those assumptions? Or are there indicators that even consumer-scale devices will use the 57 bit mode in the near future?
Even if memory-mapped persistent storage becomes widespread (NV-DIMM), it'll be a while before consumer PCs have more than 64 TiB or 128 TiB of storage + DRAM. Remember that high-half kernels want half the virtual address space for kernel use, and typically want to direct-map all physical memory into one big contiguous range of virtual addresses, as well as making other mappings in kernel space. See https://www.kernel.org/doc/Documentation/x86/x86_64/mm.txt for what Linux does.
As you suspect, OSes wouldn't actually enable PML5 on computers that have far less than 256TiB of physical address space. There's no need for that much virtual address space and it has a performance cost (more expensive page-walks from another level of page tables). The page-walk hardware wouldn't always be able to keep the two actually-used top-level entries cached; invalidations of everything on CR3 changes can force flushing. (Page-walk hardware can in general cache upper levels of the radix tree to speed up TLB misses for nearby pages.)

Dividing a 16-Bit Register Into Two 8-Bit Parts

I have encountered such registers on some websites and in my textbook. Generally, 16-bit registers are divided into two parts; the two 8-bit parts are labelled L (low) and H (high).
Why is this done?
Does it mean we work on the 8-bit registers?
Do these low and high parts specify an input for utilizing different parts of the register?
If the CPU you are talking about is from the 8086 family, then yes, it has 16-bit general-purpose registers that can be accessed directly, meaning you can move 16 bits at a time, or load only one byte (8 bits) into either the lower part of the register (the least significant bits) or the higher part (the most significant bits).
Why is this done?
I don't know all the reasons, but there must have been a mix of trade-offs at the time the CPU was designed. Remember that its predecessors were 8-bit CPUs, which may have imposed some backward-compatibility requirements.
Does it mean we work on the 8-bit registers?
The CPU works with the 16-bit registers, but you can address the lower part or the higher part individually.
Do these low and high parts specify an input for utilizing different parts of the register?
Input or output; you can also read their values.
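
In C terms, the AX = AH:AL split can be pictured roughly like this (a sketch with made-up helper names, not actual 8086 code):

    #include <stdint.h>

    static uint8_t low_byte(uint16_t ax)  { return (uint8_t)(ax & 0x00FF); }  /* "AL" */
    static uint8_t high_byte(uint16_t ax) { return (uint8_t)(ax >> 8); }      /* "AH" */

    /* Write one half without disturbing the other, as MOV AL / MOV AH would. */
    static uint16_t set_low(uint16_t ax, uint8_t al)
    {
        return (uint16_t)((ax & 0xFF00) | al);
    }

    static uint16_t set_high(uint16_t ax, uint8_t ah)
    {
        return (uint16_t)((ax & 0x00FF) | ((uint16_t)ah << 8));
    }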

What is the difference when programming 8-bit, 16-bit and 32-bit microcontrollers?

People always want an engineer who can program, or has experience with, 8-, 16- and 32-bit controllers. I still cannot figure out: is there a huge difference when we program 8-, 16- and 32-bit microcontrollers? Or is it just different in terms of declaring variables...?
Similar to Weather Vane's comment. Is there a difference between driving a wee bitty compact car, a pickup truck, a moving van and a semi truck (tractor-trailer)? Well, in a lot of respects they are the same: gas, brakes, steering wheel, seat, door, window, radio, headlights, all that stuff; you brake and gas and steer. Could you move your house with a Smart car? Well, not everything, although you could maybe get a small trailer for the bed and couch, but you could move a box or two at a time and make a ton of trips; the pickup takes fewer trips; the semi maybe one trip, depending on how much stuff you have.
If you are programming at a high level like C, then initially it will feel the same: gas, brake, steer. But it is like taking many trips to move your stuff, or a few, or one. You can in general do 64-bit math on an 8-bit processor, or a 16-bit one, and we know you can on a 32-bit one because we do that all the time, and naturally on a 64-bit one. It just takes more trips: you have to break the math down into parts and do the parts one at a time. And naturally, like using your Fiat to move the stuff in your apartment, it takes a lot more time than using a truck.
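
As a rough sketch of what "more trips" means, here is a 64-bit add carried out one byte at a time, the way an 8-bit core (or its compiler's helper library) has to do it; little-endian storage is assumed and the function name is mine:

    #include <stdint.h>

    /* a += b, both 64-bit values stored as 8 bytes, least significant first. */
    static void add64_bytewise(uint8_t a[8], const uint8_t b[8])
    {
        unsigned carry = 0;
        for (int i = 0; i < 8; i++) {
            unsigned sum = a[i] + b[i] + carry;
            a[i]  = (uint8_t)sum;      /* keep the low byte   */
            carry = sum >> 8;          /* carry into the next */
        }
    }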
I just saw someone here using floating point on a Microchip PIC; you see that all too often (float on a microcontroller). If the compiler has a library for it, it lets folks try it, but they very quickly run out of flash and/or RAM, and their performance is dreadful.
So in some respects it is about knowing that variables are not the same sizes you are used to on Windows or Linux, that things take longer, and that you have a lot fewer resources; debugging is, or can be, quite a bit different. It's like getting out of your VW Bug, which is the only car you have ever driven, then getting into a pickup truck or moving van and clipping the curbs or parked cars every time you turn, not being able to stop at the lights, etc. Eventually you get used to it. Unlike the moving truck, you usually can't hurt the bits in the processor when you screw up; you can let smoke out with software and brick a system, sure, but more often the program doesn't build, or doesn't run, or runs really slowly, and you don't have to go to jail each time for running over a pedestrian standing on the corner.
In this day and age (and actually for a long time, though it is very easy now) you can find all kinds of simulators/emulators to try in order to get a feel for these platforms. The performance feel may not carry over, but if you only have 1024 bytes of RAM and a few KBytes of flash, you are going to feel that right away, and wonder why only a few lines of high-level code consume so many instructions. It is about understanding things like: maybe on that 8-bit MCU I should use mostly 8-bit variables so I don't burn so much code (unless the compiler figures it out, and it is still wasteful); likewise use mostly 16-bit variables on the 16-bit part and mostly 32-bit on the 32-bit part. (Trying to "conserve" memory on a larger platform can actually cost you more cycles, since the compiler has to sign-extend or clip after every operation, adding extra instructions, if the part has no native instructions for the smaller sizes.)
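
A tiny example of matching the variable size to the core (hypothetical, with a made-up port pointer): on an 8-bit AVR the uint8_t counter below fits in a single register, whereas declaring it as a 32-bit type would cost several registers and multi-instruction increments and compares.

    #include <stdint.h>

    void pulse_pin(volatile uint8_t *port)
    {
        for (uint8_t i = 0; i < 100; i++) {   /* 8-bit counter on an 8-bit core */
            *port ^= 0x01;                    /* toggle an assumed output bit   */
        }
    }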

What does CPU frequency represent in an Arduino board?

I'm new to Arduino and microcontrollers.
I was studying the specs and found that even the same board may run at different frequencies with different input voltages (3.3 V vs 5 V). So the question is: what does the frequency represent? Does it represent how many lines of assembly code the board is able to run, or the maximum PWM frequencies it is able to output?
A further question: if I'm looking for a board for a specific project, how do I decide which frequency I need a priori, instead of trying everything out to see which one works?
What makes me more confused is that with computer CPUs, lower-frequency CPUs can apparently run faster than higher-frequency ones (e.g. Intel). So how do I actually know how fast a microcontroller can run?
By frequency we mean the frequency of the CPU clock. Say your Arduino Uno runs at 16 MHz, which is 16,000,000 Hertz.
That means there are 16 million clock cycles per second. The CPU executes the machine-code instructions of the program, and one instruction can take any number of CPU cycles to execute, usually between 1 and 4 cycles for simple operations and a little more for heavy arithmetic or writes to memory. So the clock frequency is a rough estimate of how many instructions the CPU can run per second. A slightly better measurement is the MIPS value, "millions of instructions per second"; there are other benchmark types for CPUs which are more accurate still.
If you take a look at the instruction set manual for the AVR architecture, you can see how many cycles each instruction needs (link: http://www.atmel.com/images/Atmel-0856-AVR-Instruction-Set-Manual.pdf).
For example, for an ADD Rd, Rr instruction, an AVR CPU needs 1 clock cycle.
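
So a back-of-the-envelope timing estimate is just cycles divided by clock frequency; for example, assuming the Uno's 16 MHz clock and the 1-cycle ADD above:

    #include <stdio.h>

    #define F_CPU_HZ 16000000UL   /* assumed Arduino Uno clock */

    int main(void)
    {
        unsigned long cycles = 1;                       /* one ADD Rd,Rr         */
        double us = cycles * 1e6 / (double)F_CPU_HZ;    /* = 0.0625 microseconds */
        printf("1 ADD at 16 MHz takes %.4f us\n", us);
        return 0;
    }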
Take a desktop Intel CPU, for example. It is common these days for them to have a clock frequency of 2 GHz or more, which is 2 billion cycles per second compared to the 16 million cycles per second of the Arduino's AVR CPU, so the Intel CPU beats the Arduino by far. Then again, the Arduino is designed for completely different tasks: it is a small microcontroller with low overhead, runs no OS, etc. The use case for such a CPU (and its architecture) is simply different, which makes direct comparison unfair. There are many other factors in play too, like multi-core CPUs (4 Intel cores vs. 1 AVR core), instruction pipelining, the speed of your memory/RAM, etc. It is really hard to compare one CPU with another across every possible use case, but for general-purpose computing, desktop CPUs (AMD, Intel, the x64 architecture) far outrun the processing power of a mere Arduino AVR CPU.
I hope this clears up some confusion.
I think one point of confusion you have may be chip-specific. I am not going to look it up right now, but I do remember seeing this: a chip's spec may say that for one supply-voltage range it can handle a given frequency and for another range it cannot. I think the SparkFun 3.3 V boards run at 8 MHz and the 5.0 V boards at 16 MHz, or something like that. Anyway, that is not generally the case, but it is a chip-by-chip, vendor-by-vendor thing, and that is why you have to read the datasheet. It has nothing to do with Arduinos or AVRs specifically; it is just a general chip-design thing.
How do you know how fast your microcontroller can run? That is a very loaded question and depends on your definition of fast. If it is simply "what clock frequencies can I use", then just read the datasheet for that part and, depending on your board design, choose from what is available. If you do not have any external clocks, your choices may be more limited; you may or may not have a PLL that you can use to multiply the clock source.
If your definition of fast is how fast can I perform this task, how many whatevers per second, or how much wall-clock time it takes to complete some specific task, well, that is a benchmarking problem, and there are so many variables that there is no real answer. Yes, it is very true that one x86 can have a lower clock and run faster than another x86; historically the newer ones can do less work per clock than older ones on the same binaries, and you then have to tune the compile for the newer chip to get back some of your MIPS per MHz. But that is in part because you are using a different chip design that just speaks the same language (machine code). You can have a tall person who can recite a poem faster than a short person, both using English and the same poem; it has nothing to do with them being short or tall, just that they are different humans.
There are different AVR core variations, but not remotely on par with the different x86 architectures. So comparing an ATtiny vs. an XMEGA, you can probably have the XMEGA run "faster" at the same clock rate simply because it has more registers or a bigger address space, etc. But the instructions-per-second figure is probably not really different; it could be, but my guess is not by much.
Then there is the compiler. The compiler plays a huge role in how "fast" your code runs: change compilers, compiler versions or compiler settings, and the machine code produced from the same high-level source (C, for example) can vary greatly, which can have dramatic effects on the "speed" of the code. Take the Dhrystone, for example: it is very easy to demonstrate that the exact same source code on the exact same chip/board, at the same clock rates, etc., can execute at vastly different speeds depending on the compiler, version or command-line settings, kind of proving that the godfather of benchmarks is basically useless at providing any meaningful information.
Microcontrollers make the problem much worse, as you often run the program out of flash, and many (not all, but many) have the ability to divide or multiply the clock, or both, but the flash is not always designed for the full range. You might have a chip that boots on an internal clock at 8 MHz but has a PLL you can use to multiply that up to, say, 80 MHz. It is not uncommon, though, for the flash on a chip like that to be limited to, say, 16 MHz, so at 8 MHz the flash can deliver an item, say an instruction, every CPU clock, but at 20 MHz you have to add a wait state; although the CPU is running much faster, you can only feed it at 16 MHz, so it waits around more and then acts fast when it gets something. Is it really "faster", or is clocking it up making you slower? Certainly at just under 16 MHz in this fantasy chip I am describing you can keep it at zero wait states, so it really is faster, not necessarily twice as fast as 8 MHz since there are other factors, but definitely faster. Just at or above 16 MHz, though, you take a huge performance hit compared to just under 16 MHz; at just under 32 MHz it is pretty fast again compared to just over 16 MHz; then at just over 32 MHz another wait state kicks in and it is much slower again even though the clock is basically the same, and so on.
Then there is the fetching: how does the CPU actually fetch? Take an ARM, which fetches a chunk of data per fetch transaction even if it is not going to execute all of it. If you branch to 0x1004 and at that address there is a branch to 0x2008, the core might fetch 0x10 bytes from 0x1000 to 0x100F, then extract the word/instruction at 0x1004, decode it to find it is a branch, and then read 0x10 bytes starting at 0x2000: basically reading 0x20 bytes to find 2 instructions. Take two instructions: if both are in the same 0x10 bytes, good; if one is at 0x100C and the other at 0x2000, that is a performance hit. Take this internal behaviour and apply it to an application and all of its jumping around: changing one line of code, or adding or removing a single NOP in the bootstrap (causing the alignment of the program to change in the address space), can cause anywhere from a tiny to a large change in performance. Swapping two helper functions in the source text of your program, causing them to land at different addresses once compiled, can have anywhere from little to major performance effects without actually changing the functions themselves.
So chasing performance is, in one respect, a foolish task; in another respect, all that matters is your program as written, with the compiler you are using, on the hardware you are using. It is as fast as it is, and there are things you can do to make that code faster on that compiler, on that target, on that day, by changing compiler settings, the code, or both. Ideally you build your final firmware, performance-test that, and never build again, because if you build a year or two later it may be on a different host with a different compiler or compiler version, and all bets are off on performance.
How do you pick a board: how much flash, RAM, features, clock rate? A lot of it is experience gained by trial and error. You fortunately live in a time when you can literally try hundreds of boards, all of which cost anywhere from a few bucks to 10 or 20 each, with different CPU architectures, different chip vendors, etc. There are many compiler choices and even languages available; basically there are too many easy-to-acquire choices, unlike back in the day when the parts were pretty cheap but you may have had to build your own board, write your code in assembly, maybe even create your own assembler, and have a ROM programmer that cost hundreds to thousands of dollars. So go with the AVR you have and play with its features; play with the compiler and/or assembly, or both. Do experiments to see whether there are fetch effects or not. If you have clocking choices, mess with those and see what happens.
Of course all of this starts with reading the chip documentation from the vendor.

32-bit operation vs 64-bit operation on a 64-bit machine/OS

Which operation, i.e. a 32-bit operation or a 64-bit operation (like masking a 32-bit flag or a 64-bit flag), would be cheaper on a 64-bit machine?
As you don't specify an architecture, I can only give a general answer, as it depends on the operation and on the processor architecture in question. Once you have the data in a CPU register, most operations will usually take the same amount of time regardless of whether the value was originally 32 or 64 bits.
However, there can be some differences on some architectures in how the data gets into a register. Here are some situations where a "native" value may be faster than a smaller value on some hardware:
Fetching data
Fetching a "native sized" value may be faster than fetching a smaller value. That is, the processor may need to fetch 64 bits regardless, and then mask/shift off 32 bits of it to "load" a 32-bit value. This masking/shifting is not required when working on a 64 bit value, so it can possibly be loaded faster. (This goes against the intuitive idea that something twice as big might take twice as long to load).
Alternatively, if the bus can handle half-width fetches, then 32 bits may be loaded in the same time as a 64 bit value.
To confuse matters more, the CPU caches can change results as well. Usually when you read one value from memory, a "line" of several memory locations are read into the cache, so that subsequent reads can be supplied from fast cache memory instead of requiring a full fetch from RAM. In which case using 32 bit values will work out faster if you are accessing many values in sequence, as twice as many of them will be cached, resulting in fewer cache misses.
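
A quick sketch of that last point (illustrative only): summing n values stored as 32-bit words reads half as much memory, and therefore touches half as many cache lines, as the same count of 64-bit words.

    #include <stdint.h>
    #include <stddef.h>

    static uint64_t sum32(const uint32_t *v, size_t n)
    {
        uint64_t s = 0;
        for (size_t i = 0; i < n; i++)
            s += v[i];                 /* 4 bytes per element */
        return s;
    }

    static uint64_t sum64(const uint64_t *v, size_t n)
    {
        uint64_t s = 0;
        for (size_t i = 0; i < n; i++)
            s += v[i];                 /* 8 bytes per element */
        return s;
    }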
Computation
The processor hardware is optimised for dealing with 64-bit values, so calculating values using 32 bits may cause it more trouble and thus could slow things down. For example, it might be able to process a double (64-bit) value "natively" but have to convert a float (32-bit) value into a double before it can process it, then convert the result back to a float afterwards.
Alternatively, there may be 32-bit and 64-bit paths through the CPU, or the CPU may be able to do any conversions required in a way that does not affect the overall execution time of the instruction, in which case they may be calculated at the same speed.
This may affect complex operations (floating point) but is unlikely to be a problem with simple ops (AND, OR, etc)
Generally speaking, a 64-bit operation and a 32-bit operation have the same cost. The 32-bit operation might end up taking an extra instruction, depending on whether the compiler needs to ensure that the upper 32 bits of a 64-bit register are cleared (or sign-extended), but that operation generally has little cost.
There might be some difference in instruction encoding that might make one take more space than the other, but that (and which way the advantage would lie) would depend on a number of factors.
It depends -- masking a flag will normally use an AND instruction, which executes quickly (~1 cycle) once the data is in a register. Loading 64 bits of data from memory will generally be slower than loading 32 bits, but if you're using more than 32 flags you'll have to load more than 32 bits of data anyway, and handling the masking in one instruction will be faster than doing it in two or three. Whether any of this makes a difference to overall speed will generally depend on the surrounding instructions -- for example, if the data is already in the cache anyway, you may not need to load it from memory.
In other words, it's difficult to make generalizations -- you just about have to look at a specific code sequence (not just one instruction, but a whole sequence) to say anything -- and the result for that sequence may not mean much about another sequence that initially looks almost identical.
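
For concreteness, the operation the question asks about boils down to something like this (a trivial sketch; either form compiles to a single AND/TEST once the flags word is in a register):

    #include <stdint.h>
    #include <stdbool.h>

    static bool has_flag32(uint32_t flags, uint32_t mask) { return (flags & mask) != 0; }
    static bool has_flag64(uint64_t flags, uint64_t mask) { return (flags & mask) != 0; }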
