How does a processor calculate bigger than its register value? - cpu-registers

So far I learned that a processor has registers, for 32 bit processor
they are 32 bits, for 64 bit they are 64 bits. So can someone explain
what happens if I give to the processor a larger value than its register
size? How is the calculation performed?

It depends.
Assuming x86 for the sake of discussion, 64-bit integers can still be handled "natively" on a 32-bit architecture. In this case, the program often uses a pair of 32-bit registers to hold the 64-bit value. For example, the value 0xDEADBEEF2B84F00D might be stored in the EDX:EAX register pair:
eax = 0x2B84F00D
edx = 0xDEADBEEF
The CPU actually expects 64-bit numbers in this format in some cases (IDIV, for example).
Math operations are done in multiple instructions. For example, a 64-bit add on a 32-bit x86 CPU is done with an add of the lower DWORDs, and then an adc of the upper DWORDs, which takes into account the carry flag from the first addition.
For even bigger integers, an arbitrary-precision arithmetic (or "big int") library is used. Here, a dynamically-sized array of bytes is used to represent the integer, with additional information (like the number of bits used). GMP is a popular choice.
Mathematical operations on big integers are done iteratively, probably in native word-size values at-a-time. For the gory details, I suggest you have a look through the source code of one of these open-source libraries.
The key to all of this, is that numeric operations are carried out in manageable pieces, and combined to produce the final result.

Related

What is the difference between AVX2 and AVX-512?

In terms of SIMD and parallelization, what is the difference between AVX2 and AVX-512? Are they the same thing or different? I just see that double8 is used in AVX-512 and double4 is used for AVX2?
I am using PyOpenCL to write kernel code in C and not sure what the difference would be.
AVX2 is a 256 bit vector instruction set. You have 256 bit registers which can be interpreted several ways (8 floats, 4 doubles, 32 bytes, etc). AVX1 supports only floating point operations, AVX2 adds 256 bit integer operations. AVX-512 is a set of 512 bit vector instructions. There are only 2 flavors of AVX, plain old AVX and AVX2. AVX-512 comes in many different flavors. You may find Intel's Intrinsics Guide interesting.
The biggest difference is simply getting twice as many operations processed per instruction. Though, there are certain instructions in AVX-512 which may make some specific things more optimal (exponent approximations, for example).

Compute sum of bits efficiently with SSE

I have done a calculation using SSE to improve the performance of my code, of which I include a minimal working example. I have included comments and the compilation line to make it as clear as possible, please ask if you need any clarification.
I am trying to sum N bits, bit[0], ..., bit[N-1], and write the result in binary in a vector result[0], ..., result[bits_N-1], where bits_N is the number of bits needed to write N in binary. This sum is performed bit-by-bit: each bit[i] is an unsigned long long int, and into its j-th bit is stored either 0 or 1. As a result, I make 64 sums, each of N bit, in parallel.
In lines 80-105 I make this sum by using 64-bit arithmetic.
In lines 107-134 I do it by using SSE: I store the first half of the sum bit[0], ...., bit[N/2-1] in the first 64 bits of _m128i objects BIT[0], ..., BIT[N/2-1], respectively. Similarly, I store bit[N/2], ...., bit[N-1] in the last 64 bits of BIT[0], ..., BIT[N/2-1], respectively, and sum all the BITs. So far everything works fine, and the 128-bit sum takes the same time as the 64-bit one. However, to collect the final result I need to sum the two halves to each other, see lines 125-132. This takes a long time, and makes me lose the gain obtained with SSE.
I am running this on an Intel(R) i7-4980HQ CPU # 2.80GHz with gcc 7.2.0.
Do you know a way around this?
The low part can be trivially saved with movq instruction or either _mm_storel_epi64 (__m128i* mem_addr, __m128i a); intrinsic storing to memory, or _mm_cvtsi128_si64 storing to register.
There is also a _mm_storeh_pd counterpart, which requires cast to pd and may cause a stall due to mixing floating points and integers.
The top part can be of course moved to low part with _mm_shuffle_epi(src, 0x4e) and then saved with movq.

MSP430 instruction cycles

Hi i am working on Tmote sky motes (MSP430 microprocessor) with contiki os. I want to know the number of instruction cycles used when I do a multiplication operation in my programming (software).
Thank you,
Avijit
The msp430 is a 16-bit system, so 32-bit values are not supported directly. A 32-bit operation is typically translated to assembly code as a sequence of 16-bit ops.
The execution times of 8-bit and 16-bit operations can be found in TI application report "The MSP430Hardware Multiplier":
Table 4. CPU Cycles Needed With Different Multiplication Modes
OPERATION: Unsigned Multiply (MPY)
SOFTWARE LOOP: 139...171
HARDWARE MPYer: 8
SPEED INCREASE: 17.4...21.4
OPERATION: Unsigned multiply-and-accumulate (MAC)
SOFTWARE LOOP: 137...169
HARDWARE MPYer: 8
SPEED INCREASE: 17.1...21.1
OPERATION: Signed Multiply (MPYS)
SOFTWARE LOOP: 145...179
HARDWARE MPYer: 8
SPEED INCREASE: 18.1...22.4
OPERATION: Signed multiply-and-accumulate (MAC)
SOFTWARE LOOP: 143...177
HARDWARE MPYer: 17
SPEED INCREASE: 8.4...10.4
The HW multiplier should be active with default compilation settings, but check the generated object file with msp430-objdump to make sure.
You can use naken_asm by Michael Kohn to disaemble an Intel hex or ELF file and it will calculate the cycle counts for each instruction. I've used it in the past and the cycle counter is OK for CPU (such as in your Tmote) but not fully supported in CPUX.
You can invoke it from the command line as simply as:
naken_util -disasm <infile>
where <infile> is the name of your hex or ELF file. The default processor is MSP430, but you'd need the assembly listing from your compiler in order to be able to match up the original code with the disassembled code which includes cycle counts.
Another alternative would be to use MSPDebug's tracer option which can track running software and provide an up-to-date instruction cycle count. However, I've never used it for that purpose so cannot provide an example.

Bad floating-point magic

I have a strange floating-point problem.
Background:
I am implementing a double-precision (64-bit) IEEE 754 floating-point library for an 8-bit processor with a large integer arithmetic co-processor. To test this library, I am comparing the values returned by my code against the values returned by Intel's floating-point instructions. These don't always agree, because Intel's Floating-Point Unit stores values internally in an 80-bit format, with a 64-bit mantissa.
Example (all in hex):
X = 4C816EFD0D3EC47E:
biased exponent = 4C8 (true exponent = 1C9), mantissa = 116EFD0D3EC47E
Y = 449F20CDC8A5D665:
biased exponent = 449 (true exponent = 14A), mantissa = 1F20CDC8A5D665
Calculate X * Y
The product of the mantissas is 10F5643E3730A17FF62E39D6CDB0, which when rounded to 53 (decimal) bits is 10F5643E3730A1 (because the top bit of 7FF62E39D6CDB0 is zero). So the correct mantissa in the result is 10F5643E3730A1.
But if the computation is carried out with a 64-bit mantissa, 10F5643E3730A17FF62E39D6CDB0 is rounded up to 10F5643E3730A1800, which when rounded again to 53 bits becomes 10F5643E3730A2. The least significant digit has changed from 1 to 2.
To sum up: my library returns the correct mantissa 10F5643E3730A1, but the Intel hardware returns (correctly) 10F5643E3730A2, because of its internal 64-bit mantissa.
The problem:
Now, here's what I don't understand: sometimes the Intel hardware returns 10F5643E3730A1 in the mantissa! I have two programs, a Windows console program and a Windows GUI program, both built by Qt using g++ 4.5.2. The console program returns 10F5643E3730A2, as expected, but the GUI program returns 10F5643E3730A1. They are using the same library function, which has the three instructions:
fldl -0x18(%ebp)
fmull -0x10(%ebp)
fstpl 0x4(%esp)
And these three instructions compute a different result in the two programs. (I have stepped through them both in the debugger.) It seems to me that this might be something that Qt does to configure the FPU in its GUI startup code, but I can't find any documentation about this. Does anybody have any idea what's happening here?
The instructions stream of and inputs to a function do not uniquely determine its execution. You must also consider the environment that is already established in the processor at the time of its execution.
If you inspect the x87 control word, you will find that it is set in two different states, corresponding to your two observed behaviors. In one, the precision control [bits 9:8] has been set to 10b (53 bits). In the other, it is set to 11b (64 bits).
As to exactly what is establishing the non-default state, it could be anything that happens in that thread prior to execution of your code. Any libraries that are pulled in are likely suspects. If you want to do some archaeology, the smoking gun is typically the fldcw instruction (though the control word can also be written to by fldenv, frstor, and finit.
normally it's a compiler setting. Check for example the following page for Visual C++:
http://msdn.microsoft.com/en-us/library/aa289157%28v=vs.71%29.aspx
or this document for intel:
http://cache-www.intel.com/cd/00/00/34/76/347605_347605.pdf
Especially the intel document mentions some flags inside the processor that determine the behavior of the FPU instructions. This explains why the same code behaves differently in 2 programs (one sets the flags different to the other).

Why are 8 and 256 such important numbers in computer sciences?

I don't know very well about RAM and HDD architecture, or how electronics deals with chunks of memory, but this always triggered my curiosity:
Why did we choose to stop at 8 bits for the smallest element in a computer value ?
My question may look very dumb, because the answer are obvious, but I'm not very sure...
Is it because 2^3 allows it to fit perfectly when addressing memory ?
Are electronics especially designed to store chunk of 8 bits ? If yes, why not use wider words ?
It is because it divides 32, 64 and 128, so that processor words can be be given several of those words ?
Is it just convenient to have 256 value for such a tiny space ?
What do you think ?
My question is a little too metaphysical, but I want to make sure it's just an historical reason and not a technological or mathematical reason.
For the anecdote, I was also thinking about the ASCII standard, in which most of the first characters are useless with stuff like UTF-8, I'm also trying to think about some tinier and faster character encoding...
Historically, bytes haven't always been 8-bit in size (for that matter, computers don't have to be binary either, but non-binary computing has seen much less action in practice). It is for this reason that IETF and ISO standards often use the term octet - they don't use byte because they don't want to assume it means 8-bits when it doesn't.
Indeed, when byte was coined it was defined as a 1-6 bit unit. Byte-sizes in use throughout history include 7, 9, 36 and machines with variable-sized bytes.
8 was a mixture of commercial success, it being a convenient enough number for the people thinking about it (which would have fed into each other) and no doubt other reasons I'm completely ignorant of.
The ASCII standard you mention assumes a 7-bit byte, and was based on earlier 6-bit communication standards.
Edit: It may be worth adding to this, as some are insisting that those saying bytes are always octets, are confusing bytes with words.
An octet is a name given to a unit of 8 bits (from the Latin for eight). If you are using a computer (or at a higher abstraction level, a programming language) where bytes are 8-bit, then this is easy to do, otherwise you need some conversion code (or coversion in hardware). The concept of octet comes up more in networking standards than in local computing, because in being architecture-neutral it allows for the creation of standards that can be used in communicating between machines with different byte sizes, hence its use in IETF and ISO standards (incidentally, ISO/IEC 10646 uses octet where the Unicode Standard uses byte for what is essentially - with some minor extra restrictions on the latter part - the same standard, though the Unicode Standard does detail that they mean octet by byte even though bytes may be different sizes on different machines). The concept of octet exists precisely because 8-bit bytes are common (hence the choice of using them as the basis of such standards) but not universal (hence the need for another word to avoid ambiguity).
Historically, a byte was the size used to store a character, a matter which in turn builds on practices, standards and de-facto standards which pre-date computers used for telex and other communication methods, starting perhaps with Baudot in 1870 (I don't know of any earlier, but am open to corrections).
This is reflected by the fact that in C and C++ the unit for storing a byte is called char whose size in bits is defined by CHAR_BIT in the standard limits.h header. Different machines would use 5,6,7,8,9 or more bits to define a character. These days of course we define characters as 21-bit and use different encodings to store them in 8-, 16- or 32-bit units, (and non-Unicode authorised ways like UTF-7 for other sizes) but historically that was the way it was.
In languages which aim to be more consistent across machines, rather than reflecting the machine architecture, byte tends to be fixed in the language, and these days this generally means it is defined in the language as 8-bit. Given the point in history when they were made, and that most machines now have 8-bit bytes, the distinction is largely moot, though it's not impossible to implement a compiler, run-time, etc. for such languages on machines with different sized bytes, just not as easy.
A word is the "natural" size for a given computer. This is less clearly defined, because it affects a few overlapping concerns that would generally coïncide, but might not. Most registers on a machine will be this size, but some might not. The largest address size would typically be a word, though this may not be the case (the Z80 had an 8-bit byte and a 1-byte word, but allowed some doubling of registers to give some 16-bit support including 16-bit addressing).
Again we see here a difference between C and C++ where int is defined in terms of word-size and long being defined to take advantage of a processor which has a "long word" concept should such exist, though possibly being identical in a given case to int. The minimum and maximum values are again in the limits.h header. (Indeed, as time has gone on, int may be defined as smaller than the natural word-size, as a combination of consistency with what is common elsewhere, reduction in memory usage for an array of ints, and probably other concerns I don't know of).
Java and .NET languages take the approach of defining int and long as fixed across all architecutres, and making dealing with the differences an issue for the runtime (particularly the JITter) to deal with. Notably though, even in .NET the size of a pointer (in unsafe code) will vary depending on architecture to be the underlying word size, rather than a language-imposed word size.
Hence, octet, byte and word are all very independent of each other, despite the relationship of octet == byte and word being a whole number of bytes (and a whole binary-round number like 2, 4, 8 etc.) being common today.
Not all bytes are 8 bits. Some are 7, some 9, some other values entirely. The reason 8 is important is that, in most modern computers, it is the standard number of bits in a byte. As Nikola mentioned, a bit is the actual smallest unit (a single binary value, true or false).
As Will mentioned, this article http://en.wikipedia.org/wiki/Byte describes the byte and its variable-sized history in some more detail.
The general reasoning behind why 8, 256, and other numbers are important is that they are powers of 2, and computers run using a base-2 (binary) system of switches.
ASCII encoding required 7 bits, and EBCDIC required 8 bits. Extended ASCII codes (such as ANSI character sets) used the 8th bit to expand the character set with graphics, accented characters and other symbols.Some architectures made use of proprietary encodings; a good example of this is the DEC PDP-10, which had a 36 bit machine word. Some operating sytems on this architecture used packed encodings that stored 6 characters in a machine word for various purposes such as file names.
By the 1970s, the success of the D.G. Nova and DEC PDP-11, which were 16 bit architectures and IBM mainframes with 32 bit machine words was pushing the industry towards an 8 bit character by default. The 8 bit microprocessors of the late 1970s were developed in this environment and this became a de facto standard, particularly as off-the shelf peripheral ships such as UARTs, ROM chips and FDC chips were being built as 8 bit devices.
By the latter part of the 1970s the industry settled on 8 bits as a de facto standard and architectures such as the PDP-8 with its 12 bit machine word became somewhat marginalised (although the PDP-8 ISA and derivatives still appear in embedded sytem products). 16 and 32 bit microprocessor designs such as the Intel 80x86 and MC68K families followed.
Since computers work with binary numbers, all powers of two are important.
8bit numbers are able to represent 256 (2^8) distinct values, enough for all characters of English and quite a few extra ones. That made the numbers 8 and 256 quite important.
The fact that many CPUs (used to and still do) process data in 8bit helped a lot.
Other important powers of two you might have heard about are 1024 (2^10=1k) and 65536 (2^16=65k).
Computers are build upon digital electronics, and digital electronics works with states. One fragment can have 2 states, 1 or 0 (if the voltage is above some level then it is 1, if not then it is zero). To represent that behavior binary system was introduced (well not introduced but widely accepted).
So we come to the bit. Bit is the smallest fragment in binary system. It can take only 2 states, 1 or 0, and it represents the atomic fragment of the whole system.
To make our lives easy the byte (8 bits) was introduced. To give u some analogy we don't express weight in grams, but that is the base measure of weight, but we use kilograms, because it is easier to use and to understand the use. One kilogram is the 1000 grams, and that can be expressed as 10 on the power of 3. So when we go back to the binary system and we use the same power we get 8 ( 2 on the power of 3 is 8). That was done because the use of only bits was overly complicated in every day computing.
That held on, so further in the future when we realized that 8 bytes was again too small and becoming complicated to use we added +1 on the power ( 2 on the power of 4 is 16), and then again 2^5 is 32, and so on and the 256 is just 2 on the power of 8.
So your answer is we follow the binary system because of architecture of computers, and we go up in the value of the power to represent get some values that we can simply handle every day, and that is how you got from a bit to an byte (8 bits) and so on!
(2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, and so on) (2^x, x=1,2,3,4,5,6,7,8,9,10 and so on)
The important number here is binary 0 or 1. All your other questions are related to this.
Claude Shannon and George Boole did the fundamental work on what we now call information theory and Boolean arithmetic. In short, this is the basis of how a digital switch, with only the ability to represent 0 OFF and 1 ON can represent more complex information, such as numbers, logic and a jpg photo. Binary is the basis of computers as we know them currently, but other number base computers or analog computers are completely possible.
In human decimal arithmetic, the powers of ten have significance. 10, 100, 1000, 10,000 each seem important and useful. Once you have a computer based on binary, there are powers of 2, likewise, that become important. 2^8 = 256 is enough for an alphabet, punctuation and control characters. (More importantly, 2^7 is enough for an alphabet, punctuation and control characters and 2^8 is enough room for those ASCII characters and a check bit.)
We normally count in base 10, a single digit can have one of ten different values. Computer technology is based on switches (microscopic) which can be either on or off. If one of these represents a digit, that digit can be either 1 or 0. This is base 2.
It follows from there that computers work with numbers that are built up as a series of 2 value digits.
1 digit,2 values
2 digits, 4 values
3 digits, 8 values etc.
When processors are designed, they have to pick a size that the processor will be optimized to work with. To the CPU, this is considered a "word". Earlier CPUs were based on word sizes of fourbits and soon after 8 bits (1 byte). Today, CPUs are mostly designed to operate on 32 bit and 64 bit words. But really, the two state "switch" are why all computer numbers tend to be powers of 2.
I believe the main reason has to do with the original design of the IBM PC. The Intel 8080 CPU was the first precursor to the 8086 which would later be used in the IBM PC. It had 8-bit registers. Thus, a whole ecosystem of applications was developed around the 8-bit metaphor. In order to retain backward compatibility, Intel designed all subsequent architectures to retain 8-bit registers. Thus, the 8086 and all x86 CPUs after that kept their 8-bit registers for backwards compatibility, even though they added new 16-bit and 32-bit registers over the years.
The other reason I can think of is 8 bits is perfect for fitting a basic Latin character set. You cannot fit it into 4 bits, but you can in 8. Thus, you get the whole 256-value ASCII charset. It is also the smallest power of 2 for which you have enough bits into which you can fit a character set. Of course, these days most character sets are actually 16-bit wide (i.e. Unicode).
Charles Petzold wrote an interesting book called Code that covers exactly this question. See chapter 15, Bytes and Hex.
Quotes from that chapter:
Eight bit values are inputs to the
adders, latches and data selectors,
and also outputs from these units.
Eight-bit values are also defined by
switches and displayed by lightbulbs,
The data path in these circuits is
thus said to be 8 bits wide. But
why 8 bits? Why not 6 or 7 or 9 or
10?
... there's really no reason why
it had to be built that way. Eight
bits just seemed at the time to be a
convenient amount, a nice biteful of
bits, if you will.
...For a while, a byte meant simply
the number of bits in a particular
data path. But by the mid-1960s. in
connection with the development of
IBM's System/360 (their large complex
of business computers), the word came
to mean a group of 8 bits.
... One reason IBM gravitated toward
8-bit bytes was the ease in storing
numbers in a format known as BCD.
But as we'll see in the chapters ahead, quite by coincidence a byte is
ideal for storing text because most
written languages around the world
(with the exception of the ideographs
used in Chinese, Japanese and Korean)
can be represented with fewer than 256
characters.
Historical reasons, I suppose. 8 is a power of 2, 2^2 is 4 and 2^4 = 16 is far too little for most purposes, and 16 (the next power of two) bit hardware came much later.
But the main reason, I suspect, is the fact that they had 8 bit microprocessors, then 16 bit microprocessors, whose words could very well be represented as 2 octets, and so on. You know, historical cruft and backward compability etc.
Another, similarily pragmatic reason against "scaling down": If we'd, say, use 4 bits as one word, we would basically get only half the troughtput compared with 8 bit. Aside from overflowing much faster.
You can always squeeze e.g. 2 numbers in the range 0..15 in one octet... you just have to extract them by hand. But unless you have, like, gazillions of data sets to keep in memory side-by-side, this isn't worth the effort.

Resources