I'm learning about registers. It looks like 32-bit registers are divided up so that they can be accessed as 8-bit registers. This looks very inefficient. Performance would be improved if they didn't do this. So why do they do it?
Also, it costs extra money to design them like this. Why not make the CPU cheaper by not doing it?
Because if you're only dealing with 8-bit values, it'd be inefficient to have to issue all the bitmasks needed to limit those 32/64-bit registers to just the 8 bits you're working on.
So, x86 registers have:
AH/AL = high/low 8 bits of a 16-bit register
AX = whole 16-bit register
EAX = whole 32-bit register
It's far more efficient, in terms of instruction size, to have
mov ah, 0xXX (2 bytes)
rather than forcing
mov ax, 0x00XX (4 bytes, in 32-bit code)
mov eax, 0x000000XX (5 bytes)
As for "designing the cpu to make it cheaper" - it's for backwards compatibility. All modern x86 processors are actually internally a RISC design, with a major chunk of silicon dedicated to taking the x86 instructions coming in and converting them into the CPU's own internal micro-ops (which is basically a RISC instruction set).
The Intel 8080, which was the first "mainstream" microprocessor, had seven main 8-bit registers (A, B, C, D, E, H, and L). Because memory addresses were 16 bits, instructions that needed to use a non-constant memory operand would use a pair of registers (most commonly H and L, but sometimes B and C, or D and E) to form the address. Because the registers in the aforementioned pairs were often used together to represent 16-bit values, there were a few instructions which could operate upon the register pairs as 16-bit quantities. An instruction to add BC to HL would perform the addition by adding C to L, and then by adding B to H (plus a carry if needed). I'm not familiar enough with the 4004 or 8008 (the two predecessors of the 8080) to know if either of them did anything similar in its architecture.
When Intel produced the 8088, they included a full 16-bit arithmetic unit, but they wanted code which was written for the 8080 to be easily convertible to their new architecture. On the 8080, a lot of code had been written to "manually" form addresses out of the 8-bit parts, since doing so was often much faster than using the 16-bit instructions to do the math. For example, if one needed to access some specified table of 256 entries with an index stored in A, one could have done something like (Zilog notation shown, but the 8080 had the same instructions):
ld hl,(baseOfTable) ; 16-bit base address of the table
ld c,a              ; index goes into the low half of BC
ld b,#0             ; clear the high half, so BC = index
add hl,bc           ; HL = base + index
ld a,(hl)           ; fetch the table entry
but if one could make certain the table was aligned on a 256-byte boundary, one could simplify the code considerably:
ld l,a              ; low byte of the address = index
ld a,(tableBaseMSB) ; just load the MSB--the LSB is known to be zero
ld h,a              ; high byte of the address = the table's page
ld a,(hl)           ; fetch the table entry
With the 8088 instruction set, it wouldn't terribly often be useful for code written "from scratch" to access the upper and lower parts of registers separately, but there was a lot of code written for the 8080 which used such techniques, and Intel wanted to make it easy for people to convert such code for use on the 8088. Allowing registers to be built from 8-bit pieces was helpful in that regard.
Incidentally, there was another advantage to Intel's architecture: since it included four 16-bit-only registers and four registers which could be used as either one 16-bit register or two 8-bit parts, that made it possible for code to hold 12 values in registers if eight of them were 255 or less, or eleven values if six of them were 255 or less, etc. When using architectures with more registers, eking out an extra register here and there isn't quite so important, but on the 8088 it was often very helpful.
The ability to address portions of the registers has no effect on their performance when used as 32-bit registers. In that case, this capability just isn't used.
CPUs, regardless of their native bit size, need to manipulate 8-bit values very, very often. Strings of text, for example, are frequently manipulated as consecutive 8-bit values. International character sets are often manipulated as sets of consecutive 16-bit values. So being able to operate rapidly on 8-bit and 16-bit values is of tremendous importance.
If you're asking as a practical matter for x86 CPUs, it's too late. The very first PC CPUs didn't even have 32-bit registers, and compatibility has been retained all the way through.
Backwards compatibility. Processor manufacturers did not want to break compatibility with old software. This is the main reason why x86_64 processors still support 16-bit software (virtual 8086 mode). If you look closely, you'll see that the majority of the features in the x86 architecture are shaped by compatibility concerns. I'm not hating.
On common 64-bit architectures like x86-64 and arm64, usually only 48 bits are used for memory addressing, while the other bits are copies of bit 47 (which is usually zero for user-space programs). Thus, the remaining 16 bits can be used to store additional data like type tags etc., as long as those bits are masked off before dereferencing. Alternatively, the 48 bits can fit into the NaN representation of a 64-bit float. Both techniques are often used by dynamic/interpreted languages.
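For illustration, here is a minimal sketch of the tag-in-the-upper-bits technique described above (helper names are my own, assuming the 48-bit layout with bit 47 replicated upward):

#include <stdint.h>

/* Hypothetical helpers: pack a 16-bit tag into the unused upper bits of a
 * 48-bit pointer, and recover both pieces later. */
static inline uint64_t tag_ptr(void *p, uint16_t tag) {
    return ((uint64_t)tag << 48) |
           ((uint64_t)(uintptr_t)p & 0x0000FFFFFFFFFFFFull);
}

static inline uint16_t get_tag(uint64_t v) {
    return (uint16_t)(v >> 48);
}

static inline void *get_ptr(uint64_t v) {
    /* Shift left, then arithmetic-shift right, so bit 47 is replicated into
     * the upper 16 bits again (assumes a two's-complement arithmetic shift). */
    return (void *)(uintptr_t)((int64_t)(v << 16) >> 16);
}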
I've read about Intel 5-level paging, which would extend the address range from 48 to 57 bits, thus significantly reducing the leftover bits and also rendering NaN-boxing impossible. The Linux kernel has already added support for this paging scheme.
Given that 48 bits correspond to 262,144 GiB (256 TiB) of memory, we can assume that we won't need the 57-bit range anytime soon on consumer devices like PCs, laptops and phones. Thus one might assume that on those devices we will remain in 48-bit mode for a long time to come, with the above-mentioned techniques remaining viable, while 57-bit mode will only be used for servers/supercomputers.
Am I correct to make those assumptions? Or are there indicators that even consumer-scale devices will use the 57 bit mode in the near future?
Even if memory-mapped persistent storage becomes widespread (NV-DIMM), it'll be a while before consumer PCs have more than 64TiB or 128TiB of storage + DRAM. Remember that high-half kernels want half the virtual address space for kernel use, and typically want to direct-map all physical memory into a big contiguous range of virtual addresses. As well as making other mappings in kernel space, I think. e.g. see https://www.kernel.org/doc/Documentation/x86/x86_64/mm.txt for what Linux does.
As you suspect, OSes wouldn't actually enable PML5 on computers that have far less than 256TiB of physical address space. There's no need for that much virtual address space and it has a performance cost (more expensive page-walks from another level of page tables). The page-walk hardware wouldn't always be able to keep the two actually-used top-level entries cached; invalidations of everything on CR3 changes can force flushing. (Page-walk hardware can in general cache upper levels of the radix tree to speed up TLB misses for nearby pages.)
AVX introduced the instruction vperm2f128 (exposed via _mm256_permute2f128_si256), while AVX2 introduced vperm2i128 (exposed via _mm256_permute2x128_si256).
They both seem to do exactly the same thing, and their respective latencies and throughputs also seem to be identical.
So why do both instructions exist? There has to be some reasoning behind that. Is there maybe something I have overlooked? Given that AVX2 operates on data structures introduced with AVX, I cannot imagine that a processor will ever exist that supports AVX2 but not AVX.
There's a bit of a disconnect between the intrinsics and the actual instructions that are underneath.
AVX:
All 3 of these generate exactly the same instruction, vperm2f128:
_mm256_permute2f128_pd()
_mm256_permute2f128_ps()
_mm256_permute2f128_si256()
The only difference is the types - which don't exist at the instruction level.
vperm2f128 is a 256-bit floating-point instruction. In AVX, there are no "real" 256-bit integer SIMD instructions. So even though _mm256_permute2f128_si256() is an "integer" intrinsic, it's really just syntax sugar for this:
_mm256_castpd_si256(
    _mm256_permute2f128_pd(
        _mm256_castsi256_pd(x),
        _mm256_castsi256_pd(y),
        imm
    )
);
This does a round trip from the integer domain to the FP domain, thus incurring bypass delays. As ugly as this looks, it is the only way to do it in AVX-only land.
vperm2f128 isn't the only instruction to get this treatment; I found at least 3 of them:
vperm2f128 / _mm256_permute2f128_si256()
vextractf128 / _mm256_extractf128_si256()
vinsertf128 / _mm256_insertf128_si256()
Together, it seems that the use case of these intrinsics is to load data as 256-bit integer vectors, then shuffle them into multiple 128-bit integer vectors for integer computation. Likewise the reverse, where you store as 256-bit vectors.
Without these "hack" intrinsics, you would need to use a lot of cast intrinsics.
Either way, a competent compiler will try to optimize the types as well. Thus it will generate floating-point loads/stores and shuffles even if you are using 256-bit integer loads. This reduces the number of bypass delays to only one layer (when you go from the FP shuffle to 128-bit integer computation).
AVX2:
AVX2 cleans up this madness by adding proper 256-bit integer SIMD support for everything - including the shuffles.
The vperm2i128 instruction is new along with a new intrinsic for it, _mm256_permute2x128_si256().
This, along with _mm256_extracti128_si256() and _mm256_inserti128_si256() lets you do 256-bit integer SIMD and actually stay completely in the integer domain.
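For instance, here's a minimal sketch (the helper name is my own) of how code might pick the integer-domain instruction when AVX2 is available and fall back to the FP-domain one under plain AVX:

#include <immintrin.h>

/* Swap the two 128-bit lanes of a 256-bit integer vector. */
static inline __m256i swap_lanes(__m256i v) {
#ifdef __AVX2__
    return _mm256_permute2x128_si256(v, v, 0x01);  /* vperm2i128, integer domain */
#else
    return _mm256_permute2f128_si256(v, v, 0x01);  /* vperm2f128, FP domain */
#endif
}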
The distinction between integer and FP versions of the same instructions has to do with bypass delays. In older processors, there were delays when moving data between the int and FP domains. While the SIMD registers themselves are type-agnostic, the hardware implementation isn't, and there is extra latency to get data output by an FP instruction into an input of an integer instruction (and vice versa).
Thus it was important (from a performance standpoint) to use the correct instruction type to match the actual datatype that was being operated on.
On the newest processors (Skylake and later?), there don't seem to be any more int/FP bypass delays with regard to the shuffle instructions. While the instruction set still has this distinction, shuffle instructions that do the same thing with different "types" probably map to the same uop now.
The 68k registers are divided into two groups of eight: eight data registers (D0 to D7) and eight address registers (A0 to A7). What is the purpose of this separation? Would it not be better if they were united?
The short answer is, this separation comes from the architecture limitations and design decisions made at the time.
The long answer:
The M68K implements quite a lot of addressing modes (especially when compared with RISC-based processors), with many of its instructions supporting most (if not all) of them. This gives a large variety of addressing-mode combinations within every instruction.
This also adds complexity in terms of opcode execution. Take the following example:
move.l $10(pc), -$20(a0,d0.l)
The instruction is just to copy a long-word from one location to another, simple enough. But in order to actually perform the operation, the processor needs to figure out the actual (raw) memory addresses to work with for both source and destination operands. This process, in which operands addressing modes are decoded (resolved), is called the effective address calculation.
For this example:
In order to calculate the source effective address, $10(pc), the processor loads the value of the PC (program counter) register and adds $10 to it.
In order to calculate the destination effective address, -$20(a0,d0.l), the processor loads the value of the A0 register, adds the value of the D0 register to it, then subtracts $20.
This is quite a lot of calculation for a single opcode, isn't it?
But the M68K is quite fast in performing these calculations. In order to calculate effective addresses quickly, it implements a dedicated Address Unit (AU).
As a general rule, operations on data registers are handled by the ALU (Arithmetic Logical Unit) and operations involving address calculations are handled by the AU (Address Unit).
The AU is well optimized for 32-bit address operations: it performs 32-bit addition/subtraction within one bus cycle (4 CPU ticks), which the ALU doesn't (the ALU takes 2 bus cycles for 32-bit operations).
However, the AU is limited to just loads and basic addition/subtraction operations (as dictated by the addressing modes), and it's not connected to the CCR (Condition Code Register), which is why operations on address registers never update flags.
That said, the AU was there to optimize the calculation of complex addressing modes, but it just couldn't replace the ALU completely (after all, there were only about 68K transistors in the whole M68K), hence there are two register sets (data and address registers), each with its own dedicated unit.
This is just based on a quick look, but a single uniform set of 16 registers would obviously be easier to program. The problem is that instructions would then have to be able to encode any of the 16 registers in each operand field, which would double the number of opcodes needed. Splitting them into two banks of eight is not ideal, but it gives access to more registers in general.
I have two __m128i values, a and b, that I want to shuffle so that the upper 64 bits of a fall in the lower 64 bits of dst and the lower 64 bits of b fall in the upper 64 bits of dst, i.e.
dst[ 0:63] = a[64:127]
dst[64:127] = b[0:63]
Equivalent to:
__m128i dst = _mm_unpacklo_epi64(_mm_srli_si128(a, 8), b);
or
__m128i dst = _mm_castpd_si128(_mm_shuffle_pd(_mm_castsi128_pd(a), _mm_castsi128_pd(b), 1));
Is there a better way to do this than the first method? The second one is just one instruction, but the switch to the floating point SIMD execution is more costly than the extra instruction from the first.
Latency isn't always the worst thing ever. If it's not part of a loop-carried dep-chain, then just use the single instruction.
Also, there might not be any! Agner Fog's microarch doc says he found no extra latency in some cases when using the "wrong" type of shuffle or boolean, on Sandybridge. Blends still have the extra latency. On Haswell, he says there are no extra delays at all for mixing types of shuffle. (pg 140, Data Bypass Delays.)
So go ahead and use shufps, unless you care a lot about your code being fast on Nehalem. (Previous designs (Merom/Conroe and Penryn) didn't have extra bypass delays for using the wrong move or shuffle.)
For AMD, shufps runs in the ivec domain, same as integer shuffles, so it's fine to use it. Like Intel, FP blends run in the FP domain, and thus have no bypass delay for FP data.
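For reference, a minimal sketch of the shufps route for the asker's exact pattern (the helper name is mine):

#include <immintrin.h>

/* dst.lo64 = a.hi64, dst.hi64 = b.lo64, done with one shufps. */
static inline __m128i combine_halves(__m128i a, __m128i b) {
    __m128 r = _mm_shuffle_ps(_mm_castsi128_ps(a),
                              _mm_castsi128_ps(b),
                              _MM_SHUFFLE(1, 0, 3, 2));  /* imm8 = 0x4E */
    return _mm_castps_si128(r);
}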
If you include multiple asm versions depending on which instruction sets are supported, without going completely nuts about having the optimal version of everything for every CPU like x264 does, you might use wrong-type ops in your version for AVX CPUs, but use multiple instructions in your non-AVX version. Nehalem has large penalties (2 cycle bypass delays for each domain transition), while Sandybridge is 0 or 1 cycle. SnB is the first generation with AVX.
Pre-Nehalem (no SSE4.2) is so old that it's probably not worth tuning a version specifically for it, even though it doesn't have any penalties for "wrong type" shuffles. Nehalem is right on the cusp of being kinda slow, so software running on those systems will have the hardest time operating in real-time, or not feeling slow. Thus, being bad on Nehalem would add to a bad user experience since their system is already not the fastest.
I've got to learn assembly and I'm very confused as to what the different registers do/point to.
On some architectures, like MIPS, all registers are created equal, and there is really no difference beyond the name of the register (and software conventions). On x86 you can mostly use any registers for general-purpose computing, but some registers are implicitly bound to the instruction set.
Lots of information about special purposes for registers can be found here.
Examples:
eax, accumulator: many arithmetic instructions implicitly operate on eax. There are also special shorter EAX-specific encodings for many instructions: add eax, 123456 is 1 byte shorter than add ecx, 123456, for example. (add eax, imm32 vs. add r/m32, imm32)
ebx, base: few implicit uses, but xlat is one that matches the "Base" naming. Still relevant: cmpxchg8b. Because it's rarely required for anything specific, some 32-bit calling-conventions / ABIs use it as a pointer to the "global offset table" in Position Independent Code (PIC).
edx, data: some arithmetic operations implicitly operate on the 64-bit value in edx:eax
ecx, counter: used for shift counts, and for rep movs. Also, the mostly-obsolete loop instruction implicitly decrements ecx.
esi, source index: some string operations read a string from the memory pointed to by esi
edi, destination index: some string operations write a string to the memory pointed to by edi. e.g. rep movsb copies ECX bytes from [esi] to [edi].
ebp, base pointer: normally used to point to local variables. Used implicitly by leave.
esp, stack pointer: points to the top of the stack, used implicitly by push, pop, call and ret
The x86 instruction set is a complex beast, really. Many instructions have shorter forms that implicitly use one register or another. Some registers can be used to do certain addressing while others cannot.
The Intel 80386 Programmer's Reference Manual is an irreplaceable resource; it basically tells you everything there is to know about x86 assembly, except for newer extensions and performance on modern hardware.
The PC Assembly (e)book is a great resource for learning assembly.
The sp register is the stack pointer, used for stack operation like push and pop.
The stack is known as a LIFO (last-in, first-out) structure, meaning that the last thing pushed on is the first thing popped off. It's used, among other things, to implement the ability to call functions.
The bp register is the base pointer, and is commonly used for stack frame operations.
This means that it's a fixed reference to locate local variables, passed parameters and so forth on the stack, for a given level (while sp may change during the execution of a function, bp usually does not).
If you're looking at assembly language like:
mov eax, [bp+8]
you're seeing the code access a stack-level-specific variable.
The si register is the source index, typically used for mass copy operations (di is its equivalent destination index). Intel had these registers along with specific instructions for quick movement of bytes in memory.
The e- variants are just the 32-bit versions of these (originally) 16-bit registers. And, as if that weren't enough, we have 64-bit r- variants as well :-)
Perhaps the simplest place to start is here. It's specific to the 8086 but the concepts haven't changed that much. The simplicity of the 8086 compared to the current crop will be a good starting point for your education. Once you've learned the basics, it will be much easier to move up to the later members of the x86 family.
Transcribed here and edited quite a bit, to make the answer self-contained.
GENERAL PURPOSE REGISTERS
The 8086 CPU has 8 general-purpose registers, each with its own name:
AX - the accumulator register (divided into AH/AL). Probably the most commonly used register for general purpose stuff.
BX - the base address register (divided into BH/BL).
CX - the count register (divided into CH/CL). Special purpose instructions for looping and shifting.
DX - the data register (divided into DH/DL). Used with AX for some MUL and DIV operations, and for specifying ports in some IN and OUT operations.
SI - source index register. Special purpose instruction to use this as a source of mass memory transfers (DS:SI).
DI - destination index register. Special purpose instruction to use this as a destination of mass memory transfers (ES:DI).
BP - base pointer, primarily used for accessing parameters and variables on the stack.
SP - stack pointer, used for the basic stack operations.
SEGMENT REGISTERS
CS - points at the segment containing the current instruction.
DS - generally points at segment where variables are defined.
ES - extra segment register, it's up to a coder to define its usage.
SS - points at the segment containing the stack.
Although it is possible to store any data in the segment registers, this is never a good idea. The segment registers have a very special purpose - pointing at accessible blocks of memory.
Segment registers work together with general-purpose registers to access any memory value. For example, if we would like to access memory at physical address 12345h, we could set DS = 1230h and SI = 0045h. This way we can access much more memory than with a single register, which is limited to 16-bit values.
The CPU makes a calculation of the physical address by multiplying the segment register by 10h and adding the general purpose register to it (1230h * 10h + 45h = 12345h):
12300
+0045
=====
12345
The address formed with 2 registers is called an effective address.
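Expressed in C, the calculation from the example above is simply (a minimal sketch):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint16_t ds = 0x1230, si = 0x0045;
    uint32_t physical = (uint32_t)ds * 0x10 + si;  /* segment * 16 + offset */
    printf("%05X\n", (unsigned)physical);          /* prints 12345 */
    return 0;
}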
This usage is for real mode only (which is the only mode the 8086 had). Later processors changed these registers from segments to selectors and they are used to lookup addresses in a table, rather than having a fixed calculation performed on them.
By default BX, SI and DI registers work with DS segment register; and BP and SP work with SS segment register.
SPECIAL PURPOSE REGISTERS
IP - the instruction pointer:
Always points to the next instruction to be executed.
Holds an offset address relative to CS.
The IP register always works together with the CS segment register; combined, they form the address of the instruction being fetched.
FLAGS REGISTER
Determines the current state of the processor. These flags are modified automatically by the CPU after mathematical operations; this allows you to determine the type of the result, and to determine conditions for transferring control to other parts of the program.
Generally you cannot access these flags directly.
Carry Flag CF - this flag is set to 1 when there is an unsigned overflow. For example when you add bytes 255 + 1 (result is not in range 0...255). When there is no overflow this flag is set to 0.
Parity Flag PF - this flag is set to 1 when there is an even number of one bits in the result, and to 0 when there is an odd number of one bits.
Auxiliary Flag AF - set to 1 when there is an unsigned overflow out of the low nibble (the low 4 bits).
Zero Flag ZF - set to 1 when result is zero. For non-zero result this flag is set to 0.
Sign Flag SF - set to 1 when result is negative. When result is positive it is set to 0. (This flag takes the value of the most significant bit.)
Trap Flag TF - Used for on-chip debugging.
Interrupt enable Flag IF - when this flag is set to 1 CPU reacts to interrupts from external devices.
Direction Flag DF - this flag is used by some instructions to process data chains, when this flag is set to 0 - the processing is done forward, when this flag is set to 1 the processing is done backward.
Overflow Flag OF - set to 1 when there is a signed overflow. For example, when you add bytes 100 + 50 (result is not in range -128...127).
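To make the CF and OF examples concrete, here's a minimal C sketch of the two wrap-arounds described above (assuming a typical two's-complement machine):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint8_t u = (uint8_t)(255 + 1);   /* wraps to 0; the CPU would set CF */
    int8_t  s = (int8_t)(100 + 50);   /* wraps to -106; the CPU would set OF */
    printf("%u %d\n", (unsigned)u, (int)s);   /* prints "0 -106" */
    return 0;
}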
Here's a simplified summary:
ESP is the current stack pointer, so you generally only update it to manipulate the stack; EBP is intended for stack manipulation too, for example saving the value of ESP before allocating stack space for local variables. But you can use EBP as a general-purpose register too.
ESI is the Extended Source Index register; "string" instructions (nothing to do with C strings) like MOVS use ESI and EDI.
Memory Addressing:
x86 CPUs have these special registers called "segment registers", each of them can point to different address, for example DS (commonly called data segment) may point to 0x1000000, and SS (commonly called stack segment) may point to 0x2000000.
When you use EBP and ESP, the default segment register is SS; for ESI (and other general-purpose registers), it's DS. For example, let's say DS=0x1000000, SS=0x2000000, EBP=0x10, ESI=0x10; then:
mov eax,[ebp] //loads from address 0x2000000 + 0x10
mov eax,[esi] //loads from address 0x1000000 + 0x10
You can also specify a segment register to use, overriding the default:
mov eax,ds:[ebp]
In terms of addition, subtraction, logical operations, etc, there's no real difference between them.