Program Counter and Instruction Register

The program counter holds the address of the instruction that should be executed next, while the instruction register holds the actual instruction to be executed. Wouldn't one of them be enough?
And what is the length of each of these registers?
Thanks.

You will always need both. The program counter (PC) holds the address of the next instruction to be executed, while the instruction register (IR) holds the encoded instruction. Upon fetching the instruction, the program counter is incremented by one "address value" (to the location of the next instruction). The instruction is then decoded and executed appropriately.
The reason you need both is that if you had only a program counter and used it for both purposes, you would get the following troublesome system:
[Beginning of program execution]
PC contains 0x00000000 (say this is the start address of the program in memory).
The encoded instruction is fetched from memory and placed into the PC.
The instruction is decoded and executed.
Now it is time to move on to the next instruction, so we go back to the PC to see what the address of the next instruction is. However, we have a problem: the PC's previous contents were overwritten, so we have no idea where the next instruction is.
Therefore, we need another register to hold the actual instruction fetched from memory. Once we fetch that instruction, we increment the PC so that we know where to fetch the next one.
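To make the fetch/increment/decode/execute ordering concrete, here is a tiny sketch of a toy machine with a separate PC and IR. The encoding and opcodes are invented purely for illustration; this is not any real ISA.

```c
#include <stdint.h>
#include <stdio.h>

/* Toy machine for illustration only (not a real ISA): each instruction is one
   byte, with a 4-bit opcode in the high nibble and a 4-bit operand below. */
enum { OP_HALT = 0x0, OP_PRINT = 0x1 };

int main(void) {
    uint8_t memory[] = { 0x15, 0x17, 0x00 };   /* PRINT 5, PRINT 7, HALT */
    uint32_t pc = 0;   /* program counter: address of the next instruction */
    uint8_t  ir;       /* instruction register: the instruction being run  */

    for (;;) {
        ir = memory[pc];   /* fetch: the IR keeps hold of the instruction...   */
        pc += 1;           /* ...so the PC can already advance to the next one */

        uint8_t opcode  = ir >> 4;     /* decode */
        uint8_t operand = ir & 0x0F;

        if (opcode == OP_HALT)  break;                    /* execute */
        if (opcode == OP_PRINT) printf("%u\n", operand);
    }
    return 0;
}
```

If the PC were reused to hold the fetched instruction, the loop above would have nowhere to keep the address it needs for the next fetch.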
P.S. The width of the registers varies depending on the architecture's word size. For example, a 32-bit processor has a 32-bit word size, so the registers on the CPU are 32 bits wide. The instruction register is no different in size; the difference is in its behavior and interpretation. Instructions are encoded in various formats, but they still occupy a 32-bit register. For example, the Nios II processor from Altera has three different instruction types, each encoded differently. See page 6 of ftp://ftp.altera.com/up/pub/Tutorials/DE2/Computer_Organization/tut_nios2_introduction.pdf
You can learn more about the Nios II processor's structure from the link above as well. It is a simple IP CPU. Of course, Intel has its own specification/design, and it will vary.

As you stated, the Program Counter (PC) holds the address of the next instruction to execute, and the Instruction Register (IR) stores the actual instruction to be executed (but not its address).
Regarding the length of these registers, current machines have 64-bit PCs.
The length of the IR (from a logical point of view) depends on the architecture:
RISC machines usually have fixed-length instructions. For example, most SPARC instructions are encoded in 32-bit formats.
CISC machines (Intel, AMD) have variable-length instructions. For example, see the Intel® 64 and IA-32 Architectures Software Developer Manuals.
As these machines are able to fetch, decode and execute several instructions every cycle, the physical implementation of the IR is not easy to describe in a few lines.

Hex file verification inside microcontroller

As we all know, the hex file is the heart of our application code; it is what gets programmed into the microcontroller's flash memory for execution. My question is: before this code is executed, will it be verified by the microcontroller, or will it just execute once all start-up processes have finished?
Disclaimer: Because I don't know all microcontrollers, this is not a complete answer.
The flashed binary executable will just be executed.
Some microcontrollers check for a certain value at a fixed address to decide whether to start the built-in bootloader or a flashed user program.
If you need the user program to be checked, you will need to implement this yourself. I have worked with such systems; it is quite common, especially in safety-related environments.
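If you do implement such a check yourself, one common approach is to let the build system append a CRC to the image and verify it at start-up. Below is a minimal sketch, assuming a hypothetical flash address, image length and CRC placement; real projects fix these in the linker script and append the CRC with a post-build step.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hedged sketch of an application self-check run at start-up.  The flash
   region and the location of the expected CRC are made-up examples. */
#define APP_START   ((const uint8_t *)0x08004000u)  /* example flash address */
#define APP_LENGTH  (60u * 1024u)                   /* example image size    */
#define APP_CRC     (*(const uint32_t *)(APP_START + APP_LENGTH))

static uint32_t crc32(const uint8_t *data, uint32_t len) {
    uint32_t crc = 0xFFFFFFFFu;                     /* standard CRC-32       */
    for (uint32_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

bool application_image_is_valid(void) {
    /* The build system is assumed to place the expected CRC right after the
       application image, so the bootloader can compare against it here. */
    return crc32(APP_START, APP_LENGTH) == APP_CRC;
}
```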
Concerning the format of hex files:
Intel HEX, as well as other formats like SREC, is a human-readable text representation of binary data. The common reason for the checksums in these formats is to ensure data consistency during transmission, which was done via unreliable channels back when the formats were invented.
Another advantage is the limitation to 7-bit ASCII characters, which could be transferred losslessly via old internet protocols.
However, the "real" contents, the binary data, are stored directly in the flash memory of the microcontroller. Checksums might be used by the receiving software (for example, the bootloader) in the microcontroller while the user program is being flashed, but after flashing they are gone.
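For completeness: the per-record checksum in Intel HEX is the two's complement of the sum of the decoded bytes in the record, so a record verifies if all of its bytes (checksum included) sum to zero modulo 256. A small sketch:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* The last byte of an Intel HEX record is the two's complement of the sum of
   all preceding decoded bytes, so the sum of ALL bytes (including the
   checksum) must be 0 modulo 256. */
static int hexbyte(const char *s) {
    unsigned v;
    return sscanf(s, "%2x", &v) == 1 ? (int)v : -1;
}

static int record_checksum_ok(const char *record) {   /* e.g. ":10010000..." */
    if (record[0] != ':') return 0;
    size_t nchars = strlen(record) - 1;                /* hex chars after ':' */
    uint8_t sum = 0;
    for (size_t i = 0; i + 1 < nchars; i += 2)
        sum += (uint8_t)hexbyte(record + 1 + i);       /* includes checksum   */
    return sum == 0;
}

int main(void) {
    const char *rec = ":0300300002337A1E";   /* well-known example record */
    printf("checksum %s\n", record_checksum_ok(rec) ? "OK" : "BAD");
    return 0;
}
```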

RISC-V 32/64-bit compatibility issues

Suppose you take an RV32 program and try running it on a 64-bit system, what compatibility issues are likely to arise?
As I understand it, the instruction encoding is the same, and on RISC-V (like other modern RISC architectures, though unlike x86), ALU operations automatically operate on whatever the word size is, so if you add the contents of a pair of registers, you will get a 32-bit or 64-bit addition as applicable. Load and store of course work on an explicitly specified size because they depend on how many bytes have been allocated in memory.
One theoretically possible compatibility issue would arise if the code depends on bits past 32 being discarded, e.g. add 2^31 to itself and compare the result with zero.
Another more practical issue would arise if the operating system supplies memory addresses outside the first 4 gigabytes, which would be garbled when the code stores the addresses in 32-bit variables.
Are there any other issues I am missing?
You are correct about both of those possible compatibility issues.
Additionally, some Control and Status Registers (namely cycleh, instreth, timeh) are not needed in RV64I and therefore don't exist. Any code which tries to access them should error.
However, RV64I provides instructions that operate on only the lower 32 bits for ALU operations, which could potentially be substituted by rewriting the opcode and funct3 fields in the binary.
So, with an operating-system mode that returns only 32-bit addresses, it would be possible to turn such a binary into a working 64-bit version, as long as cycleh and friends aren't used.
References (RISC-V Specification v2.2):
Chapter 4 of the RISC-V Spec. v2.2 outlines the differences from RV32I to RV64I.
Chapter 2.8 goes over Control and Status Registers
Table 19.3 lists all of the CSRs in the standard.
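To make the address issue from the question concrete, here is a small C sketch. It is not RISC-V-specific; it simply shows the generic hazard of code that stashes pointers in 32-bit variables, which round-trips on an ILP32 environment (such as RV32) but silently truncates on a 64-bit system once allocations land above 4 GiB.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int *p = malloc(sizeof *p);
    if (!p) return 1;
    *p = 42;

    uint32_t stored = (uint32_t)(uintptr_t)p;   /* only the low 32 bits survive */
    int *q = (int *)(uintptr_t)stored;          /* may no longer equal p        */

    printf("pointers %s\n", (q == p) ? "match" : "were truncated");
    /* Dereferencing q when the pointers don't match would be undefined
       behaviour -- exactly the kind of breakage the question anticipates. */
    free(p);
    return 0;
}
```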

How are branch mispredictions handled before a hardware interrupt

A hardware interrupt arrives on a particular vector (not masked); the CPU checks the IF flag and pushes RFLAGS, CS and RIP onto the stack. Meanwhile, instructions are still completing in the back end, and one of those instructions' branch predictions turns out to be wrong. Usually the pipeline would be flushed and the front end would start fetching from the correct address, but in this scenario an interrupt is in progress.
When an interrupt occurs, what happens to instructions in the pipeline?
I have read this, and clearly one solution is to immediately flush everything from the pipeline so that this doesn't occur, and then generate the instructions to push RFLAGS, CS and RIP to the kernel stack located via the TSS. However, the question arises: how does the CPU know the (CS:)RIP associated with the most recent architectural state in order to push it onto the stack (given that the front-end RIP would now be ahead)? This is similar to the question of how the taken-branch execution unit on port 0 knows the (CS:)RIP of what should have been fetched when the taken prediction turns out to be wrong -- is the address encoded into the instruction along with the prediction? The same issue arises with a trap/exception: the CPU needs to push the address of the current instruction (fault) or the next instruction (trap) onto the kernel stack, but how does it work out the address of this instruction when it is halfway down the pipeline? This leads me to believe that the address must be encoded into the instruction and worked out using the length information, and that this is possibly all done at the predecode stage.
The CPU will presumably discard the contents of the ROB, rolling back to the latest retirement state before servicing the interrupt.
An in-flight branch miss doesn't change this. Depending on the CPU (older / simpler), it might have already been in the process of rolling back to retirement state and flushing because of a branch miss, when the interrupt arrived.
As @Hadi says, the CPU could choose at that point to retire the branch (with the interrupt pushing a CS:RIP pointing to the correct branch target), instead of leaving it to be re-executed after returning from the interrupt.
But that only works if the branch instruction was already ready to retire: there were no instructions older than the branch still not executed. Since it's important to discover branch misses as early as possible, I assume branch recovery starts when it discovers a mispredict during execution, not waiting until it reaches retirement. (This is unlike other kinds of faults: e.g. Meltdown and L1TF are based on a faulting load not triggering #PF fault handling until it reaches retirement so the CPU is sure there really is a fault on the true path of execution. You don't want to start an expensive pipeline flush until you're sure it wasn't in the shadow of a mispredict or earlier fault.)
But since branch misses don't take an exception, redirecting the front-end can start early before we're sure that the branch instruction is part of the right path in the first place.
e.g. cmp byte [cache_miss_load], 123 / je mispredicts but won't be discovered for a long time. Then in the shadow of that mispredict, a cmp eax, 1 / je on the "wrong" path runs and a mispredict is discovered for it. With fast recovery, uops past that are flushed and fetch/decode/exec from the "right" path can start before the earlier mispredict is even discovered.
To keep IRQ latency low, CPUs don't tend to give in-flight instructions extra time to retire. Also, any retired stores that still have their data in the store buffer (not yet committed to L1d) have to commit before any stores by the interrupt handler can commit. But interrupts are serializing (I think), and any MMIO or port-IO in a handler will probably involve a memory barrier or strongly-ordered store, so letting more instructions retire can hurt IRQ latency if they involve stores. (Once a store retires, it definitely needs to happen even while its data is still in the store buffer).
The out-of-order back-end always knows how to roll back to a known-good retirement state; the entire contents of the ROB are always considered speculative because any load or store could fault, and so can many other instructions¹. Speculation past branches isn't super-special.
Branches are only special in having extra tracking for fast recovery (the Branch Order Buffer in Nehalem and newer) because they're expected to mispredict with non-negligible frequency during normal operation. See What exactly happens when a skylake CPU mispredicts a branch? for some details. Especially David Kanter's quote:
Nehalem enhanced the recovery from branch mispredictions, which has been carried over into Sandy Bridge. Once a branch misprediction is discovered, the core is able to restart decoding as soon as the correct path is known, at the same time that the out-of-order machine is clearing out uops from the wrongly speculated path. Previously, the decoding would not resume until the pipeline was fully flushed.
(This answer is intentionally very Intel-centric because you tagged it intel, not x86. I assume AMD does something similar, and probably most out-of-order uarches for other ISAs are broadly similar. Except that memory-order mis-speculation isn't a thing on CPUs with a weaker memory model where CPUs are allowed to visibly reorder loads.)
Footnote 1: So can div, or any FPU instruction if FP exceptions are unmasked. And a denormal FP result could require a microcode assist to handle, even with FP exceptions masked like they are by default.
On Intel CPUs, a memory-order mis-speculation can also result in a pipeline nuke (load speculatively done early, before earlier loads complete, but the cache lost its copy of the line before the x86 memory model said the load could take its value).
In general, each entry in the reorder buffer (ROB) has a field that stores enough information about the instruction address to reconstruct the whole instruction address unambiguously. It may be too costly to store the whole address for each instruction in the ROB. Instructions that have not yet been allocated (i.e., have not yet passed the allocation stage of the pipeline) need to carry this information with them at least until they reach the allocation stage.
If an interrupt and a branch misprediction occur at the same time, the processor may, for example, choose to service the interrupt. In this case, all the instructions that are on the mispredicted path need to be flushed. The processor may also choose to flush other instructions that are on the correct path but have not yet retired. All of these instructions are in the ROB and their instruction addresses are known. For each speculated branch, there is a tag that identifies all instructions on that speculated path, and all instructions on this path are tagged with it. If there is another, later speculated branch, another tag is used, but it is also ordered with respect to the previous tag. Using these tags, the processor can determine exactly which instructions to flush when any of the speculated branches turns out to be incorrect. This is determined after the corresponding branch instruction completes execution in the branch execution unit. Branches may complete execution out of order. When the correct address of a mispredicted branch is calculated, it is forwarded to the fetch unit and the branch prediction unit (BPU). The fetch unit uses it to fetch instructions from the correct path, and the BPU uses it to update its prediction state.
The processor can choose to retire the mispredicted branch instruction itself and flush all other later instructions. All rename registers are reclaimed and those physical registers that are mapped to architectural registers at the point the branch is retired are retained. At this point, the processor executes instructions to save the current state and then begins fetching instructions of the interrupt handler.
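As a purely conceptual illustration of the tagging idea described above (this is not how any real core stores this state; the structure names, field sizes and tag policy below are all invented), a ROB with per-uop path tags might be modeled like this:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Toy model: every ROB entry records something from which its instruction
   address can be reconstructed, plus the tag of the speculated path it
   belongs to, so a mispredict can flush exactly the wrong-path entries. */
#define ROB_SIZE 16

typedef struct {
    bool     valid;
    uint64_t rip;          /* enough info to reconstruct the instruction address */
    int      branch_tag;   /* tag of the speculated path, or -1 if the uop is
                              not control-dependent on an unresolved branch   */
} rob_entry;

static rob_entry rob[ROB_SIZE];

/* Tags are allocated in program order in this toy model, so ">= tag" means
   "on or after the mispredicted branch's speculated path". */
static void flush_from_tag(int tag) {
    for (int i = 0; i < ROB_SIZE; i++)
        if (rob[i].valid && rob[i].branch_tag >= tag)
            rob[i].valid = false;
}

int main(void) {
    rob[0] = (rob_entry){ true, 0x1000, -1 };  /* older, non-speculative work  */
    rob[1] = (rob_entry){ true, 0x1004,  0 };  /* on the path of branch tag 0  */
    rob[2] = (rob_entry){ true, 0x1008,  1 };  /* on a younger speculated path */

    flush_from_tag(0);   /* branch with tag 0 turned out to be mispredicted */

    for (int i = 0; i < 3; i++)
        printf("entry %d (rip=0x%llx): %s\n", i,
               (unsigned long long)rob[i].rip,
               rob[i].valid ? "kept" : "flushed");
    return 0;
}
```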

How does the host send OpenCL kernels and arguments to the GPU at the assembly level?

So you write a kernel and compile it. You create cl_buffers for the arguments and then bind the two together with clSetKernelArg.
You then enqueue the kernel to run and read back the buffer.
Now, how does the host program tell the GPU which instructions to run? For example, I'm on a 2017 MBP with a Radeon Pro 460. At the assembly level, what instructions are executed in the host process to tell the GPU "here's what you're going to run"? What mechanism lets the cl_buffers be read by the GPU?
In fact, if you can point me to a detailed explanation of all of this, I'd be quite pleased. I'm a toolchain engineer and I'm curious about the toolchain aspects of GPU programming, but I'm finding it incredibly hard to find good resources on it.
It pretty much all runs through the GPU driver. The kernel/shader compiler, etc. tend to live in a user space component, but when it comes down to issuing DMAs, memory-mapping, and responding to interrupts (GPU events), that part is at least to some extent covered by the kernel-based component of the GPU driver.
A very simple explanation is that the kernel compiler generates a GPU-model-specific code binary, this gets uploaded to VRAM via DMA, and then a request is added to the GPU's command queue to run a kernel with reference to the VRAM address where that kernel is stored.
With regard to OpenCL memory buffers, there are essentially 3 ways I can think of that this can be implemented:
A buffer is stored in VRAM, and when the CPU needs access to it, that range of VRAM is mapped onto a PCI BAR, which can then be memory-mapped by the CPU for direct access.
The buffer is stored entirely in System RAM, and when the GPU accesses it, it uses DMA to perform read and write operations.
Copies of the buffer are stored both in VRAM and system RAM; the GPU uses the VRAM copy and the CPU uses the system RAM copy. Whenever one processor needs to access the buffer after the other has made modifications to it, DMA is used to copy the newer copy across.
On GPUs with UMA (Intel IGP, AMD APUs, most mobile platforms, etc.) VRAM and system RAM are the same thing, so they can essentially use the best bits of methods 1 & 2.
If you want to take a deep dive on this, I'd say look into the open source GPU drivers on Linux.
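Loosely speaking, the three strategies above line up with hints an application can give when creating a buffer; which strategy a driver actually picks is entirely implementation-defined. A minimal sketch, assuming an already-created cl_context:

```c
#include <CL/cl.h>   /* on macOS the header is <OpenCL/opencl.h> */

/* The flags below are standard OpenCL, but whether a given driver implements
   them as "VRAM + BAR mapping", "system RAM + DMA", or "mirrored copies"
   (the three options above) is up to the implementation. */
void create_buffers(cl_context ctx, size_t size, void *host_mem) {
    cl_int err;

    /* Let the driver place the buffer wherever it likes (often VRAM). */
    cl_mem dev_pref = clCreateBuffer(ctx, CL_MEM_READ_WRITE, size, NULL, &err);

    /* Ask the driver to use the application's own allocation (often system
       RAM that the GPU reaches via DMA, possibly pinned and mapped). */
    cl_mem use_host = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                     size, host_mem, &err);

    /* Initialize a device-side buffer from host memory at creation time;
       the driver may keep copies on both sides and sync them as needed. */
    cl_mem copy_host = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                      size, host_mem, &err);

    (void)err;
    clReleaseMemObject(dev_pref);
    clReleaseMemObject(use_host);
    clReleaseMemObject(copy_host);
}
```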
Enqueueing the kernel means asking the OpenCL driver to submit work to dedicated HW for execution. In OpenCL, for example, you would call the clEnqueueNativeKernel API, which adds the dispatch-compute-workload command to the command queue (cl_command_queue).
From the spec:
The command-queue can be used to queue a set of operations (referred to as commands) in order.
https://www.khronos.org/registry/OpenCL/specs/2.2/html/OpenCL_API.html#_command_queues
Next, the implementation of this API will trigger the HW to process the commands recorded into a command queue (which holds the actual commands in the format that the particular HW understands). The HW might have several queues and process them in parallel. In any case, after the workload from a queue is processed, the HW will inform the KMD (kernel-mode driver) via an interrupt, and the KMD is responsible for propagating this update to the OpenCL driver via the OpenCL event mechanism, which allows the user to track workload execution status - see https://www.khronos.org/registry/OpenCL/specs/2.2/html/OpenCL_API.html#clWaitForEvents.
To get a better idea of how an OpenCL driver interacts with the HW, you could take a look at an open-source implementation; see:
https://github.com/pocl/pocl/blob/master/lib/CL/clEnqueueNativeKernel.c
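To tie the host-side steps from the question together, here is a minimal sketch of the API sequence (error handling omitted). It uses clEnqueueNDRangeKernel, the usual call for OpenCL C kernels, and a made-up one-line kernel; clBuildProgram is where the driver's compiler turns the source into GPU-specific code.

```c
#include <CL/cl.h>   /* on macOS: <OpenCL/opencl.h> */
#include <stdio.h>

static const char *src =
    "__kernel void add1(__global int *buf) {"
    "  size_t i = get_global_id(0);"
    "  buf[i] += 1;"
    "}";

int main(void) {
    enum { N = 16 };
    int data[N] = {0};

    cl_platform_id platform;  clGetPlatformIDs(1, &platform, NULL);
    cl_device_id device;      clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);   /* compile to GPU ISA */
    cl_kernel k = clCreateKernel(prog, "add1", NULL);

    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                sizeof data, data, NULL);
    clSetKernelArg(k, 0, sizeof buf, &buf);

    size_t global = N;
    cl_event done;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, &done);
    clWaitForEvents(1, &done);                        /* driver signals completion */
    clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof data, data, 0, NULL, NULL);

    printf("data[0] = %d\n", data[0]);                /* expect 1 */

    clReleaseEvent(done); clReleaseMemObject(buf); clReleaseKernel(k);
    clReleaseProgram(prog); clReleaseCommandQueue(q); clReleaseContext(ctx);
    return 0;
}
```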

On what parameters does the boot sequence vary?

Does every Unix flavor have the same boot sequence code? I mean, there are different kernel releases going on for different flavors, so is it possible that the code for the boot sequence differs once the kernel is loaded? Or do they always keep their boot sequence (or code) common?
Edit: I want to know in detail how the boot process is done.
Where does the MBR find GRUB? How is this information stored? Is it hard-coded by default?
Is there any block-level partition architecture defined for the boot sequence?
How does GRUB locate the kernel image? Is there a common location where the kernel image is stored?
I searched a lot on the web, but it only shows the common architecture: BIOS -> MBR -> GRUB -> Kernel -> Init.
I want to know the details of everything. What should I do to learn all this? Is there any way I could debug the boot process?
Thanks in advance!
First of all, the boot process is extremely platform and kernel dependent.
The point is normally to get the kernel image loaded somewhere in memory and run it, but the details may differ:
where do I get the kernel image? (file on a partition? fixed offset on the device? should I just map a device in memory?)
what should be loaded? (only a "core" image? also a ramdisk with additional data?)
where should it be loaded? Is additional initialization (CPU/MMU status, device initialization, ...) required?
are there kernel parameters to pass? Where should they be put for the kernel to see?
where is the configuration for the bootloader itself stored (hard-coded, files on a partition, ...)? How to load the additional modules? (bootloaders like GRUB are actually small OSes by themselves)
Different bootloaders and OSes may do this stuff differently. The "UNIX-like" bit is not relevant, an OS starts being ostensibly UNIXy (POSIX syscalls, init process, POSIX userland,...) mostly after the kernel starts running.
Even on common x86 PCs the start differs deeply between "traditional BIOS" and UEFI mode (in this last case, the UEFI itself can load and start the kernel, without additional bootloaders being involved).
Coming down to the start of a modern Linux distribution on x86 in BIOS mode with GRUB2, the basic idea is to quickly get up and running a system which can deal with "normal" PC abstractions (disk partitions, files on filesystems, ...), keeping at minimum the code that has to deal with hardcoded disk offsets.
1. GRUB is not a monolithic program, but is composed of stages. When booting, the BIOS loads and executes the code stored in the MBR, which is the first stage of GRUB. Since the amount of code that can be stored there is extremely limited (a few hundred bytes), all this code does is act as a trampoline for the next GRUB stage (somehow, it "boots GRUB");
2. the MBR code contains, hard-coded, the address of the first sector of the "core image"; this sector, in turn, contains the code to load the rest of the core image from disk (again, hard-coded as a list of disk sectors) -- see the sketch after this answer for a way to peek at the MBR layout yourself;
3. once the core image is loaded, the ugly work is done, since the GRUB core image normally contains basic file system drivers, so it can load additional configuration and modules from regular files on the boot partition;
4. now what happens depends on the configuration of the specific boot entry; for booting Linux, usually there are two files involved: the kernel image and the initrd:
initrd contains the "initial ramdrive", a barebones userland mounted as / in the early boot process (before the kernel has mounted the filesystems); it mostly contains device detection helpers, device drivers, filesystem drivers, ... to allow the kernel to load on demand the code needed to mount the "real" root partition;
the kernel image is a (usually compressed) executable image in some format, which contains the actual kernel code; the bootloader extracts it in memory (following some rules), puts the kernel parameters and the initrd memory position in some memory location, and then jumps to the kernel entry point, whence the kernel takes over the boot process;
5. from there, the "real" Linux boot process starts, which normally involves loading device drivers, starting init, mounting disks and so on.
Again, this is all (x86, BIOS, Linux, GRUB2)-specific; points 1-2 are different on architectures without an MBR, and are skipped completely if GRUB is loaded straight from UEFI; 1-3 are different/avoided if UEFI (or some other loader) is used to load the kernel image directly. The initrd may not be involved if the kernel image already bundles all that is needed to start (typical of embedded images); the details of points 4-5 differ between OSes (although the basic idea is usually similar). And, on embedded machines, the kernel may be placed directly at a "magic" location that is automatically mapped in memory and run at start.
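If you want to poke at points 1-2 yourself, the MBR is just the first 512 bytes of the disk. Here is a small sketch for dumping its layout; the path is only an example, and reading a real device requires sufficient privileges.

```c
#include <stdint.h>
#include <stdio.h>

int main(int argc, char **argv) {
    const char *path = (argc > 1) ? argv[1] : "disk.img";  /* or e.g. /dev/sda */
    uint8_t mbr[512];

    FILE *f = fopen(path, "rb");
    if (!f || fread(mbr, 1, sizeof mbr, f) != sizeof mbr) {
        fprintf(stderr, "cannot read 512-byte MBR from %s\n", path);
        return 1;
    }
    fclose(f);

    /* Offsets 510-511 hold the 0x55 0xAA boot signature; bytes 0-445 hold the
       boot code (GRUB's first stage on a GRUB-booted BIOS disk); bytes
       446-509 hold the four 16-byte partition table entries. */
    printf("boot signature: %s\n",
           (mbr[510] == 0x55 && mbr[511] == 0xAA) ? "present" : "missing");
    for (int i = 0; i < 4; i++) {
        const uint8_t *e = mbr + 446 + 16 * i;
        printf("partition %d: type 0x%02X, %s\n",
               i + 1, e[4], (e[0] & 0x80) ? "bootable" : "not bootable");
    }
    return 0;
}
```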

Resources