Does a program use the same CPU registers every time it is run? - cpu-registers

When a program is run, it uses various registers (eax, ebx, etc.) to store and move data.
Does a program use the same registers every time it is run?
Can the registers it does or does not use be found?

If it is compiled to machine code, the registers it uses are fixed in the binary at compile time, so it will use the same registers every run. If it is interpreted or compiled to byte code (think Java or C#), it can use different registers on each run.
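To answer the second part of the question: for a natively compiled program you can simply inspect the generated machine code. A minimal sketch in C (the file name and commands are only an illustration, assuming gcc and binutils are installed):

    /* regs.c - a hypothetical toy program for inspecting register usage */
    #include <stdio.h>

    int add(int a, int b)
    {
        return a + b;   /* the compiler picks registers for a, b and the result */
    }

    int main(void)
    {
        printf("%d\n", add(2, 3));
        return 0;
    }

Compile it and look at the register names that appear in the output, for example with "gcc -O2 -S regs.c -o regs.s" (compiler-generated assembly) or "objdump -d ./a.out" (disassembly of the final binary). The register choices are baked into the machine code, so the same ones show up no matter how many times you run the program.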

Related

How does the host send OpenCL kernels and arguments to the GPU at the assembly level?

So you get a kernel and compile it. You set the cl_buffers for the arguments and then clSetKernelArg the two together.
You then enqueue the kernel to run and read back the buffer.
Now, how does the host program tell the GPU the instructions to run? E.g. I'm on a 2017 MBP with a Radeon Pro 460. At the assembly level, what instructions are called in the host process to tell the GPU "here's what you're going to run"? What mechanism lets the cl_buffers be read by the GPU?
In fact, if you can point me to an in-depth explanation of all of this I'd be quite pleased. I'm a toolchain engineer and I'm curious about the toolchain aspects of GPU programming, but I'm finding it incredibly hard to find good resources on it.
It pretty much all runs through the GPU driver. The kernel/shader compiler, etc. tend to live in a user space component, but when it comes down to issuing DMAs, memory-mapping, and responding to interrupts (GPU events), that part is at least to some extent covered by the kernel-based component of the GPU driver.
A very simple explanation is that the kernel compiler generates a GPU-model-specific code binary, this gets uploaded to VRAM via DMA, and then a request is added to the GPU's command queue to run a kernel with reference to the VRAM address where that kernel is stored.
With regard to OpenCL memory buffers, there are essentially 3 ways I can think of that this can be implemented:
1. A buffer is stored in VRAM, and when the CPU needs access to it, that range of VRAM is mapped onto a PCI BAR, which can then be memory-mapped by the CPU for direct access.
2. The buffer is stored entirely in system RAM, and when the GPU accesses it, it uses DMA to perform read and write operations.
3. Copies of the buffer are stored both in VRAM and system RAM; the GPU uses the VRAM copy and the CPU uses the system RAM copy. Whenever one processor needs to access the buffer after the other has made modifications to it, DMA is used to copy the newer version across.
On GPUs with UMA (Intel IGP, AMD APUs, most mobile platforms, etc.) VRAM and system RAM are the same thing, so they can essentially use the best bits of methods 1 & 2.
If you want to take a deep dive on this, I'd say look into the open source GPU drivers on Linux.
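To connect this to the OpenCL API: from the host side you can only hint at which of these methods the driver uses, typically via the flags passed to clCreateBuffer. A rough sketch (the mapping of flags to placement is driver-dependent, and ctx is assumed to be an already-created cl_context):

    #include <CL/cl.h>
    #include <stdlib.h>

    void create_buffers_example(cl_context ctx, size_t size)
    {
        cl_int err;

        /* Plain device buffer: usually ends up in VRAM (methods 1/3). */
        cl_mem dev_buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, size, NULL, &err);

        /* Request host-accessible memory: often pinned system RAM that the
           GPU reaches via DMA (method 2). */
        cl_mem host_buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                         size, NULL, &err);

        /* Wrap existing host memory: the driver may keep a shadow copy in VRAM
           and synchronise the two copies on demand (method 3). */
        void *user_mem = malloc(size);
        cl_mem wrapped = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                        size, user_mem, &err);

        clReleaseMemObject(dev_buf);
        clReleaseMemObject(host_buf);
        clReleaseMemObject(wrapped);
        free(user_mem);
    }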
"Enqueue the kernel" means asking an OpenCL driver to submit work to dedicated HW for execution. In OpenCL, for example, you would call the clEnqueueNativeKernel API, which adds the dispatch compute workload command to the command queue - cl_command_queue.
From the spec:
The command-queue can be used to queue a set of operations (referred to as commands) in order.
https://www.khronos.org/registry/OpenCL/specs/2.2/html/OpenCL_API.html#_command_queues
Next, the implementation of this API will trigger the HW to process commands recorded in the command queue (which holds all actual commands in the format the particular HW understands). The HW might have several queues and process them in parallel. In any case, after the workload from a queue is processed, the HW informs the KMD (kernel-mode driver) via an interrupt, and the KMD is responsible for propagating this update to the OpenCL driver via the OpenCL event mechanism, which allows the user to track workload execution status - see https://www.khronos.org/registry/OpenCL/specs/2.2/html/OpenCL_API.html#clWaitForEvents.
To get a better idea of how the OpenCL driver interacts with the HW, you could take a look at an open-source implementation, see:
https://github.com/pocl/pocl/blob/master/lib/CL/clEnqueueNativeKernel.c
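For the common case of an OpenCL C kernel (rather than a native function), the dispatch command is recorded with clEnqueueNDRangeKernel, and the completion path described above (HW interrupt -> KMD -> OpenCL event) is what clWaitForEvents ends up waiting on. A minimal host-side sketch, assuming the context, queue, kernel and buffer have already been created elsewhere:

    #include <CL/cl.h>

    void run_kernel_once(cl_command_queue queue, cl_kernel kernel,
                         cl_mem arg_buf, size_t global_size)
    {
        cl_event done;

        /* Bind the buffer to kernel argument 0. */
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &arg_buf);

        /* Record the "dispatch compute workload" command in the queue. */
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                               0, NULL, &done);

        /* Block until the HW has processed the command; completion is
           reported back through the OpenCL event mechanism. */
        clWaitForEvents(1, &done);
        clReleaseEvent(done);
    }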

Which command_queue to pass to clEnqueueCopyBuffer when launching kernels simultaneously?

So I am implementing a Kmeans clustering algorithm with OpenCL that uses channels: a feature from Intel's FPGA SDK for OpenCL.
To keep it succinct, this means I have two kernels that have to be enqueued on different command queues so they run simultaneously. I want to copy the cl_mem buffer from one kernel to the other every iteration (it's for the 4 clusters, so on the small side), part of which requires me to call clEnqueueCopyBuffer. This requires passing the function a command queue, but I don't know whether it wants the queue of the kernel the buffer is being copied from or the queue of the kernel it is being copied to.
This is all the OpenCL Specification says for the command_queue parameter:
The command-queue in which the copy command will be queued. The OpenCL context associated with command_queue, src_buffer, and dst_buffer must be the same.
I can confirm these kernels are in fact in the same context.
You could use either command queue, but you need to get an event from the copy operation and pass it to the kernel enqueue on the other command queue. Otherwise the kernel might start before the copy finishes.
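A sketch of that event-based ordering (queue, kernel and buffer names are placeholders for objects created elsewhere; either queue could host the copy):

    #include <CL/cl.h>

    void copy_then_launch(cl_command_queue queue_a, cl_command_queue queue_b,
                          cl_mem src, cl_mem dst, cl_kernel kernel,
                          size_t bytes, size_t global_size)
    {
        cl_event copy_done;

        /* Enqueue the copy on one of the two queues and capture its event. */
        clEnqueueCopyBuffer(queue_a, src, dst, 0, 0, bytes,
                            0, NULL, &copy_done);

        /* The kernel on the other queue must not start before the copy has
           finished, so put the copy event in its wait list. */
        clEnqueueNDRangeKernel(queue_b, kernel, 1, NULL, &global_size, NULL,
                               1, &copy_done, NULL);

        clReleaseEvent(copy_done);
    }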

How many registers, and what kind of registers, are available for the register storage class in C

The register storage class is used to access a variable quickly; the variable is kept in a CPU register rather than in memory. But the number of registers in the CPU is limited. I use an Intel Core i5-4260U processor. I visited Intel's website for the details of the registers, but I couldn't find any specification of how many registers the CPU contains (to visit the website click here). Even if I could find the number of registers (from "How many registers are there in 8086/8088?"), I couldn't figure out how many of these are used by C storage classes.
But I couldn't find any specification of how many registers the CPU contains
Just look for "ia32 programming model" or "amd64 programming model".
I couldn't figure out how many of these are used by C storage classes.
That is implementation-dependent. A compiler can even ignore the register keyword. Some compilers do automatic register mapping when invoked with a high optimization level, regardless of how the variable has been declared.
For example: the programming model for user-mode applications on IA32 is composed of the registers EAX, EBX, ECX, EDX, ESI, EDI, EBP, ESP and EIP. EAX and EDX are used as accumulators: they are implicit operands for some instructions (MUL, DIV) and they hold the return value of a function. EBP and ESP are reserved for stack and frame management. EIP is the instruction pointer. So this leaves us with EBX, ECX, EDI and ESI for register mapping. Depending on the code generated, one or more of these registers may be needed, further reducing the number of registers available for mapping variables.
The register keyword in C was included because, when C was created, compilers did not always do a good job of register allocation. Register allocation is the part of the compiler which maps program variables to CPU registers.
Nowadays, the algorithms compilers use for register allocation are on the whole excellent. So much so that compilers often ignore the register keyword, reasoning that the compiler knows better than the programmer how to map variables to registers to maximize performance.
I'm not sure what compiler 'mcleod_ideafix' is referring to when he writes that EAX and EDX are not available for register allocation. The gcc compiler uses 6 integer registers in 32-bit x86 code (EAX, EBX, ECX, EDX, ESI, and EDI). It will even use EBP if the function does not make any function calls and you give the proper compiler option. 64-bit mode adds 8 more registers, R8 through R15. If you are using gcc, just compile your file with the -S option and then look at the generated code to see what registers are used.
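A quick way to try this yourself (reg_demo.c is just a hypothetical example file):

    /* reg_demo.c - does the register keyword change anything? */
    long sum_to(long n)
    {
        register long i, total = 0;   /* "register" is only a request */
        for (i = 0; i < n; i++)
            total += i;
        return total;
    }

Compile with "gcc -O0 -S reg_demo.c -o reg_demo_O0.s" and again with "gcc -O2 -S reg_demo.c -o reg_demo_O2.s", then compare the two files: at -O2 the loop variables live in registers whether or not the keyword is present, which is exactly the "compiler knows better" behavior described above.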
Another thing to consider is that Intel processors use a feature called register renaming to reduce the performance penalty of not having enough registers.

Program Counter and Instruction Register

The program counter holds the address of the instruction that should be executed next, while the instruction register holds the actual instruction to be executed. Wouldn't one of them be enough?
And what is the length of each of these registers?
Thanks.
You will always need both. The program counter (PC) holds the address of the next instruction to be executed, while the instruction register (IR) holds the encoded instruction. Upon fetching the instruction, the program counter is incremented by one "address value" (to the location of the next instruction). The instruction is then decoded and executed appropriately.
The reason why you need both is that if you only had a program counter and used it for both purposes, you would get the following troublesome system:
[Beginning of program execution]
1. The PC contains 0x00000000 (say this is the start address of the program in memory).
2. The encoded instruction is fetched from memory and placed into the PC.
3. The instruction is decoded and executed.
4. Now it is time to move on to the next instruction, so we go back to the PC to see what the address of that instruction is. However, we have a problem: the PC's previous contents (the address) were overwritten, so we have no idea where the next instruction is.
Therefore, we need another register to hold the actual instruction fetched from memory. Once we have fetched that instruction, we increment the PC so that we know where to fetch the next one.
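A toy fetch/decode/execute loop (not modelling any real ISA; the opcodes are made up) makes the separation concrete: pc only ever holds addresses, while ir holds the instruction word currently being executed.

    #include <stdint.h>
    #include <stdio.h>

    #define NOP  1u   /* made-up opcodes for illustration */
    #define HALT 0u

    int main(void)
    {
        uint32_t memory[] = { NOP, NOP, NOP, HALT };  /* the "program" */
        uint32_t pc = 0;   /* program counter: address of the next instruction */
        uint32_t ir;       /* instruction register: the fetched instruction */

        for (;;) {
            ir = memory[pc];   /* fetch: copy the instruction into the IR */
            pc = pc + 1;       /* the PC now points at the next instruction */
            if (ir == HALT)    /* decode/execute works on the IR, not the PC */
                break;
            printf("executed NOP from address %u\n", (unsigned)(pc - 1));
        }
        return 0;
    }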
P.S. The width of the registers varies depending on the architecture's word size. For example, for a 32-bit processor the word size is 32 bits, so the registers on the CPU are 32 bits wide. Instruction registers are no different in size; the difference is in behavior and interpretation. Instructions are encoded in various forms, yet they still occupy a 32-bit register. For example, the Nios II processor from Altera has 3 different instruction types, each encoded differently. See page 6 of ftp://ftp.altera.com/up/pub/Tutorials/DE2/Computer_Organization/tut_nios2_introduction.pdf
You can learn more about the Nios II processor's structure from the link above as well. It is a simple IP CPU. Of course Intel has their own specification/design and it will vary.
As you stated, the Program Counter (PC) holds the address of the next instruction to execute, and the Instruction Register (IR) stores the actual instruction to be executed (but not its address).
Regarding the length of these registers: current machines have 64-bit PCs.
The length of the IR (from a logical point of view) depends on the architecture:
RISC machines usually have fixed-length instructions. For example, most SPARC instructions are encoded in 32-bit formats.
CISC machines (Intel, AMD) have variable-length instructions. For example, see the Intel® 64 and IA-32 Architectures Software Developer Manuals.
As these machines are able to fetch, decode and execute several instructions every cycle, the physical implementation of the IR is not easy to describe in a few lines.

POSIX Threads: are pthread_cond_wait() and others system calls?

The POSIX standard defines several routines for thread synchronization, based on concepts like mutexes and condition variables.
My question is: are these (e.g. pthread_cond_init(), pthread_mutex_init(), pthread_mutex_lock(), and so on) system calls or just library calls? I know they are included via "pthread.h", but do they ultimately result in a system call, and are they therefore implemented in the kernel of the operating system?
On Linux a pthread mutex makes a "futex" system call, but only if the lock is contended. That means that taking a lock no other thread wants is almost free.
In a similar way, sending a condition signal is only expensive when there is someone waiting for it.
So I believe that your answer is that pthread functions are library calls that sometimes result in a system call.
Whenever possible, the library avoids trapping into the kernel for performance reasons. If you already have some code that uses these calls you may want to take a look at the output from running your program with strace to better understand how often it is actually making system calls.
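A small experiment that shows this (mutex_demo.c is just an example name; build with "gcc mutex_demo.c -o mutex_demo -lpthread" and run under "strace -c ./mutex_demo"): with no contention, the lock/unlock pair usually stays entirely in user space, so few or no futex() calls show up in the strace summary.

    #include <pthread.h>
    #include <stdio.h>

    int main(void)
    {
        pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

        /* No other thread ever competes for m, so the fast path is taken. */
        for (int i = 0; i < 1000000; i++) {
            pthread_mutex_lock(&m);
            pthread_mutex_unlock(&m);
        }
        puts("done");
        return 0;
    }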
I never looked into all those library calls, but as far as I understand they all involve kernel operations, as they are supposed to provide synchronisation between processes and/or threads at a global level - I mean at the OS level.
The kernel needs to maintain, for a mutex for instance, a list of the threads that are currently sleeping, waiting for that locked mutex to be released. When the thread that currently owns the mutex invokes the kernel via pthread_mutex_unlock(), the kernel system call browses that list to find the highest-priority thread waiting for the mutex release, flags the new owner in the mutex kernel structure, and then gives away the CPU (a "context switch") to the newly chosen owner thread, which will then return from the POSIX library call pthread_mutex_lock().
I especially see cooperation with the kernel when it involves IPC between processes (I am not talking about threads within a single process). Therefore I expect those library calls to invoke the kernel.
When you compile a program on Linux that uses pthreads, you have to add -lpthread to the compiler options. By doing this, you tell the linker to link libpthread. So, on Linux, they are calls to a library.

Resources