Stopping runaway OpenCL kernel - opencl

I accidentally wrote a while loop that would never break in a kernel and I sent this to the GPU. After 30 seconds my screens started flickering, I realised what I had done and terminated the application by force. The problem is that I had to shut down the computer afterwards to make sure the kernels were gone. Therefore my questions are:
If I forcefully terminate the program (the program that's launching the kernels) without it freeing the GPU resources (freeing buffers, queues, kernels, CL.destroying) will the kernels still run?
If they are still running can I do anything to stop them? Say, like, release resources I don't have a handle to any more.

If you are using an NVIDIA card, then by terminating the application you will eventually free the resources on the card to allow it to run again. This is because NVIDIA has a watchdog monitor on the device (which you can turn off).
If you are using an AMD card, you are out of luck AFAIK and will have to restart the machine after every crash.

Related

Which program takes care of loading from Flash to RAM and running the program on a bare-metal microcontroller?

The program is written in C, compiled on some other computer/IDE (cross-compiled), and then loaded as binary data into the flash memory of the controller.
What am I not understanding about bare metal / no RTOS?
Which program/code takes care of loading from flash to RAM?
Does the RAM in a microcontroller have the intelligence/program to understand the binary, or is the intelligence added to the binary file by the compiler at compile time?
Ideally your program runs from flash, not ram. Many mcus can also run code from ram; whether that is supported is primarily an architectural limit. In a pinch you can run code from ram if you need a trampoline to reprogram the flash, as when downloading new firmware in the field (for a chip with only one flash bank that can't be executed from and erased/modified at the same time), or for performance. Small sections in ram are fine, but if the whole app has to be in ram for reasons other than development, you need to rethink your system design.
You can easily wrap your program with a small copy-to-ram bit of code, so that the mcu boots into the copy-and-jump code and the main application then runs in ram. That is your choice, and it is somewhat trivial: just a few lines of code. Whether you can handle interrupts in that situation, and how you need to design for it, is chip/architecture dependent (it may take more than a copy and jump; for example, you might need handlers in flash that hop over to ram too).
There is no magic here. The mcu processor is no different from others: you need some non-volatile way to get the program in there. Like most other cpus, your processor boots from a rom/flash, then works toward the final application as desired, be it an operating system or not. For an mcu the typical approach is to boot right into the application, running it from flash for the read-only items (.text and .rodata) with the read-write items (.data, .bss) in ram. This is handled by knowing how to use your toolchain, which is a critical part of bare-metal success.
CPUs generally don't care about flash, ram, or peripherals; they are just addresses. The cpu is very, very dumb. You the programmer are smart: you lay the tracks down for the cpu to follow, and the instructions have to follow the rules and guide the processor. The processor starts in a well-known way at a well-known address or vector table; from there it is all on you to keep the processor on track by working within the address space where there are resources: flash, ram and peripherals. The processor may or may not have rules on the address space it can fetch/execute from, depending on the implementation. For implementations where the executable address space covers both flash and ram, yes, you can simply place code in ram and execute it.
Running code in ram on an mcu is the exception not the rule.
Commonly a microcontroller does not load the (single) program into RAM. Instead it is run "in-place" in the (flash or any other non-volatile) memory. The program is built so that the memory at the (fixed) start address contains the startup code of the program.
Having said that you might wonder how (static) variables are initialized with zero and non-zero values. That is done by the startup code linked in when the program is built.
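As a rough illustration of what such startup code does before the application starts, the sketch below copies the initial values of .data from their load location into RAM and clears .bss. On a real MCU the addresses would come from linker-script symbols whose names depend on the toolchain; here they are passed in as parameters so the logic can be tried on a host. The function name startup_init_memory and its parameters are hypothetical, not taken from any particular vendor's startup file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: mimics what MCU startup code does before main(). */
void startup_init_memory(const uint8_t *data_load, /* initial values kept in flash */
                         uint8_t *data_start,      /* run-time home of .data in RAM */
                         size_t data_size,
                         uint8_t *bss_start,       /* run-time home of .bss in RAM */
                         size_t bss_size)
{
    assert(data_size == 0 || (data_load != NULL && data_start != NULL));
    assert(bss_size == 0 || bss_start != NULL);

    /* Variables with non-zero initial values: copy them from flash to RAM. */
    for (size_t i = 0; i < data_size; i++) {
        data_start[i] = data_load[i];
    }
    /* Zero-initialised variables: only the RAM needs clearing. */
    for (size_t i = 0; i < bss_size; i++) {
        bss_start[i] = 0;
    }
}
```

Real startup files usually do the same two loops in assembly or minimal C and then jump to main().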
There is no need to add any "intelligence", assuming you mean something like a byte-code interpreter to execute the binary commands. The CPU of the microcontroller executes the machine code directly. And your compiler generates exactly the machine code.

When a process makes a system call to transmit a TCP packet over the network, which of the following steps do NOT always occur?

I am teaching myself OS by going through the lecture notes of the course at IIT Bombay (https://www.cse.iitb.ac.in/~mythili/os/). One of the questions in the Process worksheet asks which of the following doesn't always happen in the situation described at the title. The answer is C.
A. The process moves to kernel mode.
B. The program counter of the CPU shifts to the kernel part of the address space.
C. The process is context-switched out and a separate kernel process starts execution.
D. The OS code that deals with handling TCP/IP packets is invoked
I'm a bit confused though. I thought when an interrupt routine occurs the process is context-switched out so other processes can run and the CPU is not idle during that time. The kernel, then, will take care of the packet sending. Why would C not be correct then?
You are right in saying that "when an interrupt routine occurs the process is context-switched out so other processes can run and the CPU is not idle during that time", but the words "generally" or "mostly" need to be added to it.
In most cases there is another process waiting for CPU time that can be scheduled. However, that is not the case 100% of the time. The question hinges on the word "always": while the other options always occur in the given situation, option C is a choice the OS makes at run time. If the OS determines that switching this process out would be less efficient than performing the system call and resuming the same process, it may not perform the context switch.
There is a cost associated with context switching, so if the other processes are also blocked on I/O it may be optimal for the OS NOT to switch context. There may be other reasons too: if only one process is running, there is no other process to switch to!

Is there any way to terminate a kernel on AMD GPU with OpenCL?

Recently I wrote an OpenCL program to use an AMD GPU. However, as I'm new to this, some problems that I cannot detect directly cause the launched kernel to hang, and clinfo also displays nothing in this condition. Is there any method to kill a running kernel on an AMD GPU? Each time a kernel hangs, rebooting is what I depend on to fix it for now.
There's no standard way to stop a kernel which has begun executing. You should write your kernels so that they stop predictably - i.e. don't use infinite loops, just repeatedly enqueue the kernels from the host instead.
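To make the "bounded kernel, relaunched from the host" idea concrete, here is a minimal host-only sketch in plain C. It does not call OpenCL: run_bounded_chunk is a hypothetical stand-in for one kernel launch that does at most max_steps iterations, and the Collatz walk is just placeholder work. In a real program each call would be an enqueue of the kernel followed by reading back a "done" flag.

```c
#include <assert.h>

/* Hypothetical stand-in for one kernel launch: advances the work by at most
   max_steps iterations and returns 1 once the work is finished, else 0.
   The "work" is a Collatz walk down to 1, chosen only for illustration. */
static int run_bounded_chunk(unsigned long *value, unsigned max_steps)
{
    for (unsigned step = 0; step < max_steps; step++) {
        if (*value == 1) {
            return 1;
        }
        *value = (*value % 2 == 0) ? *value / 2 : 3 * *value + 1;
    }
    return *value == 1;
}

/* Host-side loop: relaunch until done or until the launch budget runs out.
   Returns the number of launches used, or -1 if the budget was exhausted. */
int run_until_done(unsigned long start, unsigned max_steps, int max_launches)
{
    assert(start >= 1 && max_steps > 0);
    unsigned long value = start;
    for (int launch = 1; launch <= max_launches; launch++) {
        if (run_bounded_chunk(&value, max_steps)) {
            return launch;
        }
    }
    return -1; /* gave up instead of spinning forever */
}
```

Because every launch is bounded, a bug in the termination condition costs at most max_launches short launches rather than a hung device.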
One thing I know can hang some AMD GPUs is having a barrier that's only executed by some work-items of a work-group. Note that this is not a bug in the GPU; the OpenCL specification explicitly says this about barrier():
This function must be encountered by all work-items in a work-group executing the kernel.
So either all or none of the work-items must encounter a barrier. If only some do, your program can hang on the GPU. So I'd check the OpenCL code for things like this:
    if (localmem[i] > 0) {
        barrier(CLK_LOCAL_MEM_FENCE);
    }
You could also try using Oclgrind on your code, it's an OpenCL GPU emulator that's useful to find problems in OpenCL code.

How does the host send OpenCL kernels and arguments to the GPU at the assembly level?

So you get a kernel and compile it. You set the cl_buffers for the arguments and then clSetKernelArg the two together.
You then enqueue the kernel to run and read back the buffer.
Now, how does the host program tell the GPU which instructions to run? E.g. I'm on a 2017 MBP with a Radeon Pro 460. At the assembly level, what instructions are called in the host process to tell the GPU "here's what you're going to run"? What mechanism lets the cl_buffers be read by the GPU?
In fact, if you can point me to an in detail explanation of all of this I'd be quite pleased. I'm a toolchain engineer and I'm curious about the toolchain aspects of GPU programming but I'm finding it incredibly hard to find good resources on it.
It pretty much all runs through the GPU driver. The kernel/shader compiler, etc. tend to live in a user space component, but when it comes down to issuing DMAs, memory-mapping, and responding to interrupts (GPU events), that part is at least to some extent covered by the kernel-based component of the GPU driver.
A very simple explanation is that the kernel compiler generates a GPU-model-specific code binary, this gets uploaded to VRAM via DMA, and then a request is added to the GPU's command queue to run a kernel with reference to the VRAM address where that kernel is stored.
With regard to OpenCL memory buffers, there are essentially 3 ways I can think of that this can be implemented:
A buffer is stored in VRAM, and when the CPU needs access to it, that range of VRAM is mapped onto a PCI BAR, which can then be memory-mapped by the CPU for direct access.
The buffer is stored entirely in System RAM, and when the GPU accesses it, it uses DMA to perform read and write operations.
Copies of the buffer are stored both in VRAM and system RAM; the GPU uses the VRAM copy and the CPU uses the system RAM copy. Whenever one processor needs to access the buffer after the other has made modifications to it, DMA is used to copy the newer copy across.
On GPUs with UMA (Intel IGP, AMD APUs, most mobile platforms, etc.) VRAM and system RAM are the same thing, so they can essentially use the best bits of methods 1 & 2.
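The third strategy above (two copies, synchronised on demand) can be sketched as a tiny state machine. This is a toy host-only model, not how any particular driver is written: the type mirrored_buffer, the function names, and the use of memcpy in place of a DMA transfer are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BUF_SIZE 16

typedef enum { IN_SYNC, HOST_NEWER, DEVICE_NEWER } sync_state;

/* Hypothetical model of a buffer mirrored in system RAM and VRAM. */
typedef struct {
    unsigned char host_copy[BUF_SIZE];   /* stands in for system RAM */
    unsigned char device_copy[BUF_SIZE]; /* stands in for VRAM */
    sync_state state;
    int transfers;                       /* number of simulated DMA copies */
} mirrored_buffer;

void mirrored_init(mirrored_buffer *b)
{
    memset(b, 0, sizeof *b);
    b->state = IN_SYNC;
}

/* Bring the host copy up to date before the CPU touches it. */
static void sync_for_host(mirrored_buffer *b)
{
    if (b->state == DEVICE_NEWER) {
        memcpy(b->host_copy, b->device_copy, BUF_SIZE); /* simulated DMA */
        b->transfers++;
        b->state = IN_SYNC;
    }
}

/* Bring the device copy up to date before the GPU touches it. */
static void sync_for_device(mirrored_buffer *b)
{
    if (b->state == HOST_NEWER) {
        memcpy(b->device_copy, b->host_copy, BUF_SIZE); /* simulated DMA */
        b->transfers++;
        b->state = IN_SYNC;
    }
}

void host_write(mirrored_buffer *b, size_t offset, unsigned char v)
{
    assert(offset < BUF_SIZE);
    sync_for_host(b);
    b->host_copy[offset] = v;
    b->state = HOST_NEWER; /* device copy is now stale */
}

unsigned char host_read(mirrored_buffer *b, size_t offset)
{
    assert(offset < BUF_SIZE);
    sync_for_host(b);
    return b->host_copy[offset];
}

/* Models a kernel that increments every byte of the device copy. */
void device_increment_all(mirrored_buffer *b)
{
    sync_for_device(b);
    for (size_t i = 0; i < BUF_SIZE; i++) {
        b->device_copy[i]++;
    }
    b->state = DEVICE_NEWER; /* host copy is now stale */
}
```

The point of the state field is that back-to-back accesses from the same side cost no transfer; a copy happens only when ownership changes hands.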
If you want to take a deep dive on this, I'd say look into the open source GPU drivers on Linux.
Enqueueing the kernel means asking the OpenCL driver to submit work to dedicated HW for execution. In OpenCL, for example, you would call clEnqueueNDRangeKernel (or clEnqueueNativeKernel for a native host function), which adds the dispatch-compute-workload command to the command queue, cl_command_queue.
From the spec:
The command-queue can be used to queue a set of operations (referred to as commands) in order.
https://www.khronos.org/registry/OpenCL/specs/2.2/html/OpenCL_API.html#_command_queues
Next, the implementation of this API will trigger the HW to process the commands recorded in the command queue (which holds all the actual commands in a format that the particular HW understands). The HW might have several queues and process them in parallel. Either way, after the workload from a queue is processed, the HW informs the kernel-mode driver (KMD) via an interrupt, and the KMD is responsible for propagating this update to the OpenCL driver via the OpenCL event mechanism, which allows the user to track workload execution status; see https://www.khronos.org/registry/OpenCL/specs/2.2/html/OpenCL_API.html#clWaitForEvents.
To get a better idea of how the OpenCL driver interacts with the HW, you could take a look at an open-source implementation; see:
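The flow described here (the host records commands in order, the hardware consumes them, and completion is reported back through events) can be modelled with a very small queue. This is a toy sketch with hypothetical names (command_queue, enqueue_command, hardware_process, event_query); a real driver's queue holds hardware-specific command packets and completion arrives via an interrupt rather than a function call.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_CAPACITY 8

typedef enum { EVENT_QUEUED, EVENT_COMPLETE } event_status;

typedef struct {
    int payload;          /* stands in for a hardware-specific command */
    event_status status;  /* what the host can poll or wait on */
} command;

typedef struct {
    command slots[QUEUE_CAPACITY];
    size_t submitted;  /* commands recorded by the host */
    size_t processed;  /* commands the "hardware" has finished */
} command_queue;

void queue_init(command_queue *q)
{
    q->submitted = 0;
    q->processed = 0;
}

/* Host side: record a command. Returns an event handle (the slot index),
   or -1 if the queue is full. */
int enqueue_command(command_queue *q, int payload)
{
    if (q->submitted == QUEUE_CAPACITY) {
        return -1;
    }
    q->slots[q->submitted].payload = payload;
    q->slots[q->submitted].status = EVENT_QUEUED;
    return (int)q->submitted++;
}

/* "Hardware" side: consume up to max_commands in submission order and mark
   each event complete, the step a completion interrupt would report.
   Returns how many commands were processed. */
size_t hardware_process(command_queue *q, size_t max_commands)
{
    size_t done = 0;
    while (done < max_commands && q->processed < q->submitted) {
        q->slots[q->processed].status = EVENT_COMPLETE;
        q->processed++;
        done++;
    }
    return done;
}

/* Host side: check whether a previously enqueued command has finished. */
event_status event_query(const command_queue *q, int event)
{
    assert(event >= 0 && (size_t)event < q->submitted);
    return q->slots[event].status;
}
```

Polling event_query in this model plays the role that waiting on an event plays for the host in OpenCL.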
https://github.com/pocl/pocl/blob/master/lib/CL/clEnqueueNativeKernel.c

How can code be asynchronous on a single-core CPU, which is synchronous?

In a uniprocessor (UP) system, there's only one CPU core, so only one thread of execution can be happening at once. This thread of execution is synchronous (it gets a list of instructions in a queue and runs them one by one). When we write code, it compiles to a set of CPU instructions.
How can we have asynchronous behavior in software on a UP machine? Isn't everything just run in some fixed order chosen by the OS?
Even an out-of-order execution CPU gives the illusion of running instructions in program order. (This is separate from memory reordering observed by other cores or devices in the system. In a UP system, runtime memory reordering is only relevant for device drivers.)
An interrupt handler is a piece of code that runs asynchronously to the rest of the code, and can happen in response to an interrupt from a device outside the CPU. In user-space, a signal handler has equivalent semantics.
(Or a hardware interrupt can cause a context switch to another software thread. This is asynchronous as far as the software thread is concerned.)
Events like interrupts from network packets arriving or disk I/O completing happen asynchronously with respect to whatever the CPU was doing before the interrupt.
Asynchronous doesn't mean simultaneous, just that it can run between any two machine instructions of the rest of the code. A signal handler in a user-space program can run between any two machine instructions, so the code in the main program must work in a way that doesn't break if this happens.
e.g. A program with a signal-handler can't make any assumptions about data on the stack below the current stack pointer (i.e. in the un-reserved part of the stack). The red-zone in the x86-64 SysV ABI is a modification to this rule for user-space only, since the kernel can respect it when transferring control to a signal handler. The kernel itself can't use a red-zone, because hardware interrupts write to the stack outside of software control, before running the interrupt handler.
In an OS where I/O completion can result in the delivery of a POSIX signal (i.e. with POSIX async I/O), the timing of a signal can easily be determined by the timing of a hardware interrupt, so user-space code runs asynchronously with timing determined by things external to the computer. It's not just an issue for the kernel.
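A small standard-C sketch of the signal-handler case: the handler only sets a flag, and the main code polls it. The function name work_until_signalled is hypothetical, and raise() is used here purely so the example is deterministic; it delivers the signal synchronously, whereas a real externally generated signal could arrive between any two instructions of the loop.

```c
#include <assert.h>
#include <signal.h>

/* volatile sig_atomic_t is the classic type that is safe to set from a
   signal handler and read from the interrupted code. */
static volatile sig_atomic_t got_signal = 0;

static void on_signal(int signo)
{
    (void)signo;
    got_signal = 1; /* do the minimum here; the main code reacts later */
}

/* Runs up to max_iterations of "work", polling the flag each time round.
   raise() at iteration trigger_at stands in for an external event.
   Returns the iteration at which the loop noticed the signal, or
   max_iterations if no signal was seen. */
int work_until_signalled(int trigger_at, int max_iterations)
{
    got_signal = 0;
    void (*previous)(int) = signal(SIGINT, on_signal);
    assert(previous != SIG_ERR);

    int i;
    for (i = 0; i < max_iterations; i++) {
        if (got_signal) {
            break;
        }
        if (i == trigger_at) {
            raise(SIGINT); /* handler runs before raise() returns */
        }
    }
    signal(SIGINT, previous); /* restore the earlier disposition */
    return i;
}
```

Keeping the handler down to a single flag write is what lets the main loop stay correct no matter where the interruption lands.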
On a multicore system, there are obviously far more ways for things to happen in different orders more of the time.
Many processors are capable of multithreading, and many operating systems can simulate multithreading on single-threaded processors by swapping tasks in and out of the processor.
