Is It Possible To Edit OpenCL Kernels On The GPU?

I am looking to edit (or generate new) OpenCL kernels on the GPU from an executing kernel.
Right now, the only way I can see of doing this is to create a buffer full of numbers representing code on the device, send that buffer to an array on the host, generate/compile a new kernel on the host, and enqueue the new kernel back to the device.
Is there any way to avoid a round trip to the host, and just edit kernels on the device?
Can a kernel access the registers in any way?

Related

How does a kernel driver talk with another device?

I have an FPGA board with Unix-based firmware. I need to write a program to run on this firmware that sends commands to some devices over the I2C bus and receives their responses. To do this I use a special character file in Unix that I map into my program; I write commands to it and read responses from it. Each area of this mapped memory corresponds to a specific FPGA register, as specified in the Unix-based firmware (as I understand it).
So, the question is this. As I understand it, when I write a command to that mapped memory region of the special character file, the kernel calls a certain driver to handle the bytes I've written and send them over the I2C bus (for example). Am I right? If so, is there any guarantee that the response from the device will be buffered, so that I can read it from the mapped region at any time? Or does that depend on the specific driver implementation?
I'm sorry if the question is unclear in some way; I am a newbie at this stuff.

Vector Table for Bootloader

Most, if not all, microcontrollers have a vector table for all the exceptions that are encountered when the program is running. I am quite confused as to whether the bootloader also has its own vector table for executing the reset handler?

I have written a bootloader that has been ported between a number of different embedded processors used within comms products (MSP430, AVR32, DSP56300). This loader had quite limited functional requirements:
Hardware initialisation.
Validate the main code image using a checksum or cryptographic signature.
Transfer of control to the main application if the validation passes.
Provide a minimal command interface over a serial port to allow for firmware update.
Reentry point to allow the main application to trigger a new firmware load.
This transfer of operation between the two programmes held in non volatile memory on the same processor meant that I had to provide a re-direction of the interrupt functions between the two completely separate applications. The Interrupt table for the MSP430 is held in FLASH and so cannot be modified easily. It also holds the reset vector that must always point to the start of the bootloader code so erasing and rewriting this area runs the risk of completely bricking the unit.
The solution in this case was to have an interrupt service routine in the bootloader that redirected through a vector table located at a fixed location within the application memory space. This method added one indirect jump instruction to each interrupt handler, resulting in a minimal extra processor load for each interrupt that is processed. The bootloader was written to not use any interrupts, relying instead on polling the serial port status for all of its communications. If the bootloader needed to use interrupts, the vector table could be moved to RAM and initialised by the bootloader or application as required. (My processor did not have enough RAM to allow this.)
There was a simple data structure at a fixed location in the main application code that contained entries for each possible interrupt/exception and the start address of the application code (you could consider this the application code's reset vector). As long as the application provided this data structure in flash and populated it correctly, it could be compiled and built completely separately, as if it were the only application on the processor.
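A minimal sketch of the redirection described above, written so that it runs on a host rather than on an MSP430. All names here (app_vector_table, APP_TABLE_MAGIC, NUM_VECTORS, bootloader_redirect_irq) are hypothetical, and the magic word is my own addition for illustration; the original loader is not shown in this answer and may well have looked different.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_VECTORS     4            /* hypothetical number of interrupt sources */
#define APP_TABLE_MAGIC 0xB007AB1Eu  /* hypothetical "table is populated" marker */

static_assert(NUM_VECTORS > 0, "table needs at least one vector");

typedef void (*isr_fn)(void);

/* Structure the application places at a fixed location in its flash image. */
typedef struct {
    uint32_t magic;                /* lets the bootloader detect a missing table */
    isr_fn   entry;                /* application start address ("reset vector") */
    isr_fn   vectors[NUM_VECTORS]; /* one handler per interrupt source */
} app_vector_table;

/* On real hardware this would be a constant pointer to the fixed address.
 * It is settable here only so the sketch can run on a host. */
static const app_vector_table *app_table = NULL;

void bootloader_set_app_table(const app_vector_table *t) { app_table = t; }

/* Called by each hardware vector stub in the bootloader.
 * Returns 1 if the interrupt was forwarded to the application, 0 otherwise. */
int bootloader_redirect_irq(unsigned irq)
{
    if (app_table == NULL || app_table->magic != APP_TABLE_MAGIC)
        return 0;
    if (irq >= NUM_VECTORS || app_table->vectors[irq] == NULL)
        return 0;
    app_table->vectors[irq]();     /* the one extra indirect jump */
    return 1;
}

/* Demo "application": a tick counter, an entry point and its table. */
int demo_timer_ticks = 0;
void demo_timer_isr(void) { demo_timer_ticks++; }
void demo_app_main(void) {}

const app_vector_table demo_table = {
    APP_TABLE_MAGIC, demo_app_main, { NULL, demo_timer_isr, NULL, NULL }
};
```

After calling bootloader_set_app_table(&demo_table), a call to bootloader_redirect_irq(1) runs demo_timer_isr, while an unpopulated or out-of-range vector is simply ignored.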

A bootloader is just a program; there is no magic there. As with any other program, if it needs a vector table, the programmers create a vector table. Certainly, if it needs to run from reset then it must conform to the hardware/processor rules for that. Neither the processor nor any of the other logic has any way to know whether that collection of bits, that collection of instructions, is a bootloader or not. It just runs it.
I assume you are talking about the many, but not all, flavors/brands of microcontrollers that have bootloaders programmed at the factory?
Within an MCU there is a processor core with a memory bus. The many items in the chip (RAM, flash, GPIO, SPI, timer, UART, ADC, etc.) all live on that bus, like having more than one house on a particular street. Each of these items has an address decoder looking for its address on the bus, so that it knows when to take or provide data. The application flash and bootloader ROM are no different, not special. It is as easy in hardware as it is in software to have some sort of if-then-else that decides whether the application flash answers the processor at the vector table addresses. The flag that drives the then or else case can be an input pin to the part: tie the pin high and hit reset and the part boots from one memory; tie the pin low and hit reset and the part boots from the other. There are other solutions out there that are driven not by an input pin to the part but by the bootloader software.
Sometimes the processor core itself has a signal that the chip vendor drives however it wants, and the core does an if-then-else on it: perhaps in the then case it looks for the vector table at 0x00000000, and in the else case at 0xFFFF0000.
At the end of the day, though, a bootloader is just another program, written by someone for this chip. It is no different from the program you write for the chip, other than that the author of a factory-programmed ROM bootloader might have additional information/documentation for the part that we don't have. But being a bootloader is not magic; it is just a program, a collection of instructions. Our application can be a bootloader as well, choosing between several possible applications based on what it finds. Their bootloader can boot our bootloader, which then chooses from several applications and runs one of them. All within one part...

Asynchronous data processing

I am new to OpenCL. I was wondering if you could answer the following question.
I have a queue of data packets which acts like a router queue. The packets arrive, are stored in the queue, and are then processed by the router. Finally, they are inserted into the outgoing queue.
I am trying to use OpenCL to process the packets concurrently. I know that we can use buffers, for example, to transfer data between the host and OpenCL devices. We load the buffers with input/output data. Then we set the kernel parameters using these input/output buffers. When the kernels have finished running, we read the data back from the OpenCL devices.
My question is: how can I write/read a buffer for each single data packet independently of the other packets?
In other words, suppose one data packet arrives and the router needs to process it on a computing device (e.g. core #1 on the GPU). Then another packet arrives, and the router needs to process this second packet on a different computing device (e.g. core #2 on the GPU). The processing of these two packets happens concurrently but asynchronously. How could this be implemented in OpenCL?
Thanks for your reply in advance.
Regards,
Alireza.

I'd recommend a ring buffer of OpenCL buffers to hold your incoming data; as you fill them, enqueue kernels to work on them. You're only going to get good performance if the kernels have parallel data to work on. Even on GPUs that can run multiple kernels at the same time, that number is small (perhaps 2). The real power is the parallel computation within a kernel; otherwise your hardware will be idle.
To get kernels to run in parallel you need to use separate command queues since using a single command queue implies serial execution (unless it is an out-of-order command queue, but those are not widely supported).
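Here is a rough host-side sketch of the ring-buffer idea in plain C, with the OpenCL calls left out so that it runs anywhere. Everything in it (RING_SLOTS, BATCH_SIZE, packet_ring, ring_push, record_dispatch) is a hypothetical name chosen for illustration. In a real program each slot would presumably own a cl_mem buffer, and the dispatch callback would write the batch to the device and enqueue a kernel on the command queue selected by queue_index.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RING_SLOTS   4   /* hypothetical number of batches kept in flight */
#define BATCH_SIZE   8   /* hypothetical packets per kernel launch */
#define PACKET_BYTES 64  /* hypothetical maximum packet size */

typedef struct {
    unsigned char data[BATCH_SIZE][PACKET_BYTES];
    size_t len[BATCH_SIZE];  /* actual length of each stored packet */
    size_t count;            /* packets stored in this slot so far */
} batch_slot;

typedef struct {
    batch_slot slots[RING_SLOTS];
    size_t fill;        /* index of the slot currently being filled */
    size_t dispatched;  /* number of batches handed off so far */
} packet_ring;

/* Called with a full batch and the index of the queue that should run it. */
typedef void (*dispatch_fn)(const batch_slot *slot, size_t queue_index);

void ring_init(packet_ring *r) { memset(r, 0, sizeof *r); }

/* Stores one packet. When the current batch is full, hands it to dispatch()
 * and moves on to the next slot. Returns 1 if a batch was dispatched. */
int ring_push(packet_ring *r, const void *pkt, size_t len, dispatch_fn dispatch)
{
    assert(len <= PACKET_BYTES);
    batch_slot *s = &r->slots[r->fill];
    memcpy(s->data[s->count], pkt, len);
    s->len[s->count] = len;
    s->count++;
    if (s->count < BATCH_SIZE)
        return 0;

    dispatch(s, r->dispatched % 2);  /* alternate between two command queues */
    r->dispatched++;
    r->fill = (r->fill + 1) % RING_SLOTS;
    /* Assumption: work on the slot being reused has already finished.
     * A real host would wait on that slot's completion event first. */
    r->slots[r->fill].count = 0;
    return 1;
}

/* Demo dispatcher that only records what it was asked to do. */
size_t demo_batches = 0, demo_last_queue = 0, demo_last_count = 0;
void record_dispatch(const batch_slot *slot, size_t queue_index)
{
    demo_batches++;
    demo_last_queue = queue_index;
    demo_last_count = slot->count;
}
```

Batching several packets per launch is what gives each kernel parallel data to work on, and alternating the queue index follows the separate-command-queue suggestion for letting batches overlap.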

Linux serial ports -- multithreaded program

I am working on a smartcard reader project in which I have to read/write data from the smartcard reader.
I also have to read/write data from a PC application.
There are two serial ports on my microcontroller, one connected to the smartcard reader and the other to the PC.
Smartcard reader <------> Microcontroller <-----> PC
I have ported Linux and am using the /ttys0 & /ttys1 drivers for this.
1> If the application has to find out whether some data is available to be read from the port, will I always have to check with the read() system call?
2> Does the ttys0 driver have an internal buffer to store received data? Or is data lost if the application does not read it immediately?
3> I am using separate threads for rx/tx on each port; is that the right approach?
Please guide me; I am new to embedded Linux.
//John

Yes, there's a fair amount of buffering on Linux ttys.
You have a few choices for how to interact with them:
You can make them non-blocking and frequently poll to see if you can read data from them (but this may result in uselessly spinning CPU cycles, slowing other tasks).
You can use select() to yield to the scheduler until one of your devices has data for you to act on.
You can use blocking I/O; however, since you have multiple ports, that may also require multiple threads.
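The select() option above might look roughly like this for two descriptors. wait_for_port is a hypothetical helper name, and the descriptors can come from anywhere: two opened serial ports in the real program, or two pipes when trying it out on a desktop.

```c
#include <assert.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Sleeps until one of the two descriptors has data to read, or until
 * timeout_ms milliseconds have passed.
 * Returns 0 if fd_a is readable, 1 if only fd_b is readable,
 * and -1 on timeout or error. */
int wait_for_port(int fd_a, int fd_b, int timeout_ms)
{
    assert(fd_a >= 0 && fd_a < FD_SETSIZE);
    assert(fd_b >= 0 && fd_b < FD_SETSIZE);

    fd_set readable;
    FD_ZERO(&readable);
    FD_SET(fd_a, &readable);
    FD_SET(fd_b, &readable);

    struct timeval tv;
    tv.tv_sec  = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    int maxfd = fd_a > fd_b ? fd_a : fd_b;
    int n = select(maxfd + 1, &readable, NULL, NULL, &tv);
    if (n <= 0)
        return -1;                 /* 0 means timeout, negative means error */

    /* If both are ready, fd_a is reported first; call again for fd_b. */
    return FD_ISSET(fd_a, &readable) ? 0 : 1;
}
```

A single thread can loop on this call and then read() from whichever port is ready, which avoids both busy polling and one thread per port.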

TTY programming is similar to socket programming in Linux. So basically you can set the file descriptor to be asynchronous and receive a signal once data is available. Regarding buffering: yes, it's buffered, using two flip buffers. See chapter 18 of Linux Device Drivers, 3rd edition, for the TTY implementation in the kernel.

Who captures packets first - kernel or driver?

I am trying to send packets from one machine to another using tcpreplay and tcpdump.
If I write a driver for capturing packets directly from the NIC, which path will be followed?
1) Network packet ----> NIC card ----> app (no role of kernel)
2) Network packet ----> Kernel ----> NIC card ----> app
Thanks

It's usually in this order:
NIC hardware gets the electrical signal, hardware updates some of its registers and buffers, which are usually mapped into computer physical memory
Hardware activates the IRQ line
Kernel traps into interrupt-handling routine and invokes the driver IRQ handling function
The driver figures out whether this is for RX or TX
For RX the driver sets up DMA from NIC hardware buffers into kernel memory reserved for network buffers
The driver notifies upper-layer kernel network stack that input is available
Network stack input routine figures out the protocol, optionally does filtering, and whether it has an application interested in this input, and if so, buffers packet for application processing, and if a process is blocked waiting on the input the kernel marks it as runnable
At some point the kernel scheduler puts that process on a CPU and resumes it; the application then consumes the network input
Then there are deviations from this model, but those are special cases for particular hardware/OS. One vendor that does user-land-direct-to-hardware stuff is Solarflare, there are others.

A driver is the piece of code that interacts directly with the hardware. So it is the first piece of code that will see the packet.
However, a driver runs in kernel space; it is itself part of the kernel. And it will certainly rely on kernel facilities (e.g. memory management) to do its job. So "no role of kernel" is not going to be true.