Calculate program execution time in an embedded device from Python code - microcontroller

I have a Python program which I want to deploy on an MCU. Before selecting an MCU for this task, I want to estimate the absolute base requirements for the MCU. On an M1 Pro chip the self CPU execution time is 231.259 ms and the memory used is 16 MB. How do I find out the number of instructions the Python program executes, and how do I estimate how long it would take to run on an MCU?

Related

read and copy buffer from kernel in CPU to kernel in FPGA with OpenCL

I'm trying to speed up the Ethash algorithm on a Xilinx U50 FPGA. My problem is not about the FPGA itself; it is about passing the DAG file that is generated on the CPU to the FPGA.
First, I'm using this code in my test. I made a few changes to support the Intel OpenCL driver. If I only use the CPU to run the Ethash (in this case xleth) program, the whole process completes. But in my case I first generate the DAG file on the CPU, and using 4 cores it takes 30 seconds to generate epoch number 0. After that I want to pass the DAG file (shown in the code as m_dag) to a new buffer, g_dag, to send it to the U50's HBMs.
I can't use only one context in this program, because I'm using 2 separate kernel files (.cl for the CPU and .xclbin for the FPGA), and when I try to create the program and kernel it returns error 33 (CL_INVALID_DEVICE). So I created a separate context (named g_context).
Now I want to know how I can send data from m_context to g_context, and whether that is acceptable and well-optimized for performance. (Please suggest another solution if you have one.)
My code is at this link, so if you can, please send a code solution.

Finding the compute unit id when launching a kernel on an AMD GPU

I am using the ROCm software stack to compile and run OpenCL programs on a Polaris20 (GCN 4th gen) AMD GPU, and I am wondering if there is a way to find out which compute unit (id) on the GPU is currently in use by the current work-item or wavefront.
In other words, can I associate a computation in a kernel to a specific compute unit or specific hardware on GPU, so I can keep track of which part of the hardware is getting utilized while a kernel runs.
Thank you!

How does the host send OpenCL kernels and arguments to the GPU at the assembly level?

So you get a kernel and compile it. You set the cl_buffers for the arguments and then clSetKernelArg the two together.
You then enqueue the kernel to run and read back the buffer.
Now, how does the host program tell the GPU which instructions to run? E.g., I'm on a 2017 MBP with a Radeon Pro 460. At the assembly level, what instructions are executed in the host process to tell the GPU "here's what you're going to run"? What mechanism lets the cl_buffers be read by the GPU?
In fact, if you can point me to a detailed explanation of all of this I'd be quite pleased. I'm a toolchain engineer and I'm curious about the toolchain aspects of GPU programming, but I'm finding it incredibly hard to find good resources on it.
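For reference, the sequence described in the question boils down to a handful of host API calls. Here is a minimal C sketch of that flow (set the buffer argument, enqueue the kernel, read the result back); the context, queue, and kernel handles are assumed to already exist, and the names and sizes are placeholders rather than anything from the original post:

```c
/* Minimal host-side sketch: bind a buffer as a kernel argument,
 * enqueue the kernel, and read the result back. */
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

static void check(cl_int err, const char *what) {
    if (err != CL_SUCCESS) { fprintf(stderr, "%s failed: %d\n", what, err); exit(1); }
}

/* Assumes ctx, queue, and kernel were created earlier via
 * clCreateContext, clCreateCommandQueue, and clCreateKernel. */
void run_once(cl_context ctx, cl_command_queue queue, cl_kernel kernel, size_t n) {
    cl_int err;
    float *host_out = malloc(n * sizeof(float));

    /* Device buffer the kernel will write into. */
    cl_mem out_buf = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float), NULL, &err);
    check(err, "clCreateBuffer");

    /* Bind the buffer to kernel argument 0. */
    check(clSetKernelArg(kernel, 0, sizeof(cl_mem), &out_buf), "clSetKernelArg");

    /* Ask the driver to run n work-items; this only *queues* the work. */
    check(clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL),
          "clEnqueueNDRangeKernel");

    /* Blocking read: forces the queued work to finish and copies results back. */
    check(clEnqueueReadBuffer(queue, out_buf, CL_TRUE, 0, n * sizeof(float),
                              host_out, 0, NULL, NULL),
          "clEnqueueReadBuffer");

    clReleaseMemObject(out_buf);
    free(host_out);
}
```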
It pretty much all runs through the GPU driver. The kernel/shader compiler, etc. tend to live in a user space component, but when it comes down to issuing DMAs, memory-mapping, and responding to interrupts (GPU events), that part is at least to some extent covered by the kernel-based component of the GPU driver.
A very simple explanation is that the kernel compiler generates a GPU-model-specific code binary; this gets uploaded to VRAM via DMA, and then a request is added to the GPU's command queue to run a kernel with reference to the VRAM address where that kernel is stored.
With regard to OpenCL memory buffers, there are essentially 3 ways I can think of that this can be implemented:
1. A buffer is stored in VRAM, and when the CPU needs access to it, that range of VRAM is mapped onto a PCI BAR, which can then be memory-mapped by the CPU for direct access.
2. The buffer is stored entirely in system RAM, and when the GPU accesses it, it uses DMA to perform read and write operations.
3. Copies of the buffer are stored both in VRAM and system RAM; the GPU uses the VRAM copy and the CPU uses the system RAM copy. Whenever one processor needs to access the buffer after the other has made modifications to it, DMA is used to copy the newer copy across.
On GPUs with UMA (Intel IGP, AMD APUs, most mobile platforms, etc.) VRAM and system RAM are the same thing, so they can essentially use the best bits of methods 1 & 2.
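As a rough illustration, the host can at least hint at which of these placement strategies it wants through the buffer-creation flags, although the final placement is still up to the driver. A sketch, assuming ctx is an already-created cl_context and host_ptr points to at least `bytes` bytes of application memory:

```c
/* Sketch: buffer-creation flags that hint where OpenCL may keep the data.
 * The driver is free to choose VRAM, system RAM, or a mirrored copy of both. */
#include <CL/cl.h>

void create_buffers(cl_context ctx, size_t bytes, void *host_ptr) {
    cl_int err;

    /* Typically ends up resident in VRAM (method 1 or 3 above). */
    cl_mem device_pref = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);

    /* Asks the driver to use the application's own system-RAM allocation,
     * which the GPU would then reach via DMA (closer to method 2). */
    cl_mem host_backed = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                        bytes, host_ptr, &err);

    /* Lets the driver allocate host-accessible memory itself, often useful
     * for fast mapping on UMA systems. */
    cl_mem host_alloc = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                       bytes, NULL, &err);

    clReleaseMemObject(device_pref);
    clReleaseMemObject(host_backed);
    clReleaseMemObject(host_alloc);
}
```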
If you want to take a deep dive on this, I'd say look into the open source GPU drivers on Linux.
Enqueueing the kernel means asking the OpenCL driver to submit work to dedicated HW for execution. In OpenCL, for example, you would call the clEnqueueNativeKernel API, which adds the dispatch-compute-workload command to the command queue - cl_command_queue.
From the spec:
The command-queue can be used to queue a set of operations (referred to as commands) in order.
https://www.khronos.org/registry/OpenCL/specs/2.2/html/OpenCL_API.html#_command_queues
Next, the implementation of this API will trigger the HW to process the commands recorded in the command queue (which holds all the actual commands in the format that the particular HW understands). The HW might have several queues and process them in parallel. In any case, after the workload from a queue is processed, the HW informs the KMD (kernel-mode driver) via an interrupt, and the KMD is responsible for propagating this update to the OpenCL driver via the OpenCL event mechanism, which allows the user to track workload execution status - see https://www.khronos.org/registry/OpenCL/specs/2.2/html/OpenCL_API.html#clWaitForEvents.
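A minimal sketch of this submit-and-track pattern, using clEnqueueNDRangeKernel (the common dispatch call) together with clWaitForEvents as described above; the queue and kernel handles are assumed to have been created earlier:

```c
/* Sketch: submit work to the command queue and track completion via an event. */
#include <CL/cl.h>

void dispatch_and_wait(cl_command_queue queue, cl_kernel kernel, size_t global_size) {
    cl_event done;

    /* Record the dispatch command into the command queue; the HW picks it up later. */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, &done);

    /* Make sure the command is actually handed to the device. */
    clFlush(queue);

    /* Block until the driver is notified (ultimately via a HW interrupt)
     * that the workload has finished executing. */
    clWaitForEvents(1, &done);
    clReleaseEvent(done);
}
```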
To get a better idea of how the OpenCL driver interacts with the HW, you could take a look at an open-source implementation, see:
https://github.com/pocl/pocl/blob/master/lib/CL/clEnqueueNativeKernel.c

OpenCL kernel compilation time

I am new to the FPGA world. I tried to compile some OpenCL programs, but I noticed that it takes a very long time to compile even the "Hello_World" program (a couple of hours). So I am wondering: why does compiling an OpenCL kernel for an FPGA take such a long time (hours)? In addition, does the FPGA get re-programmed when we compile/execute the OpenCL program on it?
Converting sequential code to hardware is difficult, and in some cases the compiler tries multiple versions of things to find the most optimal combination of hardware. It's not like compiling for CPUs and GPUs, so the workflow is quite different (you compile kernels at build time and not at runtime). The end result is often hardware that is faster and/or uses less energy than more general-purpose compute devices like CPUs and GPUs. There are some excellent "OpenCL on Altera" videos that explain how the compilation works, but a summary is: compile to an abstract machine, and for each instruction/step, remove the abstract-machine hardware not needed for that step, then merge all the remaining hardware into what gets programmed on the chip. The data "flows" through the hardware rather than living in memory and registers like it does on a CPU/GPU.
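To make the build-time-compilation point concrete, here is a rough host-side sketch of loading a kernel binary that was compiled offline (for example an .xclbin or .aocx produced by the vendor's offline compiler); the file name and device handle are placeholders, not part of the original answer:

```c
/* Sketch: load an offline-compiled FPGA kernel binary instead of compiling
 * OpenCL C at runtime. "kernel.xclbin" stands in for the real binary path. */
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

cl_program load_fpga_binary(cl_context ctx, cl_device_id dev, const char *path) {
    FILE *f = fopen(path, "rb");
    if (!f) { perror("fopen"); exit(1); }
    fseek(f, 0, SEEK_END);
    size_t size = ftell(f);
    fseek(f, 0, SEEK_SET);

    unsigned char *binary = malloc(size);
    if (fread(binary, 1, size, f) != size) { perror("fread"); exit(1); }
    fclose(f);

    cl_int err, binary_status;
    /* The hours-long hardware compilation already happened offline;
     * this call just hands the finished bitstream to the runtime. */
    cl_program prog = clCreateProgramWithBinary(ctx, 1, &dev, &size,
                                                (const unsigned char **)&binary,
                                                &binary_status, &err);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL); /* finalize/link step */
    free(binary);
    return prog;
}
```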

OpenCL long kernel execution time

I implemented some image processing in OpenCL on a GPU. In the host program I launch this kernel 4 times; the total time is about 13 ms (according to the AMD profiler), which I think is a good result. But if I measure the kernel execution time on the host with QueryPerformanceCounter, it shows about 26 ms. The clEnqueueNDRangeKernel call itself takes less than 1 ms. Where do the other 26-13 = 13 ms go, and how do I fix it? I launch it on GPU 1: AMD Radeon HD 6900 Series, using AMD SDK 3.0. If I launch the kernel once but add a 4-iteration loop inside the kernel, the result is the same.
clEnqueueNDRangeKernel, as the name says, is an "enqueue" call. So it only queues work to a command queue. That does not mean that the work is completed before the call returns; in fact, it may not even have started.
The API has probably just packed the work into a tidy structure of commands and added it to the queue (submit phase).
You have to measure the kernel execution using event timers (cl_event) with a profiling-enabled queue. That is the real execution time on the device.
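A minimal sketch of that measurement, assuming the queue is created with CL_QUEUE_PROFILING_ENABLE (using the OpenCL 1.x clCreateCommandQueue for simplicity); the kernel handle and work size are placeholders:

```c
/* Sketch: measure device-side kernel time with a profiling-enabled queue. */
#include <CL/cl.h>
#include <stdio.h>

void time_kernel(cl_context ctx, cl_device_id dev, cl_kernel kernel, size_t global_size) {
    cl_int err;
    /* The queue must be created with profiling enabled for the timestamps to be valid. */
    cl_command_queue queue =
        clCreateCommandQueue(ctx, dev, CL_QUEUE_PROFILING_ENABLE, &err);

    cl_event evt;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, &evt);
    clWaitForEvents(1, &evt);

    cl_ulong start = 0, end = 0;
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(end), &end, NULL);

    /* Timestamps are reported in nanoseconds. */
    printf("device execution time: %.3f ms\n", (end - start) * 1e-6);

    clReleaseEvent(evt);
    clReleaseCommandQueue(queue);
}
```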
Alternatively, it is possible to measure the total "roundtrip" time by measuring from the enqueue call to clFinish. But that will include all the overheads that are usually hidden in a pipelined scenario, so normally the first approach is preferred.
