Read and copy a buffer from a CPU kernel to an FPGA kernel with OpenCL

I'm trying to speed up the Ethash algorithm on a Xilinx U50 FPGA. My problem is not about the FPGA itself; it is about passing the DAG file that is generated on the CPU to the FPGA.
First, I'm using this code in my test. I made a few changes to support the Intel OpenCL driver. If I use only the CPU to run the Ethash (in this case xleth) program, the whole process completes. In my case, however, I first generate the DAG file on the CPU, which takes 30 seconds for epoch number 0 using 4 cores. After that I want to pass the DAG file (m_dag in the code) to a new buffer, something like g_dag, and send it to the U50's HBM.
I can't use a single context in this program, because I'm using 2 separate kernel files (.cl for the CPU and .xclbin for the FPGA), and when I try to build the program and kernel I get error -33 (CL_INVALID_DEVICE). So I made a separate context (named g_context).
Now I want to know how I can send data from m_context to g_context. Is that OK, and is it optimal in terms of performance? (Please suggest another solution if you have one.)
I posted my code at this link, so if you can, please just send me a code solution.


Low performance with MPI communication within a single node

I have a program that uses the Open MPI implementation of MPI for data exchange between processes. Right now I am running this program on only one node, where the data has to be shared from one process to all the others. The total amount of data that the master process sends is 130 Gb, which is split and sent to 6-8 client processes, but this data transfer takes an awfully long time (1 hour).
Knowing that the code is running on the very same node, I would expect that the data transfer could be sped up through the settings I pass when I launch mpirun. Do you know which settings could help me get a faster data transfer in this scenario? Right now the only optional components I am using are "--mca btl vader,self".
The actual code uses MPI_Send() calls, each of which transfers an amount of data close to the maximum that a single call can handle. Once all the data for one client process has been transferred through multiple MPI_Send() calls, the master process sends data to the next pending client process.
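To put these numbers in perspective, here is a rough back-of-the-envelope sketch. It rests on two assumptions that are not stated above: that "130 Gb" means 130 gigabytes (10^9 bytes each), and that the data is sent as bytes, so a single MPI_Send() is capped by its int count argument at 2^31 - 1 bytes. The function names are made up for this illustration.

```python
# Assumptions (not from the original post): "130 Gb" is read as 130 gigabytes,
# and each MPI_Send() moves at most 2**31 - 1 bytes (32-bit int count, MPI_BYTE).
TOTAL_BYTES = 130 * 10**9
ELAPSED_SECONDS = 3600          # the reported 1 hour
MAX_BYTES_PER_SEND = 2**31 - 1


def effective_bandwidth_mb_s(total_bytes, seconds):
    """Average throughput in megabytes (10**6 bytes) per second."""
    return total_bytes / seconds / 1e6


def sends_needed(total_bytes, max_bytes_per_send):
    """Minimum number of maximum-size sends needed (integer ceiling division)."""
    return -(-total_bytes // max_bytes_per_send)


if __name__ == "__main__":
    print("average throughput (MB/s):",
          round(effective_bandwidth_mb_s(TOTAL_BYTES, ELAPSED_SECONDS), 1))
    print("minimum number of sends:",
          sends_needed(TOTAL_BYTES, MAX_BYTES_PER_SEND))
```

Under these assumptions the average rate works out to roughly 36 MB/s, which is far below typical in-memory copy rates, so the bottleneck is unlikely to be the raw size of the data alone.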

How does the host send OpenCL kernels and arguments to the GPU at the assembly level?

So you get a kernel and compile it. You set the cl_buffers for the arguments and then clSetKernelArg the two together.
You then enqueue the kernel to run and read back the buffer.
Now, how does the host program tell the GPU which instructions to run? E.g. I'm on a 2017 MBP with a Radeon Pro 460. At the assembly level, what instructions are called in the host process to tell the GPU "here's what you're going to run"? What mechanism lets the cl_buffers be read by the GPU?
In fact, if you can point me to an in-depth explanation of all of this I'd be quite pleased. I'm a toolchain engineer and I'm curious about the toolchain aspects of GPU programming, but I'm finding it incredibly hard to find good resources on it.
It pretty much all runs through the GPU driver. The kernel/shader compiler, etc. tend to live in a user space component, but when it comes down to issuing DMAs, memory-mapping, and responding to interrupts (GPU events), that part is at least to some extent covered by the kernel-based component of the GPU driver.
A very simple explanation is that the kernel compiler generates a GPU-model-specific code binary, this gets uploaded to VRAM via DMA, and then a request is added to the GPU's command queue to run a kernel with reference to the VRAM address where that kernel is stored.
With regard to OpenCL memory buffers, there are essentially 3 ways I can think of that this can be implemented:
1. A buffer is stored in VRAM, and when the CPU needs access to it, that range of VRAM is mapped onto a PCI BAR, which can then be memory-mapped by the CPU for direct access.
2. The buffer is stored entirely in System RAM, and when the GPU accesses it, it uses DMA to perform read and write operations.
3. Copies of the buffer are stored both in VRAM and system RAM; the GPU uses the VRAM copy and the CPU uses the system RAM copy. Whenever one processor needs to access the buffer after the other has made modifications to it, DMA is used to copy the newer copy across.
On GPUs with UMA (Intel IGP, AMD APUs, most mobile platforms, etc.) VRAM and system RAM are the same thing, so they can essentially use the best bits of methods 1 & 2.
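As a purely illustrative toy model of the third approach (copies in both VRAM and system RAM), the sketch below keeps two byte arrays and copies across only when one side reads or writes after the other side has modified the buffer. The class and its names are hypothetical; no real driver is structured this simply, and the "DMA" here is just a Python slice copy.

```python
class MirroredBuffer:
    """Toy model: one copy in 'VRAM', one in system RAM, synced on demand."""

    def __init__(self, size):
        self.host = bytearray(size)    # stands in for the system RAM copy
        self.device = bytearray(size)  # stands in for the VRAM copy
        self.newer = None              # side holding unsynced writes, if any
        self.dma_copies = 0            # number of simulated DMA transfers

    def _sync_to(self, side):
        # Copy across only if the *other* side has unsynced modifications.
        if self.newer is not None and self.newer != side:
            if side == "host":
                self.host[:] = self.device
            else:
                self.device[:] = self.host
            self.dma_copies += 1
            self.newer = None

    def write(self, side, offset, data):
        self._sync_to(side)
        target = self.host if side == "host" else self.device
        target[offset:offset + len(data)] = data
        self.newer = side

    def read(self, side, offset, length):
        self._sync_to(side)
        source = self.host if side == "host" else self.device
        return bytes(source[offset:offset + length])


if __name__ == "__main__":
    buf = MirroredBuffer(8)
    buf.write("host", 0, b"abcd")        # CPU fills the buffer
    print(buf.read("device", 0, 4))      # first GPU access triggers one copy
    print("simulated DMA copies:", buf.dma_copies)
```

The point of the model is only that repeated accesses from the same side cost nothing extra, while every change of "owner" after a write costs a full transfer.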
If you want to take a deep dive on this, I'd say look into the open source GPU drivers on Linux.
Enqueuing the kernel means asking the OpenCL driver to submit work to dedicated HW for execution. In OpenCL, for example, you would call the clEnqueueNDRangeKernel API, which adds the dispatch-compute-workload command to the command queue (cl_command_queue).
From the spec:
The command-queue can be used to queue a set of operations (referred to as commands) in order.
Next, the implementation of this API will trigger the HW to process the commands recorded in the command queue (which holds all the actual commands in the format that the particular HW understands). The HW might have several queues and process them in parallel. In any case, after the workload from a queue has been processed, the HW informs the KMD (kernel-mode driver) via an interrupt, and the KMD is responsible for propagating this update to the OpenCL driver via the event mechanism that OpenCL supports, which allows the user to track workload execution status - see
To get a better idea of how the OpenCL driver interacts with the HW, you could take a look at an open-source implementation, see:
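The enqueue, execute, notify flow described above can be mimicked with a small toy model. This is only a sketch under loose assumptions: a worker thread plays the role of the hardware, a threading.Event stands in for the interrupt-to-event path, and ToyCommandQueue is a made-up name rather than anything from a real OpenCL stack.

```python
import queue
import threading


class ToyCommandQueue:
    """In-order command queue; a worker thread plays the role of the HW."""

    def __init__(self):
        self._commands = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def enqueue(self, func, *args):
        """Record a command and return an event that is set on completion."""
        done = threading.Event()
        self._commands.put((func, args, done))
        return done

    def _run(self):
        while True:
            func, args, done = self._commands.get()
            func(*args)  # the "hardware" executes the command
            done.set()   # stands in for interrupt -> KMD -> OpenCL event


results = []
toy_queue = ToyCommandQueue()
first = toy_queue.enqueue(results.append, "upload kernel binary")
second = toy_queue.enqueue(results.append, "run kernel")
second.wait(timeout=5)  # like waiting on the event of the last command
print(results)
```

Because the queue is in-order and has a single worker, waiting on the event of the last command is enough to know that the earlier ones have finished too.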

Collecting an MPI Trace

How can I collect an MPI communication trace on Supercomputers?
I need text files with details of each message (say sender, receiver, size, etc.) that I can parse.
I was using the following command for Intel MPI and do not see any text files.
mpirun -trace -n 4 -trace-pt2pt -trace-collectives ./myApp
I am not familiar with Intel MPI's integrated solution.
There are a number of tools that provide MPI tracing.
Performance focussed:
Score-P (Fileformat OTF2)
Correctness checking:
I recommend not rolling your own solution, because it's not straightforward to match receives to sends, and you might run into timing issues because timers are not synchronized across nodes.
You could, for example, trace a run using Score-P and then use the otf2-print command on the trace to get the text output you wanted. Or you can use the OTF2 reader library and develop a tool on top of it. Here is a short tutorial on how to run Score-P, starting at slide 17.

How to know when no data is coming on a serial port (Unix)

I'm working with 2 small machines with limited Unix tools. They are connected to each other via a serial line. I'm transferring binary data, so the devices are in raw mode. The sending machine sends files to the other one, with a delay of X ms (specified as a parameter) between files. I would like to know if it's possible to measure those delays on the destination machine in order to identify how many files are arriving. Until now I was using cat < /dev/ttyS5, but this is not an option for my purpose.
Any idea?
IMHO the easiest way is to write a little program which waits for bytes on the serial line.
Every time a character arrives, some sort of timer/timestamp is reset.
Another thread could evaluate this timer/timestamp in a loop and increment a counter if it's larger than a defined value.
But please be aware that you might see additional delays on the serial line, as the kernel and its scheduler sit "in between". Furthermore, you'll need appropriate locking, of course!
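A minimal sketch of this idea follows, with one deliberate change: instead of a second thread and locking, it uses the timeout of select() as the timer, so a single loop is enough. It assumes Python is available on the receiving machine, which may not hold on a very limited system; count_bursts and its parameters are names made up for this example.

```python
import os
import select


def count_bursts(fd, gap_seconds, max_idle_seconds=None, chunk_size=4096):
    """Count runs of incoming data separated by >= gap_seconds of silence.

    Returns at end of input, or once the line has been silent for
    max_idle_seconds (if given).
    """
    bursts = 0
    in_burst = False
    idle = 0.0
    while True:
        # Wait for data, but for no longer than the configured gap.
        readable, _, _ = select.select([fd], [], [], gap_seconds)
        if not readable:
            # Timeout: the line was quiet long enough to end the current burst.
            in_burst = False
            idle += gap_seconds
            if max_idle_seconds is not None and idle >= max_idle_seconds:
                return bursts
            continue
        data = os.read(fd, chunk_size)
        if not data:
            return bursts  # end of input (e.g. a closed pipe)
        idle = 0.0
        if not in_burst:
            bursts += 1    # first bytes after a quiet period: a new file
            in_burst = True
```

On the real device the descriptor would come from something like os.open("/dev/ttyS5", os.O_RDONLY | os.O_NOCTTY), with gap_seconds chosen somewhat below the X ms delay used by the sender. The scheduling caveat above still applies: if the kernel delays delivery, two files can merge into one burst.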

On what parameters does the boot sequence vary?

Does every Unix flavor have the same boot sequence code? I mean, different flavors ship different kernel versions, so can the boot sequence code differ once the kernel is loaded? Or do they always keep their boot sequence (or code) common?
Edit: I want to know in detail how the boot process works.
Where does the MBR find GRUB? How is this information stored? Is it hard-coded by default?
Is there any block-level partition architecture involved in the boot sequence?
How does GRUB locate the kernel image? Is there a common place where the kernel image is stored?
I searched a lot on the web, but it only shows the common architecture BIOS -> MBR -> GRUB -> Kernel -> Init.
I want to know the details of everything. What should I do to learn all this? Is there any way I could debug the boot process?
Thanks in advance!
First of all, the boot process is extremely platform and kernel dependent.
The point is normally getting the kernel image loaded somewhere in memory and run it, but details may differ:
where do I get the kernel image? (file on a partition? fixed offset on the device? should I just map a device in memory?)
what should be loaded? (only a "core" image? also a ramdisk with additional data?)
where should it be loaded? Is additional initialization (CPU/MMU status, device initialization, ...) required?
are there kernel parameters to pass? Where should they be put for the kernel to see?
where is the configuration for the bootloader itself stored (hard-coded, files on a partition, ...)? How to load the additional modules? (bootloaders like GRUB are actually small OSes by themselves)
Different bootloaders and OSes may do this stuff differently. The "UNIX-like" bit is not relevant: an OS starts being ostensibly UNIXy (POSIX syscalls, init process, POSIX userland, ...) mostly after the kernel starts running.
Even on common x86 PCs the start differs deeply between "traditional BIOS" and UEFI mode (in the latter case, the UEFI itself can load and start the kernel, without additional bootloaders being involved).
Coming down to the start of a modern Linux distribution on x86 in BIOS mode with GRUB2, the basic idea is to quickly get up and running a system which can deal with "normal" PC abstractions (disk partitions, files on filesystems, ...), keeping to a minimum the code that has to deal with hard-coded disk offsets.
1. GRUB is not a monolithic program; it is composed of stages. When booting, the BIOS loads and executes the code stored in the MBR, which is the first stage of GRUB. Since the amount of code that can be stored there is extremely limited (a few hundred bytes), all this code does is act as a trampoline for the next GRUB stage (in a sense, it "boots GRUB");
2. the MBR code contains the hard-coded address of the first sector of the "core image"; this, in turn, contains the code to load the rest of the "core image" from disk (again, hard-coded as a list of disk sectors);
3. once the core image is loaded, the ugly work is done, since the GRUB core image normally contains basic file system drivers, so it can load additional configuration and modules from regular files on the boot partition;
4. now what happens depends on the configuration of the specific boot entry; for booting Linux, usually there are two files involved: the kernel image and the initrd:
   - the initrd contains the "initial ramdisk", i.e. the barebones userland mounted as / in the early boot process (before the kernel has mounted the filesystems); it mostly contains device detection helpers, device drivers, filesystem drivers, ... to allow the kernel to load on demand the code needed to mount the "real" root partition;
   - the kernel image is a (usually compressed) executable image in some format, which contains the actual kernel code; the bootloader loads it into memory (following some rules), puts the kernel parameters and the initrd memory position in some memory location and then jumps to the kernel entry point, whence the kernel takes over the boot process;
5. from there, the "real" Linux boot process starts, which normally involves loading device drivers, starting init, mounting disks and so on.
Again, this is all (x86, BIOS, Linux, GRUB2)-specific; points 1-2 are different on architectures without an MBR, and are skipped completely if GRUB is loaded straight from UEFI; 1-3 are different/avoided if UEFI (or some other loader) is used to load the kernel image directly. The initrd may not be involved at all if the kernel image already bundles everything needed to start (typical of embedded images); details of points 4-5 are different for different OSes (although the basic idea is usually similar). And on embedded machines the kernel may be placed directly at a "magic" location that is automatically mapped in memory and run at start.
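To make the "few hundred bytes" limit and the question about block-level partition layout a bit more concrete, here is a small sketch of the classic MBR sector layout: 446 bytes of boot code, four 16-byte partition entries and the 0x55 0xAA signature, 512 bytes in total. It parses a synthetic sector built in memory; parse_mbr and make_demo_mbr are names invented for this example, and the demo partition values are arbitrary.

```python
import struct

SECTOR_SIZE = 512
BOOT_CODE_SIZE = 446  # space available for the first-stage boot code
# Partition entry: status, CHS start, type, CHS end, LBA start, sector count.
PARTITION_ENTRY = struct.Struct("<B3sB3sII")  # 16 bytes, little-endian
BOOT_SIGNATURE = b"\x55\xaa"


def parse_mbr(sector):
    """Return the non-empty partition entries of a classic MBR sector."""
    if len(sector) != SECTOR_SIZE:
        raise ValueError("an MBR is exactly one 512-byte sector")
    if sector[510:512] != BOOT_SIGNATURE:
        raise ValueError("missing 0x55 0xAA boot signature")
    partitions = []
    for index in range(4):
        offset = BOOT_CODE_SIZE + index * PARTITION_ENTRY.size
        status, _chs_first, ptype, _chs_last, lba_start, sectors = (
            PARTITION_ENTRY.unpack_from(sector, offset))
        if ptype != 0:  # type 0 marks an unused entry
            partitions.append({
                "index": index,
                "bootable": status == 0x80,
                "type": ptype,
                "lba_start": lba_start,
                "sectors": sectors,
            })
    return partitions


def make_demo_mbr():
    """Build a synthetic sector with one bootable Linux (0x83) partition."""
    sector = bytearray(SECTOR_SIZE)
    PARTITION_ENTRY.pack_into(sector, BOOT_CODE_SIZE,
                              0x80, b"\x00" * 3, 0x83, b"\x00" * 3,
                              2048, 204800)
    sector[510:512] = BOOT_SIGNATURE
    return bytes(sector)


if __name__ == "__main__":
    for entry in parse_mbr(make_demo_mbr()):
        print(entry)
```

The same function could be fed the first 512 bytes read from a real disk device (which normally requires root). Note that it only decodes the partition table and signature; it does not try to interpret the boot code area where GRUB's first stage and its hard-coded sector address would live.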
