What does the return value of cudaDeviceProp::asyncEngineCount mean?

I have read the documentation, and it says that if the value is 1:
device can concurrently copy memory between host and device while executing a kernel
If it is 2:
device can concurrently copy memory between host and device in both directions and execute a kernel at the same time
What exactly is the difference?

With 1 DMA engine, the device can either download data from the CPU or upload data to the CPU, but not do both simultaneously. With 2 DMA engines, the device can do both in parallel.
Regardless of the number of available DMA engines, the device also has an execution engine which can be running a kernel in parallel with ongoing memory operations.
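The overlap described above can be sketched with CUDA streams. This is a hypothetical illustration, not from the original answer: `process_on_gpu` stands in for a real kernel launch, the host buffers are assumed to be pinned (allocated with cudaMallocHost, which async copies require for true overlap), and error checking is omitted.

```c
#include <cuda_runtime.h>

/* Hypothetical wrapper; a real program would launch its own
 * __global__ kernel into the given stream here. */
void process_on_gpu(float *d_in, float *d_out, size_t n, cudaStream_t s);

void pipeline(float *h_in[2], float *h_out[2],
              float *d_in[2], float *d_out[2], size_t nbytes)
{
    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i)
        cudaStreamCreate(&s[i]);

    /* Each stream runs its own copy-in -> kernel -> copy-out chain.
     * With asyncEngineCount == 2, stream 0's device-to-host copy can
     * overlap stream 1's host-to-device copy (opposite directions).
     * With asyncEngineCount == 1, the two copies serialize against
     * each other, though either can still overlap a running kernel. */
    for (int i = 0; i < 2; ++i) {
        cudaMemcpyAsync(d_in[i], h_in[i], nbytes,
                        cudaMemcpyHostToDevice, s[i]);
        process_on_gpu(d_in[i], d_out[i], nbytes / sizeof(float), s[i]);
        cudaMemcpyAsync(h_out[i], d_out[i], nbytes,
                        cudaMemcpyDeviceToHost, s[i]);
    }
    for (int i = 0; i < 2; ++i) {
        cudaStreamSynchronize(s[i]);
        cudaStreamDestroy(s[i]);
    }
}
```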

Related

Difference between two perf events for intel processors

What is the difference between following perf events for intel processors:
UNC_CHA_DIR_UPDATE.HA: Counts only multi-socket cacheline Directory state updates memory writes issued from the HA pipe. This does not include memory write requests which are for I (Invalid) or E (Exclusive) cachelines.
UNC_CHA_DIR_UPDATE.TOR: Counts only multi-socket cacheline Directory state updates due to memory writes issued from the TOR pipe which are the result of remote transaction hitting the SF/LLC and returning data Core2Core. This does not include memory write requests which are for I (Invalid) or E (Exclusive) cachelines.
UNC_M2M_DIRECTORY_UPDATE.ANY: Counts when the M2M (Mesh to Memory) updates the multi-socket cacheline Directory to a new state.
The above description about perf events is taken from here.
In particular, if there is a directory update because of the memory write request coming from a remote socket then which perf event will account for that if any?
As per my understanding, since the CHA is responsible for handling the requests coming from the remote sockets via UPI, the directory updates caused by remote requests should be reflected by UNC_CHA_DIR_UPDATE.HA or UNC_CHA_DIR_UPDATE.TOR. But when I run a program (which I will explain shortly), the UNC_M2M_DIRECTORY_UPDATE.ANY count is much larger (more than 34M), whereas the other two events have counts in the order of a few thousand. Since there are no writes happening other than those coming from the remote socket, it seems that it is UNC_M2M_DIRECTORY_UPDATE.ANY, and not the other two events, that measures the number of directory updates happening due to remote writes.
Description of the system
Intel Xeon GOLD 6242 CPU (Intel Cascadelake architecture)
4 sockets with each socket having PMEM attached to it
part of the PMEM is configured to be used as system RAM on sockets 2 and 3
OS: Linux (kernel 5.4.0-72-generic)
Description of the program:
Note: use numactl to bind the process to node 2, which is a DRAM node
Allocate two buffers of size 1GB each
Initialize these buffers
Move the second buffer to the PMEM attached to socket 3
Perform a data copy from the first buffer to the second buffer
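A run of the experiment above might be driven like this. This is a hypothetical sketch: `./copy_benchmark` stands for the copy program described, and the symbolic uncore event names only resolve if this machine's perf build knows them (check `perf list` first).

```shell
# Count the three directory-update events system-wide (-a) while the
# copy program runs bound to NUMA node 2, as described above.
# Uncore events are socket-level, so system-wide counting is needed.
sudo perf stat -a \
  -e UNC_CHA_DIR_UPDATE.HA \
  -e UNC_CHA_DIR_UPDATE.TOR \
  -e UNC_M2M_DIRECTORY_UPDATE.ANY \
  -- numactl --cpunodebind=2 --membind=2 ./copy_benchmark
```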

OpenCL: can I do simultaneous "read" operations?

I have an OpenCL buffer created with the read and write flags. Can I access the same memory address simultaneously? Say, by calling enqueueReadBuffer and a kernel that doesn't modify the contents, out of order and without wait lists, or by two calls to kernels that only read from the buffer.
Yes, you can do so. Create 2 queues, then call clEnqueueReadBuffer and clEnqueueNDRangeKernel on different queues.
It ultimately depends on whether the device and driver support executing different queues at the same time. Most GPUs can, while embedded devices may or may not.
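A minimal sketch of the two-queue approach, assuming `ctx`, `dev`, `buf`, `readonly_kernel`, `gsz`, `nbytes`, and `host_ptr` already exist (hypothetical names; error checking omitted):

```c
#include <CL/cl.h>

void read_while_kernel_runs(cl_context ctx, cl_device_id dev,
                            cl_mem buf, cl_kernel readonly_kernel,
                            size_t gsz, size_t nbytes, void *host_ptr)
{
    cl_int err;
    /* Two independent in-order queues: commands in different queues
     * have no implicit ordering, so the driver may run them
     * concurrently if the hardware supports it. */
    cl_command_queue q1 =
        clCreateCommandQueueWithProperties(ctx, dev, NULL, &err);
    cl_command_queue q2 =
        clCreateCommandQueueWithProperties(ctx, dev, NULL, &err);

    clEnqueueNDRangeKernel(q1, readonly_kernel, 1, NULL, &gsz,
                           NULL, 0, NULL, NULL);
    /* Non-blocking read on the other queue; both operations only
     * read the buffer, so there is no data race. */
    clEnqueueReadBuffer(q2, buf, CL_FALSE, 0, nbytes, host_ptr,
                        0, NULL, NULL);

    clFinish(q1);
    clFinish(q2);
    clReleaseCommandQueue(q1);
    clReleaseCommandQueue(q2);
}
```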

How does the host send OpenCL kernels and arguments to the GPU at the assembly level?

So you get a kernel and compile it. You set the cl_buffers for the arguments and then clSetKernelArg the two together.
You then enqueue the kernel to run and read back the buffer.
Now, how does the host program tell the GPU which instructions to run? E.g. I'm on a 2017 MBP with a Radeon Pro 460. At the assembly level, what instructions are called in the host process to tell the GPU "here's what you're going to run"? What mechanism lets the cl_buffers be read by the GPU?
In fact, if you can point me to an in detail explanation of all of this I'd be quite pleased. I'm a toolchain engineer and I'm curious about the toolchain aspects of GPU programming but I'm finding it incredibly hard to find good resources on it.
It pretty much all runs through the GPU driver. The kernel/shader compiler, etc. tend to live in a user space component, but when it comes down to issuing DMAs, memory-mapping, and responding to interrupts (GPU events), that part is at least to some extent covered by the kernel-based component of the GPU driver.
A very simple explanation is that the kernel compiler generates a GPU-model-specific code binary, this gets uploaded to VRAM via DMA, and then a request is added to the GPU's command queue to run a kernel with reference to the VRAM address where that kernel is stored.
With regard to OpenCL memory buffers, there are essentially 3 ways I can think of that this can be implemented:
A buffer is stored in VRAM, and when the CPU needs access to it, that range of VRAM is mapped onto a PCI BAR, which can then be memory-mapped by the CPU for direct access.
The buffer is stored entirely in System RAM, and when the GPU accesses it, it uses DMA to perform read and write operations.
Copies of the buffer are stored both in VRAM and system RAM; the GPU uses the VRAM copy and the CPU uses the system RAM copy. Whenever one processor needs to access the buffer after the other has made modifications to it, DMA is used to copy the newer copy across.
On GPUs with UMA (Intel IGP, AMD APUs, most mobile platforms, etc.) VRAM and system RAM are the same thing, so they can essentially use the best bits of methods 1 & 2.
If you want to take a deep dive on this, I'd say look into the open source GPU drivers on Linux.
Enqueuing the kernel means asking the OpenCL driver to submit work to dedicated HW for execution. In OpenCL, for example, you would call the clEnqueueNDRangeKernel API, which adds a dispatch-compute-workload command to the command queue - cl_command_queue.
From the spec:
The command-queue can be used to queue a set of operations (referred to as commands) in order.
https://www.khronos.org/registry/OpenCL/specs/2.2/html/OpenCL_API.html#_command_queues
Next, the implementation of this API triggers the HW to process the commands recorded into the command queue (which holds the actual commands in the format that the particular HW understands). The HW might have several queues and process them in parallel. After the workload from a queue is processed, the HW informs the kernel-mode driver (KMD) via an interrupt, and the KMD is responsible for propagating this update to the OpenCL driver via the OpenCL event mechanism, which allows the user to track workload execution status - see https://www.khronos.org/registry/OpenCL/specs/2.2/html/OpenCL_API.html#clWaitForEvents.
To get a better idea of how the OpenCL driver interacts with the HW, you could take a look at an open-source implementation, see:
https://github.com/pocl/pocl/blob/master/lib/CL/clEnqueueNativeKernel.c
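For illustration, the dispatch-and-wait path described above looks roughly like this on the API side (a sketch assuming `queue` and `kernel` already exist; everything between the enqueue and the event completion is hidden inside the user-space driver and the KMD):

```c
#include <CL/cl.h>

void dispatch_and_wait(cl_command_queue queue, cl_kernel kernel)
{
    cl_event done;
    size_t gsz = 1024;  /* global work size; arbitrary for the sketch */

    /* Records a dispatch command into the queue; the driver later
     * translates it into the HW-specific command format. */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gsz, NULL,
                           0, NULL, &done);

    /* Returns once the HW interrupt has propagated back up through
     * the KMD and the OpenCL driver has signalled the event. */
    clWaitForEvents(1, &done);
    clReleaseEvent(done);
}
```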

How does OpenCL host execute kernels on itself?

When we have a multi-core CPU, OpenCL treats it as a single device with multiple compute units, and for every device we can create some command queues. How can the CPU, as a host, create a command queue on itself? I think in this situation it becomes multithreading rather than parallel computing.
Some devices, including most CPU devices, can be partitioned into sub-devices using the "cl_ext_device_fission" extension. When you use device fission, you still get parallel processing in that the host thread can do other tasks while the kernel is running on some of the CPU cores.
When not using device fission, the CPU device will essentially block the host program while a kernel is running. Even if some OpenCL implementations are non-blocking during kernel execution, the performance hit to the host would be too great to allow much work to be done by the host thread.
So it's still parallel computation, but I guess the host application's core is technically multi-threading during kernel execution.
It's parallel computing using multithreading. When you use an OpenCL CPU driver and enqueue kernels for the CPU device, the driver uses threads to execute the kernel, in order to fully leverage all of the CPU's cores (and also typically uses vector instructions like SSE to fully utilize each core).
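As a sketch of the device-fission idea mentioned above, here is the core OpenCL 1.2 clCreateSubDevices API (which subsumed the cl_ext_device_fission extension). `dev` is assumed to be the CPU device, the core count of 6 is arbitrary, and error checking is omitted:

```c
#include <CL/cl.h>

cl_device_id reserve_cpu_subdevice(cl_device_id dev)
{
    /* Partition off a sub-device with 6 compute units; the remaining
     * cores stay free for the host thread while kernels run. */
    cl_device_partition_property props[] = {
        CL_DEVICE_PARTITION_BY_COUNTS, 6,
        CL_DEVICE_PARTITION_BY_COUNTS_LIST_END, 0
    };
    cl_device_id sub = NULL;
    cl_uint n = 0;

    clCreateSubDevices(dev, props, 1, &sub, &n);
    /* Build a context and command queue on `sub` as usual. */
    return sub;
}
```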

Memory transfer between host and device in OpenCL?

Consider the following code, which creates a buffer memory object from an array of doubles of length size:
coef_mem = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, (sizeof(double) * size), arr, &err);
Consider it is passed as an arg for a kernel. There are 2 possibilities depending on the device on which the kernel is running:
The device is the same as the host device
The device is different from the host device
Here are my questions for both the possibilities:
At what step is the memory transferred to the device from the host?
How do I measure the time required for transferring the memory from host to device?
How do I measure the time required for transferring the memory from the device's global memory to private memory?
Is the memory still transferred if the device is the same as the host device?
Will the time required to transfer from host to device be greater than the time required for transferring from the device's global memory to private memory?
At what step is the memory transferred to the device from the host?
The only guarantee you have is that the data will be on the device by the time the kernel begins execution. The OpenCL specification deliberately doesn't mandate when these data transfers should happen, in order to allow different OpenCL implementations to make decisions that are suitable for their own hardware. If you only have a single device in the context, the transfer could be performed as soon as you create the buffer. In my experience, these transfers usually happen when the kernel is enqueued (or soon after), because that is when the implementation knows that it really needs the buffer on a particular device. But it really is completely up to the implementation.
How do I measure the time required for transferring the memory from host to device?
Use a profiler, which usually shows when these transfers happen and how long they take. If you transfer the data with clEnqueueWriteBuffer instead, you could use the OpenCL event profiling system.
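A sketch of the clEnqueueWriteBuffer plus event-profiling approach mentioned above. It assumes `queue` was created with CL_QUEUE_PROFILING_ENABLE, and that `coef_mem`, `arr`, and `size` are as in the question; error checking is omitted:

```c
#include <stdio.h>
#include <CL/cl.h>

void time_host_to_device(cl_command_queue queue, cl_mem coef_mem,
                         const double *arr, size_t size)
{
    cl_event ev;
    cl_ulong t_start, t_end;

    /* Blocking write so the event is complete when the call returns. */
    clEnqueueWriteBuffer(queue, coef_mem, CL_TRUE, 0,
                         sizeof(double) * size, arr, 0, NULL, &ev);

    /* Profiling timestamps are device-time values in nanoseconds. */
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                            sizeof(t_start), &t_start, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                            sizeof(t_end), &t_end, NULL);

    printf("transfer took %llu ns\n",
           (unsigned long long)(t_end - t_start));
    clReleaseEvent(ev);
}
```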
How do I measure the time required for transferring the memory from device's global memory to private memory?
Again, use a profiler. Most profilers will have a metric for the achieved bandwidth when reading from global memory, or something similar. It's not really an explicit transfer from global to private memory though.
Is the memory still transferred if the device is same as host device?
With CL_MEM_COPY_HOST_PTR, yes. If you don't want a transfer to happen, use CL_MEM_USE_HOST_PTR instead. With unified memory architectures (e.g. integrated GPU), the typical recommendation is to use CL_MEM_ALLOC_HOST_PTR to allocate a device buffer in host-accessible memory (usually pinned), and access it with clEnqueueMapBuffer.
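A sketch of that zero-copy pattern for unified-memory devices (hypothetical names; error checking omitted):

```c
#include <CL/cl.h>

cl_mem zero_copy_buffer(cl_context ctx, cl_command_queue queue,
                        size_t nbytes)
{
    cl_int err;
    /* The buffer is allocated in host-accessible (typically pinned)
     * memory, so no separate host<->device transfer is needed. */
    cl_mem buf = clCreateBuffer(ctx,
                                CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                nbytes, NULL, &err);

    /* Blocking map gives the CPU a direct pointer into the buffer. */
    double *p = (double *)clEnqueueMapBuffer(queue, buf, CL_TRUE,
                                             CL_MAP_WRITE, 0, nbytes,
                                             0, NULL, NULL, &err);
    /* ... fill p ... */

    /* Unmap before the device uses the buffer in a kernel. */
    clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
    return buf;
}
```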
Will the time required to transfer from host to device be greater than the time required for transferring from device's global memory to private memory?
Probably, but this will depend on the architecture, whether you have a unified memory system, and how you actually access the data in kernel (memory access patterns and caches will have a big effect).
