OpenCL 'non-blocking' reads have higher cost than expected

Consider the following code, which enqueues between 1 and 100000 'non-blocking' random access buffer reads and measures the time:
#define __CL_ENABLE_EXCEPTIONS
#include <CL/cl.hpp>
#include <vector>
#include <iostream>
#include <chrono>
#include <stdio.h>
#include <string>
static const int size = 100000;
int host_buf[size];
int main() {
cl::Context ctx(CL_DEVICE_TYPE_DEFAULT, nullptr, nullptr, nullptr);
std::vector<cl::Device> devices;
ctx.getInfo(CL_CONTEXT_DEVICES, &devices);
printf("Using OpenCL devices: \n");
for (auto &dev : devices) {
std::string dev_name = dev.getInfo<CL_DEVICE_NAME>();
printf(" %s\n", dev_name.c_str());
}
cl::CommandQueue queue(ctx);
cl::Buffer gpu_buf(ctx, CL_MEM_READ_WRITE, sizeof(int) * size, nullptr, nullptr);
std::vector<int> values(size);
// Warmup
queue.enqueueReadBuffer(gpu_buf, false, 0, sizeof(int), &(host_buf[0]));
queue.finish();
// Run from 1 to 100000 sized chunks
for (int k = 1; k <= size; k *= 10) {
auto cstart = std::chrono::high_resolution_clock::now();
for (int j = 0; j < k; j++)
queue.enqueueReadBuffer(gpu_buf, false, sizeof(int) * (j * (size / k)), sizeof(int), &(host_buf[j]));
queue.finish();
auto cend = std::chrono::high_resolution_clock::now();
double time = std::chrono::duration<double>(cend - cstart).count() * 1000000.0;
printf("%8d: %8.02f us\n", k, time);
}
return 0;
}
As always, there is some random variation but the typical output for me is like this:
1: 10.03 us
10: 107.93 us
100: 794.54 us
1000: 8301.35 us
10000: 83741.06 us
100000: 981607.26 us
Whilst I did expect a relatively high latency for a single read, given the need for a PCIe round trip, I am surprised at the high cost of adding subsequent reads to the queue - as if there isn't really a 'queue' at all but each read adds the full latency penalty. This is on a GTX 960 with Linux and driver version 455.45.01.
Is this expected behavior?
Do other GPUs behave the same way?
Is there any workaround other than always doing random-access reads from inside a kernel?

You are using a single in-order command queue, so all enqueued reads are performed sequentially by the hardware / driver.
The 'non-blocking' flag only means that the enqueue call itself is asynchronous and does not block your host code while the GPU is working.
In your code, queue.finish() (clFinish) then blocks until all reads are done.
So yes, this is the expected behavior. You will pay the full time penalty for each DMA transfer.
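As a rough check of the "full penalty per transfer" reading, the timings quoted in the question can be divided by the number of reads. The sketch below does only that arithmetic; the helper names are made up for illustration, and the numbers are the ones posted above. The per-read cost stays between roughly 8 and 11 microseconds at every queue depth, which is what sequential execution would predict.

```cpp
#include <cassert>
#include <cstdio>

// Average cost of one enqueued read, in microseconds.
double per_read_us(double total_us, int reads) {
    assert(reads > 0);
    return total_us / reads;
}

// Prints the per-read cost for each row of the timings quoted in the question.
void print_per_read_costs() {
    const int reads[] = {1, 10, 100, 1000, 10000, 100000};
    const double total_us[] = {10.03, 107.93, 794.54, 8301.35, 83741.06, 981607.26};
    for (int i = 0; i < 6; ++i) {
        std::printf("%8d reads: %6.2f us per read\n", reads[i],
                    per_read_us(total_us[i], reads[i]));
    }
}
```

If the reads overlapped, the per-read figure would be expected to fall as the count grows; here it does not.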
As long as you create an in-order command queue (the default), other GPUs will behave the same.
If your hardware / driver supports out-of-order queues, you could use them to potentially overlap DMA transfers. Alternatively, you could use multiple in-order queues. The performance is, of course, hardware and driver dependent.
Using multiple queues or out-of-order queues is a bit more advanced. Make sure you use events properly to avoid race conditions and undefined behavior.
To reduce the latency associated with GPU-host DMA transfers, it is recommended that you use a pinned host buffer rather than pageable host memory such as a plain array or std::vector. Pinned host buffers are usually created via clCreateBuffer with the CL_MEM_ALLOC_HOST_PTR flag.
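One further option, which is not part of the answer above and is only a sketch under assumptions: if the whole buffer is small enough, a single bulk read followed by a gather on the host pays the transfer latency once. The function below shows just the host-side gather, using the same index arithmetic as the benchmark loop (element j * (size / k) goes to slot j). The name gather_strided and the snapshot vector are hypothetical, and the OpenCL read that would fill the snapshot is not shown. Whether this beats many small reads depends on buffer size and hardware.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Picks k evenly spaced elements out of a host-side copy of the device buffer.
// `snapshot` stands for the data returned by ONE bulk read of the whole buffer
// (hypothetical here; the OpenCL call itself is not shown).
std::vector<int> gather_strided(const std::vector<int>& snapshot, std::size_t k) {
    assert(k > 0 && k <= snapshot.size());
    const std::size_t stride = snapshot.size() / k;  // same spacing as size / k in the question
    std::vector<int> out(k);
    for (std::size_t j = 0; j < k; ++j) {
        out[j] = snapshot[j * stride];  // element that read number j would have fetched
    }
    return out;
}
```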

Related

Performance degrades a lot by using mpirun to execute my program

I am new to MPI. My program uses the Intel Math Kernel Library, and I want to compute a matrix-matrix multiplication by blocks: I split the large matrix X into many small matrices along the columns, as shown below. The matrix is large, so each time I only compute (N, M) x (M, N), where I can set M manually.
XX^T = X_1X_1^T + X_2X_2^T + ... + X_nX_n^T
I first set the total number of threads to 16 and M to 1024, then ran my program directly as follows. I checked the CPU state and found that the CPU usage was 1600%, which is normal.
./MMNET_MPI --block 1024 --numThreads 16
However, when I ran my program through mpirun as follows, the CPU usage was only 200-300%. Strangely, when I changed the block size to 64, the CPU usage improved somewhat, to 1200%.
mpirun -n 1 --bind-to none ./MMNET_MPI --block 1024 --numThreads 16
I do not know what the problem is. It seems that mpirun applies some default setting that affects my program. The following is part of my matrix multiplication code. The #pragma omp parallel for is meant to extract the small N by M matrix from its compressed format in parallel. After that I use cblas_dgemv to compute the product.
#include "MemoryUtils.h"
#include "Timer.h"
#include "omp.h"
#include <mpi.h>
#include <mkl.h>
#include <iostream>
using namespace std;
int main(int argc, char** argv) {
omp_set_num_threads(16);
Timer timer;
double start_time = timer.get_time();
MPI_Init(&argc, &argv);
int total_process;
int id;
MPI_Comm_size(MPI_COMM_WORLD, &total_process);
MPI_Comm_rank(MPI_COMM_WORLD, &id);
if (id == 0) {
cout << "========== Testing MPI properties for MMNET ==========" << endl;
}
cout << "Initialize the random matrix ..." << endl;
unsigned long N = 30000;
unsigned long M = 500000;
unsigned long snpsPerBlock = 1024;
auto* matrix = ALIGN_ALLOCATE_DOUBLES(N*M);
auto* vector = ALIGN_ALLOCATE_DOUBLES(N);
auto* result = ALIGN_ALLOCATE_DOUBLES(M);
auto *temp1 = ALIGN_ALLOCATE_DOUBLES(snpsPerBlock);
memset(result, 0, sizeof(double) * M);
cout << "Time for allocating is " << timer.update_time() << " sec" << endl;
// memset writes a single byte value, so it cannot store 1.1234; std::fill_n (from <algorithm>) assigns real doubles
std::fill_n(matrix, N * M, 1.1234);
std::fill_n(vector, N, 1.5678);
// #pragma omp parallel for
// for (unsigned long row = 0; row < N * M; row++) {
// matrix[row] = (double)rand() / RAND_MAX;
// }
// #pragma omp parallel for
// for (unsigned long row = 0; row < N; row++) {
// vector[row] = (double)rand() / RAND_MAX;
// }
cout << "Time for generating data is " << timer.update_time() << " sec" << endl;
cout << "Starting calculating..." << endl;
for (unsigned long m0 = 0; m0 < M; m0 += snpsPerBlock) {
uint64 snpsPerBLockCrop = std::min(M, m0 + snpsPerBlock) - m0;
auto* snpBlock = matrix + m0 * N;
MKL_INT row = N;
MKL_INT col = snpsPerBLockCrop;
double alpha = 1.0;
MKL_INT lda = N;
MKL_INT incx = 1;
double beta = 0.0;
MKL_INT incy = 1;
cblas_dgemv(CblasColMajor, CblasTrans, row, col, alpha, snpBlock, lda, vector, incx, beta, temp1, incy);
// compute XA
double beta1 = 1.0;
cblas_dgemv(CblasColMajor, CblasNoTrans, row, col, alpha, snpBlock, lda, temp1, incx, beta1, result, incy);
}
cout << "Time for computation is " << timer.update_time() << " sec" << endl;
ALIGN_FREE(matrix);
ALIGN_FREE(vector);
ALIGN_FREE(result);
ALIGN_FREE(temp1);
MPI_Finalize();
return 0;
}
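The blocked loop in the listing can be read as computing result = X X^T v one column block at a time: temp = X_b^T v, then result += X_b temp. The sketch below reproduces that structure on a small column-major matrix, with plain loops standing in for the two cblas_dgemv calls. The function name blocked_xxt_v is made up for illustration, and this is a reference for checking the arithmetic, not a replacement for MKL.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Computes y = X * X^T * v for a column-major N x M matrix X,
// processing `block` columns at a time as the question's loop does.
std::vector<double> blocked_xxt_v(const std::vector<double>& X, std::size_t N, std::size_t M,
                                  const std::vector<double>& v, std::size_t block) {
    assert(X.size() == N * M && v.size() == N && block > 0);
    std::vector<double> result(N, 0.0);
    std::vector<double> temp(block);
    for (std::size_t m0 = 0; m0 < M; m0 += block) {
        const std::size_t cols = std::min(M, m0 + block) - m0;  // last block may be shorter
        const double* Xb = X.data() + m0 * N;                   // start of this column block
        // temp = Xb^T * v (the CblasTrans call in the listing)
        for (std::size_t c = 0; c < cols; ++c) {
            double sum = 0.0;
            for (std::size_t r = 0; r < N; ++r) sum += Xb[c * N + r] * v[r];
            temp[c] = sum;
        }
        // result += Xb * temp (the CblasNoTrans call with beta = 1)
        for (std::size_t c = 0; c < cols; ++c) {
            for (std::size_t r = 0; r < N; ++r) result[r] += Xb[c * N + r] * temp[c];
        }
    }
    return result;
}
```

For exactly representable inputs the result does not depend on the block size, so a small case can be used to confirm that a change of block size leaves the output unchanged.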
My CPU information is as follows.
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 44
On-line CPU(s) list: 0-43
Thread(s) per core: 1
Core(s) per socket: 22
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6152 CPU @ 2.10GHz
Stepping: 4
CPU MHz: 1252.786
CPU max MHz: 2101.0000
CPU min MHz: 1000.0000
BogoMIPS: 4200.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 30976K
NUMA node0 CPU(s): 0-21
NUMA node1 CPU(s): 22-43
MKL by default implements some intelligent dynamic selection of the number of threads to use. This is controlled by the variable MKL_DYNAMIC, which is set to TRUE by default. The documentation for MKL states:
If you [sic] are able to detect the presence of MPI, but cannot determine if it has been called in a thread-safe mode (it is impossible to detect this with MPICH 1.2.x, for instance), and MKL_DYNAMIC has not been changed from its default value of TRUE, Intel MKL will run one thread.
Since you call MPI_Init() and not MPI_Init_thread() to initialise MPI, you are effectively asking for single-threaded MPI level (MPI_THREAD_SINGLE). The library is free to provide you any threading level and it will conservatively stick to MPI_THREAD_SINGLE. You can check that by calling MPI_Query_thread(&provided) after the initialisation and see if the output value is greater than MPI_THREAD_SINGLE.
Since you are mixing OpenMP and threaded MKL with MPI, you should really tell MPI to initialise at a higher threading support level by calling MPI_Init_thread():
int provided;
MPI_Init_thread(NULL, NULL, MPI_THREAD_MULTIPLE, &provided);
// This ensures that MPI actually provides MPI_THREAD_MULTIPLE
if (provided < MPI_THREAD_MULTIPLE) {
// Complain
}
(Technically, MPI_THREAD_FUNNELED is enough if you do not make MPI calls from outside the main thread, but that is not a thread-safe mode as MKL understands it.)
Even if you request a certain thread support level from MPI, there is no guarantee that you will get it, which is why you have to examine the provided level. Also, older Open MPI versions must explicitly be built with such support: the default is to build without support for MPI_THREAD_MULTIPLE, as some network modules are not thread-safe. You can check whether that is the case by running ompi_info and looking for a line similar to this one:
Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes, OMPI progress: no, ORTE progress: yes, Event lib: yes)
Now, the reality is that most threaded software that does not make MPI calls outside the main thread runs perfectly fine even if MPI does not provide higher level of thread support than MPI_THREAD_SINGLE, i.e., with most MPI implementations MPI_THREAD_SINGLE is equivalent to MPI_THREAD_FUNNELED. In that case, setting MKL_DYNAMIC to FALSE should make MKL behave as when run without mpirun:
mpirun -x MKL_DYNAMIC=FALSE ...
In any case, since your program accepts the number of threads as an argument, simply call both mkl_set_num_threads() and omp_set_num_threads() and do not rely on magical default mechanisms.
Edit: Enabling full thread support has consequences: latency increases and some network modules may refuse to work, for example the InfiniBand module in older Open MPI versions, which results in the library quietly switching to slower transports such as TCP/IP. It is better to request MPI_THREAD_FUNNELED and explicitly set the number of MKL and OpenMP threads.

Arduino IDE - global variables storage in RAM or flash memory

Conventional wisdom has it that global and static data is stored in the bottom of RAM along with other stuff. Somewhere above that is the heap, then free memory and at the top of RAM, is the stack. See the code below before reading on.
When I compile this with the Arduino IDE (1.8.10) on a MEGA2560 I get the following statistics:
Sketch uses 2750 bytes (1%) of program storage space. Maximum is 253952 bytes.
Global variables use 198 bytes (2%) of dynamic memory, leaving 7994 bytes for local variables. Maximum is 8192 bytes.
If I change ARRAY_SIZE from 1 to 7001, I get exactly the same numbers. I expected that dynamic memory should increase by 7000. When I do the same comparison with AtmelStudio V7, dynamic memory does indeed increase by 7000.
One more piece of information along these lines. If I do a malloc of 7000 which is pretty close to free memory, one would expect that the malloc should be successful when ARRAY_SIZE equals one and would fail when the ARRAY_SIZE equals 7001. I was dismayed to find that the malloc was successful with both the small and the large array sizes. With AtmelStudio this does not happen.
I suspect respective compiler/linker options somewhere could explain the difference (AtmelStudio - project properties and Arduino IDE - platform.txt perhaps?).
I also suspect that the Arduino IDE dynamically allocates global variables in FlashMemory.
I am not a newbie, but I am not a guru - comments anyone?
Thanks
Michèle
#define ARRAY_SIZE 7001
uint8_t globalArray[ARRAY_SIZE]{1};
void setup() {
Serial.begin(115200);
for (int i = 0; i < ARRAY_SIZE; i++) globalArray[i] = 1;
Serial.print(F("Just initialized globalArray, size = "));Serial.println(ARRAY_SIZE);
uint8_t* testPointer = (uint8_t*) malloc(7000);
Serial.print(F("Allocation of 7000 bytes "));
if ( testPointer != (uint8_t*) 0) {
Serial.print(F("SUCCESSFUL"));
} else {
Serial.print(F("NOT SUCCESSFUL"));
}
} // setup
void loop() {} // loop
I ran some more tests to figure out why AtmelStudio and the Arduino IDE report vastly different RAM usage after declaring an array. The response from juraj (thank you) was that the compiler optimized the unused code away. That answer was correct; however, I had included an array initialization loop to make sure that the compiler would keep the array in the code.
It turns out that AtmelStudio and the Arduino IDE have different criteria for what "being used" means. The outcome is that globalArray, in the initialization line,
for (int i = 0; i < ARRAY_SIZE; i++) globalArray[i] = 1;
is considered by AtmelStudio as being used and by the Arduino IDE as not being used.
The Arduino IDE requires that globalArray be read, that is, appear on the right-hand side of an assignment, before it considers the array used; hence the need for the "a+=globalArray[i];" line. The exact same code below reports:
With the a+=globalArray[i]; line commented out:
Atmel Studio: Data Memory Usage 7211 bytes
Arduino IDE: Global variables use 198 bytes
With the a+=globalArray[i]; line active:
Atmel Studio: Data Memory Usage 7211 bytes
Arduino IDE: Global variables use 7199 bytes
Q.E.D. Interesting how the two IDEs do not quite mean the same thing by "being used".
Thanks - My first time on this forum got my question answered rather quickly.
Michèle
#define ARRAY_SIZE 7001
uint8_t globalArray[ARRAY_SIZE];
void setup() {
Serial.begin(115200);
for (int i = 0; i < ARRAY_SIZE; i++) globalArray[i] = 1;
Serial.print(F("Just initialized globalArray, size = "));
Serial.println(ARRAY_SIZE);
// uint16_t a {0};
// for (int i = 0; i < ARRAY_SIZE; i++) a+=globalArray[i];
// Serial.print(F("Value of a = ")); Serial.println(a);
uint8_t* testPointer = (uint8_t*) malloc(7000);
Serial.print(F("Allocation of 7000 bytes "));
if ( testPointer != (uint8_t*) 0) Serial.print(F("SUCCESSFUL"));
else Serial.print(F("NOT SUCCESSFUL"));
} // setup

Is it defined to write to the same buffer from different kernels?

I have OpenCL 1.1, one device, and an out-of-order execution command queue,
and I want multiple kernels to write their results into one buffer, each to a different, non-overlapping, arbitrary region.
Is this possible?
cl::CommandQueue commandQueue(context, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE);
cl::Buffer buf_as(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, data_size, &as[0]);
cl::Buffer buf_bs(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, data_size, &bs[0]);
cl::Buffer buf_rs(context, CL_MEM_WRITE_ONLY, data_size, NULL);
cl::Kernel kernel(program, "dist");
kernel.setArg(0, buf_as);
kernel.setArg(1, buf_bs);
int const N = 4;
int const d = data_size / N;
std::vector<cl::Event> events(N);
for(int i = 0; i != N; ++i) {
int const beg = d * i;
int const len = d;
kernel.setArg(2, beg);
kernel.setArg(3, len);
commandQueue.enqueueNDRangeKernel(kernel, NULL, cl::NDRange(block_size_x), cl::NDRange(block_size_x), NULL, &events[i]);
}
commandQueue.enqueueReadBuffer(buf_rs, CL_FALSE, 0, data_size, &rs[0], &events, NULL);
commandQueue.finish();
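The loop above assigns each launch the region beg = d * i, len = d with d = data_size / N. A small host-side sketch of that index arithmetic follows; the Region struct and both helper names are made up for illustration. It checks only that the regions are disjoint and cover the buffer, and it lets the last region absorb the remainder, since integer division would otherwise leave a tail unwritten when data_size is not a multiple of N. It says nothing about whether concurrent writes from separate kernels are defined, which is the question the answers below address.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Region {
    std::size_t beg;  // first index written by one kernel launch
    std::size_t len;  // number of elements written by that launch
};

// Splits [0, total) into n contiguous regions; the last one takes the remainder.
std::vector<Region> split_regions(std::size_t total, std::size_t n) {
    assert(n > 0);
    const std::size_t d = total / n;
    std::vector<Region> regions;
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t len = (i == n - 1) ? total - d * i : d;
        regions.push_back(Region{d * i, len});
    }
    return regions;
}

// True when the regions are in order, do not overlap, and cover exactly [0, total).
bool is_exact_partition(const std::vector<Region>& regions, std::size_t total) {
    std::size_t next = 0;
    for (const Region& r : regions) {
        if (r.beg != next) return false;
        next += r.len;
    }
    return next == total;
}
```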
I wanted to give an official committee response to this. We realise the specification is ambiguous and have made modifications to rectify this.
This is not guaranteed under OpenCL 1.x or indeed 2.0 rules. cl_mem objects are only guaranteed to be consistent at synchronization points, even when processed only on a single device and even when used by OpenCL 2.0 kernels using memory_scope_device.
Multiple child kernels of an OpenCL 2.0 parent kernel can share the parent's cl_mem objects at device scope.
Coarse-grained SVM objects can be shared at device scope between multiple kernels, as long as the memory locations written to are not overlapping.
The writes should work fine if the global memory addresses are non-overlapping as you have described. Just make sure both kernels are finished before reading the results back to the host.
I don't think it is defined. Although you say you are writing to non-overlapping regions at the software level, it is not guaranteed that at the hardware level the accesses won't map onto same cache lines - in which case you'll have multiple modified versions flying around.

Simple Vector Geometric Progression Design in OpenCL

I'm new to OpenCL and in order to get a better grasp of a few concepts I contrived a simple example of a geometric progression as follows (emphasis on contrived):
An array of N values and N coefficients (whose values could be anything, but in the example they all are the same) are allocated.
M steps are performed in sequence where each value in the values array is multiplied by its corresponding coefficient in the coefficients array and assigned as the new value in the values array. Each step needs to fully complete before the next step can complete. I know this part is a bit contrived, but this is a requirement I want to enforce to help my understanding of OpenCL.
I'm only interested in the values in the values array after the final step has completed.
Here is the very simple OpenCL kernel (MultiplyVectors.cl):
__kernel void MultiplyVectors (__global float4* x, __global float4* y, __global float4* result)
{
int i = get_global_id(0);
result[i] = x[i] * y[i];
}
And here is the host program (main.cpp):
#include <CL/cl.hpp>
#include <vector>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
int main ()
{
auto context = cl::Context (CL_DEVICE_TYPE_GPU);
auto *sourceFile = fopen("MultiplyVectors.cl", "r");
if (sourceFile == nullptr)
{
perror("Couldn't open the source file");
return 1;
}
fseek(sourceFile, 0, SEEK_END);
const auto sourceSize = ftell(sourceFile);
auto *sourceBuffer = new char [sourceSize + 1];
sourceBuffer[sourceSize] = '\0';
rewind(sourceFile);
fread(sourceBuffer, sizeof(char), sourceSize, sourceFile);
fclose(sourceFile);
auto program = cl::Program (context, cl::Program::Sources {std::make_pair (sourceBuffer, sourceSize + 1)});
delete[] sourceBuffer;
const auto devices = context.getInfo<CL_CONTEXT_DEVICES> ();
program.build (devices);
auto kernel = cl::Kernel (program, "MultiplyVectors");
const size_t vectorSize = 1024;
float coeffs[vectorSize] {};
for (size_t i = 0; i < vectorSize; ++i)
{
coeffs[i] = 1.000001;
}
auto coeffsBuffer = cl::Buffer (context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof (coeffs), coeffs);
float values[vectorSize] {};
for (size_t i = 0; i < vectorSize; ++i)
{
values[i] = static_cast<float> (i);
}
auto valuesBuffer = cl::Buffer (context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, sizeof (values), values);
kernel.setArg (0, coeffsBuffer);
kernel.setArg (1, valuesBuffer);
kernel.setArg (2, valuesBuffer);
auto commandQueue = cl::CommandQueue (context, devices[0]);
for (size_t i = 0; i < 1000000; ++i)
{
commandQueue.enqueueNDRangeKernel (kernel, cl::NDRange (0), cl::NDRange (vectorSize / 4), cl::NullRange);
}
printf ("All kernels enqueued. Waiting to read buffer after last kernel...");
commandQueue.enqueueReadBuffer (valuesBuffer, CL_TRUE, 0, sizeof (values), values);
return 0;
}
What I'm basically asking is for advice on how to best optimize this OpenCL program to run on a GPU. I have the following questions based on my limited OpenCL experience to get the conversation going:
Could I be handling the buffers better? I'd like to minimize any unnecessary ferrying of data between the host and the GPU.
What's the optimal work group configuration (in general at least, I know this can vary by GPU)? I'm not actually sharing any data between work items and it doesn't seem like I'd benefit from work groups much here, but just in case.
Should I be allocating and loading anything into local memory for a work group (if that would make sense at all)?
I'm currently enqueuing one kernel for each step, which will create a work item for each 4 floats to take advantage of a hypothetical GPU with a SIMD width of 128 bits. I'm attempting to enqueue all of this asynchronously at once (although I'm noticing the Nvidia implementation I have seems to block each enqueue until the kernel is complete) and then wait on the final one to complete. Is there a better approach to this that I'm missing?
Is there a design that would allow for only one call to enqueueNDRangeKernel (instead of one call per step) while maintaining the ability for each step to be efficiently processed in parallel?
Obviously I know that the example problem I'm solving can be done in much better ways, but I wanted to have as simple of an example as possible that illustrated a vector of values being operated on in a series of steps where each step has to be completed fully before the next. Any help and pointers on how to best go about this would be greatly appreciated.
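On the last question, one observation may help, offered as a host-side sketch and not as a tested OpenCL design: in this particular example each element depends only on its own previous value, so running all M steps for one element before moving to the next gives the same result as running one step over all elements at a time. The two functions below (names made up for illustration) compare the two loop orders in plain C++. The first mirrors one kernel launch per step; the second mirrors a single launch in which each work-item loops over the steps. This only holds while steps do not read other elements.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Step-major order: one pass over all elements per step (one launch per step).
std::vector<float> run_step_major(std::vector<float> values, const std::vector<float>& coeffs,
                                  std::size_t steps) {
    assert(values.size() == coeffs.size());
    for (std::size_t s = 0; s < steps; ++s) {
        for (std::size_t i = 0; i < values.size(); ++i) {
            values[i] = coeffs[i] * values[i];
        }
    }
    return values;
}

// Element-major order: each element runs all of its steps (a loop inside one work-item).
std::vector<float> run_element_major(std::vector<float> values, const std::vector<float>& coeffs,
                                     std::size_t steps) {
    assert(values.size() == coeffs.size());
    for (std::size_t i = 0; i < values.size(); ++i) {
        float v = values[i];
        for (std::size_t s = 0; s < steps; ++s) {
            v = coeffs[i] * v;
        }
        values[i] = v;
    }
    return values;
}
```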
Thanks!

Asynchronous execution of commands from two command queues in OpenCL

I am trying to build an application that uses both the CPU and the GPU at the same time through OpenCL. Specifically, I have two kernels, one to run on the CPU and one on the GPU. The CPU kernel changes the content of one buffer, and the GPU kernel does its own work once it detects that the buffer has been changed by the CPU.
__kernel void cpuKernel(__global uint * dst1, const uint size)
{
    uint tid = get_global_id(0);
    // Renamed from 'size': a local with that name would redeclare the kernel parameter
    uint stride = get_global_size(0);
    while(tid < size)
    {
        atomic_xchg(&dst1[tid], 10);
        tid += stride;
    }
}
__kernel void gpuKernel(__global uint * dst1, __global uint * dst2, const uint size)
{
    uint tid = get_global_id(0);
    uint stride = get_global_size(0);
    // vectorSize and vectorOffset are not defined in the posted snippet
    while(tid < vectorSize)
    {
        // Spin until the CPU kernel has written this element
        while(dst1[vectorOffset + tid] != 10)
            ;
        dst2[vectorOffset + tid] = dst1[vectorOffset + tid];
        tid += stride;
    }
}
As shown above, cpuKernel sets each element of the dst1 buffer to 10; once the GPU detects the change, it copies the element value (10) to the same position in another buffer, dst2. cpuKernel is enqueued on command1, which is associated with the CPU device, and gpuKernel is enqueued on command2, which is associated with the GPU device. Both command queues were created with the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE flag.
Then I tried two cases:
case 1:
clEnqueueNDRangeKernel(command2,gpuKernel);
clEnqueueNDRangeKernel(command1,cpuKernel);
clFinish(command1);
clFinish(command2);
case 2:
clEnqueueNDRangeKernel(command1,cpuKernel);
clFinish(command1);
clEnqueueNDRangeKernel(command2,gpuKernel);
clFinish(command2);
The results show that the time consumed in the two cases is nearly the same. I expected some overlap in case 1, but there is none. Can anyone help me? Thanks!
Or can anyone explain how to run two kernels on two devices asynchronously in OpenCL?
You are asking too much. As you have probably noticed, buffer objects belong to a context, while command queues are tied to devices.
If a kernel operates on a buffer object, the corresponding data must be on that device. If you do not transfer it explicitly with clEnqueueWriteBuffer(), OpenCL will do that for you.
Hence, if you modify a buffer object with a kernel on one device (for example the CPU) and then immediately use it on another device (for example the GPU), the OpenCL driver will wait for the first kernel to finish, transfer the data, and then run the second kernel.
