I'm porting some CUDA code to OpenCL. The CUDA code does something like this:
// GPU code...
__device__ int counter;
// CPU code...
int data;
cudaMemcpyFromSymbol(&data, counter, sizeof(int), 0, cudaMemcpyDeviceToHost);
What would be the equivalent in OpenCL? Similarly, how would I do this for an array? The only way I know involves allocating an extra buffer, copying to it on the GPU, then reading from it on the CPU.
Use the clEnqueueReadBuffer command, or with the C++ bindings:
T* host_buffer = nullptr; // host buffer
cl::Buffer device_buffer; // device buffer
cl::CommandQueue cl_queue; // command queue
cl_queue.enqueueReadBuffer(device_buffer, blocking, offset*sizeof(T), length*sizeof(T), (void*)(host_buffer+offset));
For a quick overview of OpenCL API calls, see the OpenCL Reference Card. If you want to make using OpenCL in C++ way easier and less bloated, see this OpenCL-Wrapper.
I'm new to HPC and I am curious about a point regarding the performance of MPI over Infiniband. For reference, I am using OpenMPI over two different machines connected through IB.
I've coded a very simple benchmark to see how fast I can transfer data over IB using MPI calls. Below you can see the code.
The issue is that when I run this, I get a throughput of ~1.4 gigabytes/s. However, when I use standard ib benchmarks like ib_write_bw, I get nearly 6 GB/s. What might account for this sizable discrepancy? Am I being naive about Gather, or is this just a result of OpenMPI overheads that I can't overcome?
In addition to the code, I am providing a plot to show the results of my simple benchmark.
Thanks in advance!
Results:
Code:
#include<iostream>
#include<mpi.h>
#include <stdint.h>
#include <ctime>
using namespace std;
void server(unsigned int size, unsigned int n) {
uint8_t* recv = new uint8_t[size * n];
uint8_t* send = new uint8_t[size];
std::clock_t s = std::clock();
MPI_Gather(send, size, MPI_CHAR, recv, size, MPI_CHAR, 0, MPI_COMM_WORLD);
std::clock_t e = std::clock();
cout<<size<<" "<<(e - s)/double(CLOCKS_PER_SEC)<<endl;
delete [] recv;
delete [] send;
}
void client(unsigned int size, unsigned int n) {
uint8_t* send = new uint8_t[size];
MPI_Gather(send, size, MPI_CHAR, NULL, 0, MPI_CHAR, 0, MPI_COMM_WORLD);
delete [] send;
}
int main(int argc, char **argv) {
int ierr, size, rank;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
cout<<"Rank "<<rank<<" of "<<size<<endl;
unsigned int min = 1, max = (1u << 31), n = 1000;
for (unsigned int i = 1; i < n; i++) {
unsigned int s = i * ((max - min) / n);
if(rank == 0) server(s, size); else client(s, size);
}
MPI_Finalize();
}
In your code you are executing a single collective operation per message size.
This involves huge overhead in comparison with tests that were written for performance measurement (e.g. ib_write_bw).
In general, comparing MPI collectives to ib_write_bw is not an apples-to-apples comparison:
RDMA opcode
ib_write_bw uses RDMA_WRITE operations, which don't use the CPU at all - once the initial handshake is done, it is pure RDMA, constrained only by network and PCIe capabilities.
MPI uses different RDMA opcodes for different collectives and different message sizes, and when you call it the way your code does, MPI has a lot of work to do for each message (hence the huge overhead).
Data overhead
ib_write_bw transfers almost pure data (there's a local routing header and a payload)
MPI has more data (headers) added to each packet to allow the receiver to identify the message
Zero copy
ib_write_bw is doing what is called "zero-copy" - data is sent from a user buffer directly, and written to a user buffer directly on the receiving side, w/o copying from/to buffers
MPI will copy the message from your client's buffer to its internal buffers on the sender side, then copy it again from its internal buffers on the receiving side to your server's buffer. Again, this behaviour depends on the message size and MPI configuration and MPI implementation, but you get the general idea.
Memory registration
ib_write_bw registers the required memory region and exchanges this info between client and server before starting measuring performance
If MPI needs to register a memory region during the collective, it does so while you are measuring time.
There are many more differences, even "small" things like warming up the cache lines on the HCAs.
So, now that we've covered why you shouldn't compare these things, here's what you should do:
There are two libraries that are regarded as a de-facto standard for MPI performance measurement:
IMB (Intel MPI Benchmark) - it says Intel, but it is written as a standard MPI application and will work with any MPI implementation.
OSU benchmarks - again, it says MVAPICH, but it will work with any MPI.
Download those, compile with your MPI, run your benchmarks, see what you get.
This is as high as you can get with MPI.
If you get much better results than with your small program (and you will for sure) - this is open source, so see how the pros are doing it :)
Have fun!
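To make the "single operation per message size" point concrete: benchmark suites typically run a few untimed warm-up calls and then average over many timed ones. Below is a minimal sketch of that pattern in plain C++. The helper name mean_seconds_per_call and its parameters are hypothetical (they are not part of MPI, IMB or OSU), and it uses std::chrono::steady_clock for wall-clock time, since std::clock reports processor time.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper (not part of MPI, IMB or OSU): calls `op` `warmup`
// times without timing it, then returns the mean wall-clock seconds per
// call over `iterations` timed calls.
template <typename Op>
double mean_seconds_per_call(Op&& op, std::size_t warmup, std::size_t iterations) {
    assert(iterations > 0);
    // Untimed calls: one-time costs (connection setup, memory registration,
    // cold caches) are paid here instead of inside the measurement.
    for (std::size_t i = 0; i < warmup; ++i) op();

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) op();
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

In the benchmark from the question, op would be a lambda wrapping the MPI_Gather call for one message size.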
You have to consider that the full payload size for the collective call received on rank 0 depends on the number of ranks. So with, say, 4 processes sending 1000 bytes you actually receive 4000 bytes on the root rank. That includes a memory copy from rank 0's input buffer into the output buffer (possibly with a detour through the network stack). And that is before you add the overheads of MPI and the lower networking protocols.
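The arithmetic can be sketched as follows. The function names are made up for illustration, and the split between "gathered" and "network" bytes assumes the root's own block is handled as a local copy, which is usual but depends on the MPI implementation.

```cpp
#include <cassert>
#include <cstdint>

// Bytes that end up in the root's receive buffer for one gather:
// every rank, including the root itself, contributes bytes_per_rank.
std::uint64_t gathered_bytes(std::uint64_t bytes_per_rank, std::uint64_t ranks) {
    return bytes_per_rank * ranks;
}

// Bytes that have to cross the network, assuming the root's own
// contribution is a local copy: only (ranks - 1) blocks travel.
std::uint64_t network_bytes(std::uint64_t bytes_per_rank, std::uint64_t ranks) {
    return ranks == 0 ? 0 : bytes_per_rank * (ranks - 1);
}

// Throughput in GB/s (10^9 bytes per second) for a measured duration.
double gigabytes_per_second(std::uint64_t bytes, double seconds) {
    return static_cast<double>(bytes) / seconds / 1e9;
}
```

With 4 ranks sending 1000 bytes each, gathered_bytes gives 4000 and network_bytes gives 3000, so which figure you divide by the elapsed time changes the throughput you report.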
Ok, so I have two Kernels that both take an input and an output image and do some meaningful operation:
#pragma OPENCL EXTENSION cl_khr_3d_image_writes : enable
kernel void Kernel1(read_only image3d_t input, write_only image3d_t output)
{
//read voxel and some surrounding voxels
//perform some operation
//write voxel
}
#pragma OPENCL EXTENSION cl_khr_3d_image_writes : enable
kernel void Kernel2(read_only image3d_t input, write_only image3d_t output)
{
//read voxel and some surrounding voxels
//perform some other operation
//write voxel
}
#pragma OPENCL EXTENSION cl_khr_3d_image_writes : enable
kernel void KernelCombined(read_only image3d_t input, write_only image3d_t output)
{
//read voxel and some surrounding voxels
//...
//perform operation of both kernels (without read, write)
//...
//write voxel
}
Now I want to chain the kernels in some cases, so what I could do is first call Kernel1 and then Kernel2. But that means I have unnecessary writes and reads in between. I could also write a third kernel that does both, but maintaining copy-pasted code seems annoying. I cannot really put the content of each kernel in a separate function, as I cannot pass around the image3d_t input, to my knowledge.
Question: Is there any clever way of chaining the two kernels? Is maybe OpenCL doing something clever already that I do not know?
Edit: Added example of what I would like to achieve.
I understand what you're asking for -- you wish to remove the image write / read cycle between kernels. With the kernels you described, this would not be efficient. In the existing kernels you "read voxel and some surrounding voxels" -- let's say that means reading 7 voxels. If you do the same read pattern in kernel 2 and 3, it's a total of 21 reads (and 3 writes). If somehow you chained these three kernels into a single kernel that wrote a single output voxel, it would need to read from many more source voxels to have the same result (because each read step was adding radius).
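To put numbers on how the read footprint grows, here is a small sketch. It assumes the "7 voxels" are the centre voxel plus its 6 face neighbours, which is only one possible reading of the question; the function name footprint is made up for illustration.

```cpp
#include <cassert>
#include <cstdlib>

// Counts the voxels (x, y, z) with |x| + |y| + |z| <= radius. For a
// 6-neighbour stencil (an assumption, see above) this is the number of
// source voxels one output voxel depends on after `radius` chained stages.
long footprint(int radius) {
    long count = 0;
    for (int x = -radius; x <= radius; ++x)
        for (int y = -radius; y <= radius; ++y)
            for (int z = -radius; z <= radius; ++z)
                if (std::abs(x) + std::abs(y) + std::abs(z) <= radius) ++count;
    return count;
}
```

Under that assumption, footprint(1) is 7, footprint(2) is 25 and footprint(3) is 63, so a fused three-stage kernel would read 63 source voxels per output voxel instead of 7 per stage, and it would also have to recompute the intermediate values.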
The scenario where kernel write/read chaining would be helpful would be for single-in/single-out kernels, like image processing where colors are modified independently of their neighbors. To do that you need a higher-level description of your kernels, and something that can generate the kernels you need based on the operations you have.
This is possible if you're using an OpenCL 2.0 capable device. enqueue_kernel allows a kernel to queue another, just like clEnqueueNDRangeKernel on the host.
If you're using OpenCL 1.2 (and probably all 1.x), you need to return to the host and call the next kernel (or have the next kernel already queued). You don't need to copy the buffer back to the host between kernels though, so at least you don't pay for the transfer multiple times.
As far as I understood from your description, you shouldn't do anything special and it will work even with OpenCL 1.2 just fine.
OpenCL command queues are in-order by default, and there is no need to transfer the data between the kernel calls.
Just leave the data on the device (don't do map/unmap and Read/Write), enqueue both kernels and wait until they are finished. Here is a code snippet of how it might look:
// Enqueue first kernel: reads "in", writes the intermediate image "tmp"
// ("tmp" is an extra device-side image; it never travels to the host)
clSetKernelArg(kernel1, 0, sizeof(cl_mem), &in);
clSetKernelArg(kernel1, 1, sizeof(cl_mem), &tmp);
clEnqueueNDRangeKernel(..., kernel1, ...);
// Enqueue second kernel: reads "tmp" (the first kernel's output), writes "out"
clSetKernelArg(kernel2, 0, sizeof(cl_mem), &tmp);
clSetKernelArg(kernel2, 1, sizeof(cl_mem), &out);
clEnqueueNDRangeKernel(..., kernel2, ...);
// Flush the queue and wait for the results
clFlush(...); // Start the execution
clFinish(...); // Wait until all operations in the queue are done
When using OOO (OUT OF ORDER) queues one can use Events (see last 3 params in clEnqueueNDRangeKernel) to specify the dependencies between the kernels and do clWaitForEvents at the end of your pipeline.
I'm learning OpenCL and trying to use it in a low-latency scenario, so I'm really concerned about memory transfer delay.
According to NVIDIA's OpenCL Best Practices Guide, and many other sources, direct reads/writes on buffer objects should be avoided. Instead, we should use map/unmap. The guide gives demonstration code like this:
cl_mem cmPinnedBufIn = clCreateBuffer(cxGPUContext, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, memSize, NULL, NULL);
cl_mem cmDevBufIn = clCreateBuffer(cxGPUContext, CL_MEM_READ_ONLY, memSize, NULL, NULL);
unsigned char* cDataIn = (unsigned char*) clEnqueueMapBuffer(cqCommandQue, cmPinnedBufIn, CL_TRUE, CL_MAP_WRITE, 0, memSize, 0, NULL, NULL, NULL);
for(unsigned int i = 0; i < memSize; i++)
{
cDataIn[i] = (unsigned char)(i & 0xff);
}
clEnqueueWriteBuffer(cqCommandQue, cmDevBufIn, CL_FALSE, 0, memSize, cDataIn, 0, NULL, NULL);
In this code snippet, two buffer objects are generated explicitly, and a write-to-device operation is also explicitly called.
If my understanding is correct, when you call clCreateBuffer with CL_MEM_ALLOC_HOST_PTR or CL_MEM_USE_HOST_PTR, the storage of the buffer object is created on the host side, probably in DMA-capable memory, and no storage is allocated on the device side. So the above code actually creates two separate storage areas. If so:
What would happen if I call map buffer on cmDevBufIn, which does not have host-side memory?
For CPU-integrated GPUs, there is no separate graphics memory. In particular, on newer AMD APUs the memory address space is also unified. So creating two buffer objects does not seem like a good idea. What is the best practice for integrated platforms?
Is there any way to write a single set of memory transfer code for different platforms? Or must I write several different sets of memory transfer code to achieve the best performance on NVIDIA, AMD discrete GPUs, old AMD APUs, new AMD APUs and Intel HD Graphics?
Unfortunately, it's different for each vendor.
NVIDIA claims their best bandwidth is when you use read/write buffer where the host memory is "pinned", which can be achieved by creating a buffer with CL_MEM_ALLOC_HOST_PTR and mapping it (I think your example is that). You should also compare that to just mapping and unmapping the device memory; their more recent drivers have gotten better at that.
With AMD you can just map/unmap the device buffer to get full speed. They also have a bunch of vendor-specific buffer flags which can make certain scenarios faster; you should study them but more importantly create benchmarks that try out everything to see what actually works best for your task.
With both discrete devices you should use separate command queues for the transfer operations so they can overlap with other (non-dependent) compute operations (look up various compute overlap examples). Furthermore, some higher end discrete GPUs can be downloading one buffer at the same time they are uploading another (using dual DMA engines), so you could be uploading one batch of work while you're computing another while you're downloading the result of a third. When written elegantly, this isn't even much more code than the strictly sequential version, but you have to use OpenCL events to synchronize between command queues. NVIDIA has a GTC talk you can watch that shows how to do this for video frames every 16 ms.
With AMD's APU and with Intel's Integrated Graphics, the map/unmap of the "device" buffer is "free" since it is in main memory. Don't use read/write buffer here or you'll be paying for unneeded transfers.
What is the best way (in any sense) of allocating memory for OpenCL output data? Is there a solution that works reasonably with both discrete and integrated graphics?
As a super-simplified example, consider the following C++ (host) code:
std::vector<float> generate_stuff(size_t num_elements) {
std::vector<float> result(num_elements);
for(size_t i = 0; i < num_elements; ++i)
result[i] = i;
return result;
}
This can be implemented using an OpenCL kernel:
__kernel void gen_stuff(__global float *result) {
result[get_global_id(0)] = get_global_id(0);
}
The most straightforward solution is to allocate an array on both the device and host, then copy after kernel finished:
std::vector<float> generate_stuff(size_t num_elements) {
//global context/kernel/queue objects set up appropriately
cl_mem result_dev = clCreateBuffer(context, CL_MEM_WRITE_ONLY, num_elements*sizeof(float), nullptr, nullptr);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &result_dev);
clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &num_elements, nullptr, 0, nullptr, nullptr);
std::vector<float> result(num_elements);
clEnqueueReadBuffer(queue, result_dev, CL_TRUE, 0, num_elements*sizeof(float), result.data(), 0, nullptr, nullptr);
return result;
}
This works reasonably with discrete cards. But with shared-memory graphics, this means allocating twice the memory and making an extra copy. How can one avoid this? One thing is for sure: one should drop clEnqueueReadBuffer and use clEnqueueMapBuffer/clEnqueueUnmapMemObject instead.
Some alternative scenarios:
Deal with an extra memory copy. Acceptable if memory bandwidth is not an issue.
Allocate a normal memory array on the host and use CL_MEM_USE_HOST_PTR when creating the buffer. The array should be allocated with device-specific alignment - it is 4k with Intel HD Graphics: https://software.intel.com/en-us/node/531272 I am not aware of a way to query this from the OpenCL environment. Results should be mapped (with CL_MAP_READ) after the kernel finishes to flush caches. But when is it possible to unmap? Immediately after mapping is finished (it seems that does not work with AMD discrete graphics)? Deallocation of the array also requires modification of client code on Windows (due to _aligned_free being different from free).
Allocate using CL_MEM_ALLOC_HOST_PTR and map after the kernel finishes. The cl_mem object has to be kept alive as long as the buffer is used (and probably even mapped?), so it requires polluting client code. Also, this keeps the array in pinned memory, which might be undesirable.
Allocate on the device without CL_MEM_*_HOST_PTR, and map it after the kernel finishes. This is the same thing as option 2 from deallocation's perspective; it just avoids pinned memory. (Actually, I'm not sure whether memory that is mapped isn't pinned.)
???
How are you dealing with this problem? Is there any vendor-specific solution?
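For the CL_MEM_USE_HOST_PTR scenario above, the alignment and the free/_aligned_free mismatch can be hidden behind a smart pointer that carries its own deleter. The following is a minimal sketch assuming C++17; the names AlignedDelete, AlignedFloats and allocate_aligned_floats are made up for illustration, and the 4096-byte default is the Intel HD Graphics figure quoted above, not something queried from OpenCL.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <new>

// Deleter that releases memory obtained from the aligned operator new below,
// so client code never has to know which "free" function matches.
struct AlignedDelete {
    std::size_t alignment;
    void operator()(float* p) const noexcept {
        ::operator delete(p, static_cast<std::align_val_t>(alignment));
    }
};

using AlignedFloats = std::unique_ptr<float[], AlignedDelete>;

// Allocates `count` uninitialized floats whose address is a multiple of
// `alignment` (which must be a power of two). Throws std::bad_alloc on failure.
AlignedFloats allocate_aligned_floats(std::size_t count, std::size_t alignment = 4096) {
    void* raw = ::operator new(count * sizeof(float),
                               static_cast<std::align_val_t>(alignment));
    return AlignedFloats(static_cast<float*>(raw), AlignedDelete{alignment});
}
```

The pointer returned by get() is what would be passed as host_ptr to clCreateBuffer together with CL_MEM_USE_HOST_PTR; the buffer object still has to be released before the memory is freed.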
You can do it with a single buffer, for both discrete and integrated hardware:
Allocate with CL_MEM_WRITE_ONLY (since your kernel only writes to the buffer). Optionally also use CL_MEM_ALLOC_HOST_PTR or vendor-specific (e.g., AMD) flags if it helps performance on certain platforms (read the vendor guidance and do benchmarking).
Enqueue your kernel that writes to the buffer.
clEnqueueMapBuffer with CL_MAP_READ and blocking. On discrete hardware this will copy over PCIe; on integrated hardware it's "free".
Use the results on the CPU using the returned pointer.
clEnqueueUnmapMemObject.
Depends on the use case:
For minimal memory footprint and IO efficiency: (Dithermaster's answer)
Create with CL_MEM_WRITE_ONLY flags, or maybe CL_MEM_ALLOC_HOST_PTR (depending on the platform). Do a blocking map for reading, use the data, then unmap it. This option requires that the data handler (consumer) knows about CL's existence and unmaps the buffer using CL calls.
For situations where you have to provide a buffer data to a third party (ie: libraries that need a C pointer, or class buffer, agnostic to CL):
In this case it may not be good to use mapped memory. Mapped memory access time is typically longer than for normal CPU memory. So, instead of mapping, doing a memcpy() and then unmapping, it is easier to perform a clEnqueueReadBuffer() directly to the CPU address where the output should be copied. With some vendors this does not use pinned memory and the copy is slow, so it is better to revert to option "1". But in some other cases where there is no pinned memory I found it faster.
Any other different condition for reading the kernel output? I think not...
In online compilation of OpenCL, we have to do...
program = clCreateProgramWithSource(context, 1, (const char **)&source_str, (const size_t *)&source_size, &ret);
But for offline creation of a program for OpenCL...
program = clCreateProgramWithBinary(context, 1, &device_id, (const size_t *)&binary_size, (const unsigned char **)&binary_buf, &binary_status, &ret);
where binary_buf is...
fread(binary_buf, 1, MAX_BINARY_SIZE, fp);
Hence in offline compilation, we can skip the clBuildProgram step, which makes this step faster. (Is this approach correct, that we can re-use again and again that binary for running the program?)
So, my question is: how do I create an OpenCL binary file so I can skip the step of building the CL program?
Once the program has been created you can use clGetProgramInfo with CL_PROGRAM_BINARY_SIZES and then CL_PROGRAM_BINARIES, storing the resulting binary programs (one for each device of the context) into a buffer you supply. You can then save this binary data to disk for use in later runs.
Not all devices might support binaries, so you will need to check the CL_PROGRAM_BINARY_SIZES result (it returns a zero size for that device if binaries are not supported).
To save time in the future (say in future runs of your application), you can use clCreateProgramWithBinary with the returned binaries. However, you will only ever want to do this with exactly the same hardware. Even when the graphics driver changes for the same hardware, you might want to throw the binary away and rebuild, since the OpenCL compiler in the new driver may have bug fixes and/or performance improvements.
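One way to get the "throw the binary away when hardware or driver changes" behaviour is to put both into the cache file name, so a stale binary is simply never found. The sketch below covers only the disk side in plain C++; the function names and the file naming scheme are made up for illustration. The two strings are assumed to come from clGetDeviceInfo with CL_DEVICE_NAME and CL_DRIVER_VERSION, and the bytes from clGetProgramInfo with CL_PROGRAM_BINARIES.

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical naming scheme: the name changes whenever the device name or
// driver version changes. Characters that are awkward in file names become '_'.
std::string cache_file_name(const std::string& device_name,
                            const std::string& driver_version) {
    std::string name = "clcache_" + device_name + "_" + driver_version + ".bin";
    for (char& c : name) {
        const bool safe = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                          (c >= '0' && c <= '9') || c == '.' || c == '_' || c == '-';
        if (!safe) c = '_';
    }
    return name;
}

// Writes the program binary to `path`; returns false if the write failed.
bool save_binary(const std::string& path, const std::vector<unsigned char>& binary) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    out.write(reinterpret_cast<const char*>(binary.data()),
              static_cast<std::streamsize>(binary.size()));
    out.flush();
    return static_cast<bool>(out);
}

// Returns the file contents, or an empty vector on a cache miss (no such file).
std::vector<unsigned char> load_binary(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return {};
    return std::vector<unsigned char>(std::istreambuf_iterator<char>(in),
                                      std::istreambuf_iterator<char>{});
}
```

On an empty result the application would fall back to clCreateProgramWithSource and a normal build, then call save_binary for the next run.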