For online compilation in OpenCL, we have to do:
program = clCreateProgramWithSource(context, 1, (const char **)&source_str, (const size_t *)&source_size, &ret);
But for offline creation of an OpenCL program:
program = clCreateProgramWithBinary(context, 1, &device_id, (const size_t *)&binary_size, (const unsigned char **)&binary_buf, &binary_status, &ret);
where binary_buf is...
fread(binary_buf, 1, MAX_BINARY_SIZE, fp);
Hence, with offline compilation we can skip the clBuildProgram step, which makes program creation faster. (Is this approach correct, i.e. can we reuse that binary again and again to run the program?)
So my question is: how do I create the OpenCL binary file so that I can skip the step of building the CL program?
Once the program has been built you can use clGetProgramInfo with CL_PROGRAM_BINARY_SIZES and then CL_PROGRAM_BINARIES, storing the resulting binary programs (one for each device of the context) into a buffer you supply. You can then save this binary data to disk for use in later runs.
Not all devices might support binaries, so you will need to check the CL_PROGRAM_BINARY_SIZES result (it returns a zero size for that device if binaries are not supported).
To save time in the future (say in future runs of your application), you can use clCreateProgramWithBinary with the returned binaries. However, you will only ever want to do this with exactly the same hardware. Even when the graphics driver changes for the same hardware, you might want to throw the binary away and rebuild since there is potential the OpenCL compiler in the new driver has bugfixes and/or performance improvements.
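As a rough sketch for a single-device context (the file name is just an example, error checking is omitted, and a multi-device context would need one size and one binary pointer per device):
size_t binary_size = 0;
clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(size_t), &binary_size, NULL);
if (binary_size > 0) {                               /* zero means no binary for this device */
    unsigned char *binary = (unsigned char *)malloc(binary_size);
    unsigned char *binaries[1] = { binary };         /* CL_PROGRAM_BINARIES wants one pointer per device */
    clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(binaries), binaries, NULL);
    FILE *fp = fopen("kernel.bin", "wb");            /* example file name */
    fwrite(binary, 1, binary_size, fp);
    fclose(fp);
    free(binary);
}
Note that even when the program is later created with clCreateProgramWithBinary, clBuildProgram still has to be called before kernels can be created; it just finishes much faster because the expensive source compilation is skipped.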
I'm porting some CUDA code to OpenCL. The CUDA code does something like this:
// GPU code...
__device__ int counter;
// CPU code...
int data;
cudaMemcpyFromSymbol(&data, counter, sizeof(int), 0, cudaMemcpyDeviceToHost);
What would be the equivalent in OpenCL? Similarly, how would I do this for an array? The only way I know involves allocating an extra buffer, copying to it on the GPU, then reading from it on the CPU.
Use the clEnqueueReadBuffer command, or with the C++ bindings:
T* host_buffer = nullptr;     // host-side destination buffer
cl::Buffer device_buffer;     // device-side source buffer
cl::CommandQueue cl_queue;    // command queue
// blocking is CL_TRUE/CL_FALSE; offset and length are element counts
cl_queue.enqueueReadBuffer(device_buffer, blocking, offset*sizeof(T), length*sizeof(T), (void*)(host_buffer+offset));
For a quick overview of OpenCL API calls, see the OpenCL Reference Card. If you want to make using OpenCL in C++ way easier and less bloated, see this OpenCL-Wrapper.
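With the plain C API the same read looks roughly like this (a sketch; queue, device_buffer, host_buffer and length are assumed to exist, and error checking is omitted). OpenCL has no direct analogue of a __device__ symbol: the usual pattern is to back the variable (or array) with a cl_mem buffer passed as a kernel argument and read it back like this:
/* Sketch: blocking read of length ints from a device buffer into host memory. */
cl_int err = clEnqueueReadBuffer(queue, device_buffer,
                                 CL_TRUE,              /* blocking read */
                                 0,                    /* byte offset into the buffer */
                                 length * sizeof(int), /* bytes to read */
                                 host_buffer,          /* host destination */
                                 0, NULL, NULL);       /* no event wait list */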
I'm new to HPC and I am curious about a point regarding the performance of MPI over InfiniBand. For reference, I am using OpenMPI on two different machines connected through IB.
I've coded a very simple benchmark to see how fast I can transfer data over IB using MPI calls. Below you can see the code.
The issue is that when I run this, I get a throughput of ~1.4 gigabytes/s. However, when I use standard ib benchmarks like ib_write_bw, I get nearly 6 GB/s. What might account for this sizable discrepancy? Am I being naive about Gather, or is this just a result of OpenMPI overheads that I can't overcome?
In addition to the code, I am providing a plot to show the results of my simple benchmark.
Thanks in advance!
Results: (benchmark plot omitted)
Code:
#include <iostream>
#include <mpi.h>
#include <stdint.h>
#include <ctime>

using namespace std;

void server(unsigned int size, unsigned int n) {
    uint8_t* recv = new uint8_t[size * n];
    uint8_t* send = new uint8_t[size];
    std::clock_t s = std::clock();
    MPI_Gather(send, size, MPI_CHAR, recv, size, MPI_CHAR, 0, MPI_COMM_WORLD);
    std::clock_t e = std::clock();
    cout << size << " " << (e - s) / double(CLOCKS_PER_SEC) << endl;
    delete [] recv;
    delete [] send;
}

void client(unsigned int size, unsigned int n) {
    uint8_t* send = new uint8_t[size];
    MPI_Gather(send, size, MPI_CHAR, NULL, 0, MPI_CHAR, 0, MPI_COMM_WORLD);
    delete [] send;
}

int main(int argc, char **argv) {
    int size, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    cout << "Rank " << rank << " of " << size << endl;

    unsigned int min = 1, max = (1u << 31), n = 1000;
    for (unsigned int i = 1; i < n; i++) {
        unsigned int s = i * ((max - min) / n);
        if (rank == 0) server(s, size); else client(s, size);
    }

    MPI_Finalize();
}
In your code you are executing a single collective operation per message size.
This involves huge overhead in comparison with tests that were written for performance measurement (e.g. ib_write_bw).
In general, comparing MPI collectives to ib_write_bw is not an apples-to-apples comparison:
RDMA opcode
ib_write_bw uses RDMA_WRITE operations, which don't use the CPU at all - once the initial handshake is done, it is pure RDMA, constrained only by network and PCIe capabilities.
MPI will use different RDMA opcodes for different collectives and different message sizes, and when you issue a single call per size as in your code, MPI does a lot of work for each message (hence the huge overhead).
Data overhead
ib_write_bw transfers almost pure data (there's a local routing header and a payload)
MPI has more data (headers) added to each packet to allow the receiver to identify the message
Zero copy
ib_write_bw is doing what is called "zero-copy" - data is sent from a user buffer directly, and written to a user buffer directly on the receiving side, w/o copying from/to buffers
MPI will copy the message from your client's buffer to its internal buffers on the sender side, then copy it again from its internal buffers on the receiving side to your server's buffer. Again, this behaviour depends on the message size and MPI configuration and MPI implementation, but you get the general idea.
Memory registration
ib_write_bw registers the required memory region and exchanges this info between client and server before starting measuring performance
If MPI needs to register a memory region during the collective's execution, it does so while you are measuring time.
And there are many more
Even "small" things like warming up the cache lines on the HCAs...
So, now that we've covered why you shouldn't compare these things, here's what you should do:
There are two libraries that are regarded as a de-facto standard for MPI performance measurement:
IMB (Intel MPI Benchmark) - it says Intel, but it is written as a standard MPI application and will work with any MPI implementation.
OSU benchmarks - again, it says MVAPICH, but it will work with any MPI.
Download those, compile with your MPI, run your benchmarks, see what you get.
This is as high as you can get with MPI.
If you get much better results than with your small program (and you surely will), remember these benchmarks are open source, so go and see how the pros are doing it :)
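For reference, the basic pattern those benchmarks use looks roughly like this (a sketch with an arbitrary 1 MiB message size; they time many iterations with MPI_Wtime, i.e. wall time, after a warm-up phase, instead of a single call measured with std::clock, which reports CPU time):
#include <mpi.h>
#include <iostream>
#include <vector>

// Sketch: time MPI_Gather the way the standard benchmarks do it -- warm-up
// iterations first, then many timed iterations using MPI_Wtime.
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int msg_size = 1 << 20;            // 1 MiB per rank, as an example
    const int warmup = 10, iters = 100;
    std::vector<char> sendbuf(msg_size);
    std::vector<char> recvbuf((size_t)msg_size * nprocs);   // only significant on root

    for (int i = 0; i < warmup; i++)         // warm up caches, connections, registrations
        MPI_Gather(sendbuf.data(), msg_size, MPI_CHAR,
                   recvbuf.data(), msg_size, MPI_CHAR, 0, MPI_COMM_WORLD);

    MPI_Barrier(MPI_COMM_WORLD);             // start all ranks together
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Gather(sendbuf.data(), msg_size, MPI_CHAR,
                   recvbuf.data(), msg_size, MPI_CHAR, 0, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        std::cout << msg_size << " bytes/rank: "
                  << (t1 - t0) / iters << " s per Gather" << std::endl;

    MPI_Finalize();
}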
Have fun!
You have to consider that the full payload size for the collective call received on rank 0 depends on the number of ranks. So with, say, 4 processes sending 1000 bytes you actually receive 4000 bytes on the root rank. That includes a memory copy from rank 0's input buffer into the output buffer (possibly with a detour through the network stack). And that is before you add the overheads of MPI and the lower networking protocols.
I'm learning OpenCL and attempting to use it in a low-latency scenario, so I'm really concerned about memory-transfer latency.
According to NVIDIA's OpenCL Best Practices Guide, and also to many other sources, direct read/write on a buffer object should be avoided; instead, we should use map/unmap. That guide gives demonstration code like this:
cl_mem cmPinnedBufIn = clCreateBuffer(cxGPUContext, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, memSize, NULL, NULL);
cl_mem cmDevBufIn = clCreateBuffer(cxGPUContext, CL_MEM_READ_ONLY, memSize, NULL, NULL);
unsigned char* cDataIn = (unsigned char*) clEnqueueMapBuffer(cqCommandQue, cmPinnedBufIn, CL_TRUE, CL_MAP_WRITE, 0, memSize, 0, NULL, NULL, NULL);
for(unsigned int i = 0; i < memSize; i++)
{
cDataIn[i] = (unsigned char)(i & 0xff);
}
clEnqueueWriteBuffer(cqCommandQue, cmDevBufIn, CL_FALSE, 0, memSize, cDataIn, 0, NULL, NULL);
In this code snippet, two buffer objects are generated explicitly, and a write-to-device operation is also explicitly called.
If my understanding is correct, when you call clCreateBuffer with CL_MEM_ALLOC_HOST_PTR or CL_MEM_USE_HOST_PTR, the buffer's storage is created on the host side, probably in DMA (pinned) memory, and no storage is allocated on the device side. So the above code actually creates two separate allocations. If so:
What would happen if I called clEnqueueMapBuffer on cmDevBufIn, which does not have host-side memory?
For CPU-integrated GPUs there is no separate graphics memory. In particular, for the newer AMD APUs the address space is unified as well. So creating two buffer objects does not seem like a good idea. What is the best practice for integrated platforms?
Is there any way to write a single set of memory-transfer code that works across platforms? Or must I write several different suites of memory-transfer code to achieve the best performance for NVIDIA, AMD discrete GPUs, old AMD APUs, new AMD APUs, and Intel HD Graphics......
Unfortunately, it's different for each vendor.
NVIDIA claims their best bandwidth is when you use read/write buffer where the host memory is "pinned", which can be achieved by creating a buffer with CL_MEM_ALLOC_HOST_PTR and mapping it (I think your example is that). You should also compare that to just mapping and unmapping the device memory; their more recent drivers have gotten better at that.
With AMD you can just map/unmap the device buffer to get full speed. They also have a bunch of vendor-specific buffer flags which can make certain scenarios faster; you should study them but more importantly create benchmarks that try out everything to see what actually works best for your task.
With both discrete devices you should use separate command queues for the transfer operations so they can overlap with other (non-dependent) compute operations (look up various compute overlap examples). Furthermore, some higher end discrete GPUs can be downloading one buffer at the same time they are uploading another (using dual DMA engines), so you could be uploading one batch of work while you're computing another while you're downloading the result of a third. When written elegantly, this isn't even much more code than the strictly sequential version, but you have to use OpenCL events to synchronize between command queues. NVIDIA has a GTC talk you can watch that shows how to do this for video frames every 16 ms.
With AMD's APU and with Intel's Integrated Graphics, the map/unmap of the "device" buffer is "free" since it is in main memory. Don't use read/write buffer here or you'll be paying for unneeded transfers.
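As a rough illustration of that map/unmap path (a sketch only; context, queue and memSize are assumed to exist as in your snippet, and error checking is omitted):
/* Sketch: map a device buffer for writing, fill it on the host, unmap it.
 * On integrated GPUs this avoids any copy; on discrete GPUs the driver
 * transfers the data when it is needed. */
cl_mem buf = clCreateBuffer(context, CL_MEM_READ_ONLY, memSize, NULL, NULL);

unsigned char *p = (unsigned char *)clEnqueueMapBuffer(
    queue, buf, CL_TRUE, CL_MAP_WRITE, 0, memSize, 0, NULL, NULL, NULL);

for (size_t i = 0; i < memSize; i++)
    p[i] = (unsigned char)(i & 0xff);

clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
/* buf can now be used as a kernel argument */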
What is the best way (in any sense) of allocating memory for OpenCL output data? Is there a solution what works reasonably with both discrete and integrated graphics?
As a super-simplified example, consider the following C++ (host) code:
std::vector<float> generate_stuff(size_t num_elements) {
std::vector<float> result(num_elements);
for(int i = 0; i < num_elements; ++i)
result[i] = i;
return result;
}
This can be implemented using an OpenCL kernel:
__kernel void gen_stuff(__global float *result) {
    result[get_global_id(0)] = get_global_id(0);
}
The most straightforward solution is to allocate an array on both the device and the host, then copy after the kernel has finished:
std::vector<float> generate_stuff(size_t num_elements) {
    // global context/kernel/queue objects set up appropriately
    cl_mem result_dev = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                                       num_elements * sizeof(float), nullptr, nullptr);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &result_dev);
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &num_elements, nullptr, 0, nullptr, nullptr);
    std::vector<float> result(num_elements);
    clEnqueueReadBuffer(queue, result_dev, CL_TRUE, 0, num_elements * sizeof(float),
                        result.data(), 0, nullptr, nullptr);
    clReleaseMemObject(result_dev);
    return result;
}
This works reasonably well with discrete cards. But with shared-memory graphics it means allocating twice the memory and doing an extra copy. How can one avoid this? One thing is for sure: one should drop clEnqueueReadBuffer and use clEnqueueMapBuffer/clEnqueueUnmapMemObject instead.
Some alternative scenarios:
Deal with an extra memory copy. Acceptable if memory bandwidth is not an issue.
Allocate a normal memory array on the host and use CL_MEM_USE_HOST_PTR when creating the buffer. One should allocate with device-specific alignment - it is 4k with Intel HD Graphics: https://software.intel.com/en-us/node/531272 I am not aware whether this can be queried from the OpenCL environment. Results should be mapped (with CL_MAP_READ) after the kernel finishes to flush caches. But when is it possible to unmap? Immediately after mapping has finished (it seems that does not work with AMD discrete graphics)? Deallocation of the array also requires modification of client code on Windows (because _aligned_free is different from free).
Allocate using CL_MEM_ALLOC_HOST_PTR and map after the kernel finishes. The cl_mem object has to be kept alive until the buffer is used (and probably kept mapped as well?), so it requires polluting client code. Also, this keeps the array in pinned memory, which might be undesirable.
Allocate on the device without CL_MEM_*_HOST_PTR, and map it after the kernel finishes. This is the same as option 2 from the deallocation perspective; it just avoids pinned memory. (Actually, I'm not sure that mapped memory isn't pinned.)
???
How are you dealing with this problem? Is there any vendor-specific solution?
You can do it with a single buffer, for both discrete and integrated hardware (a minimal sketch follows the list):
Allocate with CL_MEM_WRITE_ONLY (since your kernel only writes to the buffer). Optionally also use CL_MEM_ALLOC_HOST_PTR or vendor-specific (e.g., AMD) flags if it helps performance on certain platforms (read the vendor guidance and do benchmarking).
Enqueue your kernel that writes to the buffer.
clEnqueueMapBuffer with CL_MAP_READ and blocking. On discrete hardware this will copy over PCIe; on integrated hardware it's "free".
Use the results on the CPU using the returned pointer.
clEnqueueUnmapMemObject.
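A minimal sketch of those steps, assuming context, queue, kernel and num_elements are set up as in the question (error checking omitted):
/* Sketch of the single-buffer approach above. */
size_t bytes = num_elements * sizeof(float);
cl_mem result_dev = clCreateBuffer(context, CL_MEM_WRITE_ONLY, bytes, NULL, NULL);

clSetKernelArg(kernel, 0, sizeof(cl_mem), &result_dev);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &num_elements, NULL, 0, NULL, NULL);

/* Blocking map: copies over PCIe on discrete GPUs, essentially free on integrated GPUs. */
float *result = (float *)clEnqueueMapBuffer(queue, result_dev, CL_TRUE,
                                            CL_MAP_READ, 0, bytes,
                                            0, NULL, NULL, NULL);

/* ... use result[0 .. num_elements-1] on the CPU ... */

clEnqueueUnmapMemObject(queue, result_dev, result, 0, NULL, NULL);
clReleaseMemObject(result_dev);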
Depends on the use case:
For minimal memory footprint and IO efficiency: (Dithermaster's answer)
Create with the CL_MEM_WRITE_ONLY flag, or maybe CL_MEM_ALLOC_HOST_PTR (depending on the platform). Do a blocking map for reading, use the data, then unmap it. This option requires that the data handler (consumer) knows about the existence of OpenCL and unmaps the buffer using CL calls.
For situations where you have to provide the buffer data to a third party (i.e., libraries that need a C pointer or class buffer and are agnostic of CL):
In this case it may not be good to use mapped memory. Mapped-memory access time is typically longer than that of normal CPU memory. So, instead of mapping, then memcpy(), then unmapping, it is easier to perform a clEnqueueReadBuffer() directly to the CPU address where the output should be copied. For some vendors this does not use pinned memory and the copy is slow, so it is better to revert to option 1; but in other cases where there is no pinned memory I found it faster.
Any other conditions for reading the kernel output? I think not...
Is there a way to change the flags of an OpenCL buffer once it has been allocated?
My use case is the following:
1) create data on device
2) do large amounts of work on device with said data
I want to mark the data as CL_MEM_READ_ONLY to enable possible optimisations during 2, but of course it can't be read-only when it's being created in 1.
It would be acceptable to copy the data to a new read-only buffer, but I can't see any way of doing that without going via host memory.
As pointed out in the other answers, I also believe there are not likely to be any significant performance gains from using CL_MEM_READ_ONLY, as opposed to simply marking the buffer as const (or putting it in the constant address space, if it is small enough) inside your kernel.
However, you can achieve this using sub-buffers. If you create your buffer with CL_MEM_READ_WRITE, you can then create a sub-buffer that has the CL_MEM_READ_ONLY flag set.
cl_mem buffer = clCreateBuffer(context, CL_MEM_READ_WRITE, size, NULL, &err);

cl_buffer_region region = {0, size};
cl_mem robuffer = clCreateSubBuffer(buffer, CL_MEM_READ_ONLY,
                                    CL_BUFFER_CREATE_TYPE_REGION,
                                    (const void*)&region, &err);
You can't mutate the flags of an existing buffer. However, I think you can create two buffers that wrap the same host memory. If you are on an integrated graphics platform like Intel or AMD and use CL_MEM_USE_HOST_PTR, you can create a read-write buffer that wraps a piece of host memory. (The usual constraints apply: it has to be page-aligned and a whole number of cache lines in length on Intel; I'm not sure about AMD's requirements.) You can then create a second buffer wrapping the same region with different options (read-only) and use it separately.
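Roughly, the idea looks like this (a sketch only; the 4096-byte alignment and 1 MiB size are example values, whether the driver honors the flags is implementation-dependent, and error checking is omitted):
/* Sketch: two cl_mem objects wrapping the same page-aligned host allocation. */
size_t bytes = 1 << 20;                       /* example size, multiple of the alignment */
void *host_mem = aligned_alloc(4096, bytes);  /* C11; use _aligned_malloc on MSVC */

cl_int err;
cl_mem rw_buf = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                               bytes, host_mem, &err);
cl_mem ro_buf = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
                               bytes, host_mem, &err);

/* Fill via rw_buf first, then use only ro_buf for the read-only phase;
 * never enqueue work on both at the same time (overlap is undefined, see below). */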
It's definitely illegal to use overlapped regions in different enqueues at the same time.
The result of OpenCL commands that operate on multiple buffer objects created with the same host_ptr or overlapping host regions is considered to be undefined.
(from the clCreateBuffer documentation.) But barring that, it should work.
However, in the end, I strongly suspect you won't really gain anything. Implementations are free to ignore these flags. And I suspect that the overlap case above will force the implementation to ignore them (set the page access to the least restrictive combination of buffers mapping it). Integrated GPUs almost certainly will ignore those flags (I think Intel does).
What sort of optimizations were you hoping for?
My feeling is that it depends on how you allocate the buffer initially. With some flags you may be able to reuse it (you can try with alloc_host); others may not allow you to do so.
Is there a way to change the flags of an OpenCL buffer once it has been allocated?
No, there is not. You will have to create another buffer and use clEnqueueCopyBuffer to copy from one to the other.
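The copy stays entirely on the device, so it does not go through host memory. Roughly (a sketch reusing the buffer and size names from the code above; queue is assumed to exist and error checking is omitted):
/* Sketch: copy an existing buffer into a new read-only buffer, device to device. */
cl_mem ro_buf = clCreateBuffer(context, CL_MEM_READ_ONLY, size, NULL, NULL);
clEnqueueCopyBuffer(queue, buffer, ro_buf,
                    0, 0,          /* src and dst offsets in bytes */
                    size,          /* bytes to copy */
                    0, NULL, NULL);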
However, I really doubt the need for this. The memory flags affect (mainly) how the synchronization between host and device is performed. But once the memory is on the device, I doubt any optimizations can be done at all (unless the memory consists of just a few KB of data).
Even if optimizations are possible, the compiler should be clever enough to apply them when the memory is declared in the kernel as constant or read-only, regardless of the flags set on the memory buffer.
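For reference, those kernel-side qualifiers look like this (a sketch of an arbitrary kernel, not code from the question):
__kernel void scale(__global const float *in,   /* read-only through this pointer */
                    __constant float *coeffs,   /* constant address space */
                    __global float *out)
{
    size_t i = get_global_id(0);
    out[i] = in[i] * coeffs[0];
}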