I have a kernel that needs to apply a stencil operation to an array and store the result in another array. The stencil could be expressed as a function:
float stencil(const float* data)
{
    return *(data-1) + *(data+1);
}
I want every thread to produce 4 contiguous values of the output array by loading 6 contiguous values of the input array. That way I would be able to use the float4 type for loading and storing in chunks of 128 bytes. This is my program (you can download and compile it, but please consider the kernel first):
#include <iostream>
#include <cstdlib>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>

__global__ void kernel(const float* input, float* output, int size)
{
    int i = 4*(blockDim.x*blockIdx.x + threadIdx.x);
    float values[6];
    float res[4];
    // Load values
    values[0] = *(input+i-1);
    *reinterpret_cast<float4*>(values+1) = *reinterpret_cast<const float4*>(input+i);
    values[5] = *(input+i+4);
    // Compute result
    res[0] = values[0]+values[2];
    res[1] = values[1]+values[3];
    res[2] = values[2]+values[4];
    res[3] = values[3]+values[5];
    // Store result
    *reinterpret_cast<float4*>(output+i) = *reinterpret_cast<const float4*>(res);
}
int main()
{
    // Parameters
    const int nBlocks = 8;
    const int nThreads = 128;
    const int nValues = 4 * nThreads * nBlocks;
    // Allocate host and device memory
    thrust::host_vector<float> input_host(nValues+64);
    thrust::device_vector<float> input(nValues+64), output(nValues);
    // Generate random input
    srand48(42);
    thrust::generate(input_host.begin(), input_host.end(), []{ return drand48()+1.; });
    input = input_host;
    // Run kernel
    kernel<<<nBlocks, nThreads>>>(thrust::raw_pointer_cast(input.data()+32), thrust::raw_pointer_cast(output.data()), nValues);
    // Check output
    for (int i = 0; i < nValues; ++i)
    {
        float ref = input_host[31+i] + input_host[33+i];
        if (ref != output[i])
        {
            std::cout << "Error at " << i << " : " << ref << " " << output[i] << "\n";
            std::cout << "Abort with errors\n";
            std::exit(1);
        }
    }
    std::cout << "Success\n";
}
The program works perfectly.
I would expect the compiler to generate one LD.E.128 instruction for the central part of the local array values, with the registers for this central part being contiguous (e.g. R4, R5, R6, R7); two LD.E instructions for the two ends of values; and one ST.E.128 for the output array.
What happens in reality is the following:
code for sm_21
Function : _Z6kernelPKfPfi
/*0000*/ MOV R1, c[0x1][0x100]; /* 0x2800440400005de4 */
/*0008*/ NOP; /* 0x4000000000001de4 */
/*0010*/ MOV32I R3, 0x4; /* 0x180000001000dde2 */
/*0018*/ S2R R0, SR_CTAID.X; /* 0x2c00000094001c04 */
/*0020*/ S2R R2, SR_TID.X; /* 0x2c00000084009c04 */
/*0028*/ IMAD R0, R0, c[0x0][0x8], R2; /* 0x2004400020001ca3 */
/*0030*/ SHL R6, R0, 0x2; /* 0x6000c00008019c03 */
/*0038*/ IMAD R10.CC, R6, R3, c[0x0][0x20]; /* 0x2007800080629ca3 */
/*0040*/ IMAD.HI.X R11, R6, R3, c[0x0][0x24]; /* 0x208680009062dce3 */
/*0048*/ IMAD R2.CC, R6, R3, c[0x0][0x28]; /* 0x20078000a0609ca3 */
/*0050*/ LD.E R4, [R10+0xc]; /* 0x8400000030a11c85 */
/*0058*/ IMAD.HI.X R3, R6, R3, c[0x0][0x2c]; /* 0x20868000b060dce3 */
/*0060*/ LD.E R7, [R10+0x4]; /* 0x8400000010a1dc85 */
/*0068*/ LD.E R9, [R10+-0x4]; /* 0x87fffffff0a25c85 */
/*0070*/ LD.E R5, [R10+0x8]; /* 0x8400000020a15c85 */
/*0078*/ LD.E R0, [R10+0x10]; /* 0x8400000040a01c85 */
/*0080*/ LD.E R8, [R10]; /* 0x8400000000a21c85 */
/*0088*/ FADD R6, R7, R4; /* 0x5000000010719c00 */
/*0090*/ FADD R4, R9, R7; /* 0x500000001c911c00 */
/*0098*/ FADD R7, R5, R0; /* 0x500000000051dc00 */
/*00a0*/ FADD R5, R8, R5; /* 0x5000000014815c00 */
/*00a8*/ ST.E.128 [R2], R4; /* 0x9400000000211cc5 */
/*00b0*/ EXIT; /* 0x8000000000001de7 */
................................
All loads are 32-bit wide (LD.E). On the other hand, there is just one store instruction, ST.E.128, as expected.
I won't show the whole code here again, but I also ran a test where the stencil needs no value to the left, only one to the right (e.g. *data + *(data+1)). In that case the values array contains just 5 values, the float4 load fills the first 4 entries of the array (with one extra scalar load for the last value), and the compiler does use LD.E.128.
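Reconstructed from that description (this is a sketch, not the exact test code), the right-only variant looks roughly like this:

__global__ void kernel_right(const float* input, float* output, int size)
{
    int i = 4*(blockDim.x*blockIdx.x + threadIdx.x);
    float values[5];
    float res[4];
    // The float4 load now starts at the first entry of the local array...
    *reinterpret_cast<float4*>(values) = *reinterpret_cast<const float4*>(input+i);
    // ...and one extra scalar load fetches the trailing value
    values[4] = *(input+i+4);
    res[0] = values[0]+values[1];
    res[1] = values[1]+values[2];
    res[2] = values[2]+values[3];
    res[3] = values[3]+values[4];
    *reinterpret_cast<float4*>(output+i) = *reinterpret_cast<const float4*>(res);
}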
My question is: why doesn't the compiler understand that it can use the 128-bit-wide read when the target register is not the first one of the local array? After all, the local array values is just a programming way of saying that I need 6 floats stored in registers; there is no such thing as an array in the resulting PTX or SASS code. I thought I had given the compiler enough hints to understand that LD.E.128 was the right instruction here.
Second question: how can I make it use the 128-bit-wide load here without having to write low-level code by hand? (That said, if a couple of asm instructions would help, I'm open to suggestions.)
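For instance, I believe something along these lines would force the wide load via inline PTX (a sketch, assuming a 64-bit build and a 16-byte-aligned address):

float4 v;
// Force a 128-bit global load via inline PTX; alignment is the caller's responsibility
asm("ld.global.v4.f32 {%0, %1, %2, %3}, [%4];"
    : "=f"(v.x), "=f"(v.y), "=f"(v.z), "=f"(v.w)
    : "l"(input + i));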
Side note: the decision to use 32-bit loads for reading the input and a 128-bit store for writing the output is made while producing PTX code; the PTX already shows this pattern of multiple small loads and a single large store.
I am using CUDA 7.5 under Linux.
Based on the suggestions given in the comments, I did some experiments.
Declaring either input or output as __restrict__ (or both) solves the problem: when generating code for the sm_35 architecture, the compiler emits one LD.E.128 and two LD.E, which is what I wanted to achieve. Strangely enough, when generating for sm_21 it still produces six LD.E, although it does produce one ST.E.128. This looks like a compiler bug to me, because LD.E.128 should be just as usable on the older architecture as on the newer one.
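Concretely, the only change is in the kernel signature; the body stays exactly as above:

__global__ void kernel(const float* __restrict__ input,
                       float* __restrict__ output, int size)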
With that small change of adding the __restrict__ keyword, as suggested by njuffa, the code presented above uses the 128-bit loads and works. I also followed the suggestion by m.s. and reproduced the results shown in the pastebin snippet (one LD.E.128 + one LD.E.64). But at runtime that version crashes with the following error:
terminate called after throwing an instance of 'thrust::system::system_error'
what(): an illegal memory access was encountered
I'm pretty sure the misalignment is the cause of this problem.
Update: after using cuda-memcheck I'm sure the problem is misalignment:
========= Invalid __global__ read of size 16
========= at 0x00000060 in kernel(float const *, float*, int)
========= by thread (4,0,0) in block (7,0,0)
========= Address 0xb043638bc is misaligned
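This makes sense: CUDA requires vector accesses to be naturally aligned, so a float4 load needs a 16-byte-aligned address, and the reported address ends in 0xbc, which is not a multiple of 16. A trivial host-side sanity check (my own helper, not part of the snippet) would be:

#include <cstdint>

// A float4 access is only legal if the address sits on a 16-byte boundary
bool is_float4_aligned(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}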
The problem is that the nvcc compiler is unable to resolve the base address for the vector load in your kernel. This may be a bug, or simply an inadequacy of the optimizer.
I modified your code a little bit:
__global__ void kernel2(const float* input, float* output, int size)
{
    int i = (blockDim.x*blockIdx.x + threadIdx.x);
    float values[6];
    float res[4];
    // Load values
    values[0] = *(input+(i*4)-1);
    float4 test = *(reinterpret_cast<const float4*>(input)+i);
    values[5] = *(input+(i*4)+4);
    values[1] = test.x;
    values[2] = test.y;
    values[3] = test.z;
    values[4] = test.w;
    // Compute result
    res[0] = values[0]+values[2];
    res[1] = values[1]+values[3];
    res[2] = values[2]+values[4];
    res[3] = values[3]+values[5];
    // Store result
    *(reinterpret_cast<float4*>(output)+i) = *reinterpret_cast<const float4*>(res);
}
The kernel code compiles to the following PTX:
.visible .entry _Z7kernel2PKfPfi(
.param .u64 _Z7kernel2PKfPfi_param_0,
.param .u64 _Z7kernel2PKfPfi_param_1,
.param .u32 _Z7kernel2PKfPfi_param_2
)
{
.reg .f32 %f<15>;
.reg .b32 %r<7>;
.reg .b64 %rd<10>;
ld.param.u64 %rd1, [_Z7kernel2PKfPfi_param_0];
ld.param.u64 %rd2, [_Z7kernel2PKfPfi_param_1];
mov.u32 %r1, %ntid.x;
mov.u32 %r2, %ctaid.x;
mov.u32 %r3, %tid.x;
mad.lo.s32 %r4, %r2, %r1, %r3;
shl.b32 %r5, %r4, 2;
add.s32 %r6, %r5, -1;
mul.wide.s32 %rd3, %r6, 4;
cvta.to.global.u64 %rd4, %rd1;
add.s64 %rd5, %rd4, %rd3;
ld.global.f32 %f1, [%rd5];
mul.wide.s32 %rd6, %r4, 16;
add.s64 %rd7, %rd4, %rd6;
ld.global.v4.f32 {%f2, %f3, %f4, %f5}, [%rd7];
ld.global.f32 %f10, [%rd5+20];
cvta.to.global.u64 %rd8, %rd2;
add.s64 %rd9, %rd8, %rd6;
add.f32 %f11, %f3, %f5;
add.f32 %f12, %f2, %f4;
add.f32 %f13, %f4, %f10;
add.f32 %f14, %f1, %f3;
st.global.v4.f32 [%rd9], {%f14, %f12, %f11, %f13};
ret;
}
You can see nicely how the addresses for the vectorized load and store are computed: the 16-byte-scaled offset %rd6 is added to the global base pointers, yielding %rd7 for the load and %rd9 for the store.
Compiling your kernel to PTX, on the other hand, results in:
.visible .entry _Z6kernelPKfPfi(
.param .u64 _Z6kernelPKfPfi_param_0,
.param .u64 _Z6kernelPKfPfi_param_1,
.param .u32 _Z6kernelPKfPfi_param_2
)
{
.reg .f32 %f<11>;
.reg .b32 %r<6>;
.reg .b64 %rd<8>;
ld.param.u64 %rd1, [_Z6kernelPKfPfi_param_0];
ld.param.u64 %rd2, [_Z6kernelPKfPfi_param_1];
cvta.to.global.u64 %rd3, %rd2;
cvta.to.global.u64 %rd4, %rd1;
mov.u32 %r1, %ntid.x;
mov.u32 %r2, %ctaid.x;
mov.u32 %r3, %tid.x;
mad.lo.s32 %r4, %r2, %r1, %r3;
shl.b32 %r5, %r4, 2;
mul.wide.s32 %rd5, %r5, 4;
add.s64 %rd6, %rd4, %rd5;
ld.global.f32 %f1, [%rd6+-4];
ld.global.f32 %f2, [%rd6];
ld.global.f32 %f3, [%rd6+12];
ld.global.f32 %f4, [%rd6+4];
ld.global.f32 %f5, [%rd6+8];
ld.global.f32 %f6, [%rd6+16];
add.s64 %rd7, %rd3, %rd5;
add.f32 %f7, %f5, %f6;
add.f32 %f8, %f4, %f3;
add.f32 %f9, %f2, %f5;
add.f32 %f10, %f1, %f4;
st.global.v4.f32 [%rd7], {%f10, %f9, %f8, %f7};
ret;
}
where the compiler generates code to compute only one address (%rd6) and uses static offsets from it. At this point the compiler fails to emit a vector load. Why? I honestly don't know; maybe two optimizations interfere here.
In SASS you see for kernel2:
.section .text._Z7kernel2PKfPfi,"ax",#progbits
.sectioninfo #"SHI_REGISTERS=18"
.align 64
.global _Z7kernel2PKfPfi
.type _Z7kernel2PKfPfi,#function
.size _Z7kernel2PKfPfi,(.L_39 - _Z7kernel2PKfPfi)
.other _Z7kernel2PKfPfi,#"STO_CUDA_ENTRY STV_DEFAULT"
_Z7kernel2PKfPfi:
.text._Z7kernel2PKfPfi:
/*0008*/ MOV R1, c[0x0][0x44];
/*0010*/ S2R R0, SR_CTAID.X;
/*0018*/ MOV R4, c[0x0][0x140];
/*0020*/ S2R R3, SR_TID.X;
/*0028*/ MOV R5, c[0x0][0x144];
/*0030*/ IMAD R3, R0, c[0x0][0x28], R3;
/*0038*/ MOV32I R8, 0x10;
/*0048*/ IMAD R16.CC, R3, 0x10, R4;
/*0050*/ ISCADD R0, R3, -0x1, 0x2;
/*0058*/ IMAD.HI.X R17, R3, 0x10, R5;
/*0060*/ IMAD R14.CC, R0, 0x4, R4;
/*0068*/ IMAD.HI.X R15, R0, 0x4, R5;
/*0070*/ LD.E.128 R4, [R16];
/*0078*/ LD.E R2, [R14];
/*0088*/ IMAD R12.CC, R3, R8, c[0x0][0x148];
/*0090*/ LD.E R0, [R14+0x14];
/*0098*/ IMAD.HI.X R13, R3, R8, c[0x0][0x14c];
/*00a0*/ FADD R9, R4, R6;
/*00a8*/ FADD R10, R5, R7;
/*00b0*/ FADD R8, R2, R5;
/*00b8*/ FADD R11, R6, R0;
/*00c8*/ ST.E.128 [R12], R8;
/*00d0*/ EXIT;
.L_1:
/*00d8*/ BRA `(.L_1);
.L_39:
Here you have your LD.E.128.
Compiled with nvcc release 7.5, V7.5.17.
I am developing an application which performs some processing on video frame data. To accelerate it I use two graphics cards and process the data with OpenCL. My idea is to send one frame to the first card and another one to the second card. The devices use the same context, but different command queues, kernels and memory objects.
However, it seems to me that the computations are not executed in parallel, because the time required by the two cards is almost the same as the time required by only one.
Does anyone have a good example of using multiple devices on independent pieces of data simultaneously?
Thanks in advance.
EDIT:
Here is the resulting code after switching to two separate contexts. However, the execution time with two graphics cards still remains the same as with one.
cl::NDRange globalws(imageSize);
cl::NDRange localws;
for (int i = 0; i < numDevices; i++){
    // Copy the input data to the device
    commandQueues[i].enqueueWriteBuffer(inputDataBuffer[i], CL_TRUE, 0, imageSize*sizeof(float), wt[i].data);
    // Set kernel arguments
    kernel[i].setArg(0, inputDataBuffer[i]);
    kernel[i].setArg(1, modulusBuffer[i]);
    kernel[i].setArg(2, imagewidth);
}
for (int i = 0; i < numDevices; i++){
    // Run kernel
    commandQueues[i].enqueueNDRangeKernel(kernel[i], cl::NullRange, globalws, localws);
}
for (int i = 0; i < numDevices; i++){
    // Read the modulus back to the host
    float* modulus = new float[imageSize/4];
    commandQueues[i].enqueueReadBuffer(modulusBuffer[i], CL_TRUE, 0, imageSize/4*sizeof(float), modulus);
    // Do something with the modulus;
}
Your main problem is that you are using blocking calls. It doesn't matter how many devices you have if you operate them that way: you perform an operation and wait for it to finish, so there is no parallelization at all (or very little). You are doing this at the moment:
Wr:-Copy1--Copy2--------------------
G1:---------------RUN1--------------
G2:---------------RUN2--------------
Re:-------------------Read1--Read2--
You should change your code to do it like this at least:
Wr:-Copy1-Copy2-----------
G1:------RUN1-------------
G2:------------RUN2-------
Re:----------Read1-Read2--
With this code:
cl::NDRange globalws(imageSize);
cl::NDRange localws;
for (int i = 0; i < numDevices; i++){
    // Set kernel arguments (YOU SHOULD DO THIS AT THE INIT STAGE, IT IS SLOW TO DO IT IN A LOOP)
    kernel[i].setArg(0, inputDataBuffer[i]);
    kernel[i].setArg(1, modulusBuffer[i]);
    kernel[i].setArg(2, imagewidth);
    // Copy the input data to the device (non-blocking)
    commandQueues[i].enqueueWriteBuffer(inputDataBuffer[i], CL_FALSE, 0, imageSize*sizeof(float), wt[i].data);
}
for (int i = 0; i < numDevices; i++){
    // Run kernel
    commandQueues[i].enqueueNDRangeKernel(kernel[i], cl::NullRange, globalws, localws);
}
float* modulus[numDevices];
for (int i = 0; i < numDevices; i++){
    // Read the modulus back to the host (non-blocking)
    modulus[i] = new float[imageSize/4];
    commandQueues[i].enqueueReadBuffer(modulusBuffer[i], CL_FALSE, 0, imageSize/4*sizeof(float), modulus[i]);
}
// Wait for all queues to drain before touching the results
for (int i = 0; i < numDevices; i++){
    commandQueues[i].finish();
}
// Do something with the modulus;
Regarding the comments about having multiple contexts: it depends on whether you are ever going to communicate between the two GPUs or not. As long as each GPU only uses its own memory, there will be no copy overhead. But if you set/unset kernel args constantly, that will trigger copies to the other GPU. So be careful with that.
If the GPUs never need to communicate, the safer approach is to use separate contexts.
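A minimal sketch of that setup might look like this (assuming devices already holds the two GPUs; error handling omitted):

std::vector<cl::Context> contexts;
std::vector<cl::CommandQueue> queues;
for (const auto& dev : devices) {
    contexts.emplace_back(dev);                // one context per device
    queues.emplace_back(contexts.back(), dev); // one queue per context
}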
I suspect your main problem is the memory copy rather than the kernel execution; it is highly likely that one GPU will fulfil your needs if you hide the memory latency:
Wr:-Copy1-Copy2-Copy3----------
G1:------RUN1--RUN2--RUN3------
Re:----------Read1-Read2-Read3-
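One possible way to get that overlap on a single device is double buffering: two in-order queues on the same device, one per buffer slot, with every call non-blocking (a sketch; names like frames and results are hypothetical, and it assumes the device can overlap transfers with execution):

for (int f = 0; f < numFrames; ++f) {
    int s = f % 2; // alternate between two buffer slots / queues
    commandQueues[s].enqueueWriteBuffer(inputDataBuffer[s], CL_FALSE, 0, imageSize*sizeof(float), frames[f]);
    commandQueues[s].enqueueNDRangeKernel(kernel[s], cl::NullRange, globalws, localws);
    commandQueues[s].enqueueReadBuffer(modulusBuffer[s], CL_FALSE, 0, imageSize/4*sizeof(float), results[f]);
}
for (int s = 0; s < 2; ++s) commandQueues[s].finish();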
I'm new to OpenCL and in order to get a better grasp of a few concepts I contrived a simple example of a geometric progression as follows (emphasis on contrived):
An array of N values and an array of N coefficients (whose values could be anything, but in the example they are all the same) are allocated.
M steps are performed in sequence, where each value in the values array is multiplied by its corresponding coefficient in the coefficients array and assigned as the new value in the values array. Each step needs to fully complete before the next step can begin. I know this part is a bit contrived, but this is a requirement I want to enforce to help my understanding of OpenCL.
I'm only interested in the values in the values array after the final step has completed.
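Since every step multiplies by the same coefficient, the final result has a closed form, which makes for an easy host-side correctness check (values0, coeffs, M, and i are hypothetical names here):

#include <cmath>
// Expected value of element i after M multiply steps
float expected = values0[i] * std::pow(coeffs[i], static_cast<float>(M));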
Here is the very simple OpenCL kernel (MultiplyVectors.cl):
__kernel void MultiplyVectors (__global float4* x, __global float4* y, __global float4* result)
{
    int i = get_global_id(0);
    result[i] = x[i] * y[i];
}
And here is the host program (main.cpp):
#include <CL/cl.hpp>
#include <vector>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main ()
{
    auto context = cl::Context (CL_DEVICE_TYPE_GPU);
    auto *sourceFile = fopen("MultiplyVectors.cl", "r");
    if (sourceFile == nullptr)
    {
        perror("Couldn't open the source file");
        return 1;
    }
    fseek(sourceFile, 0, SEEK_END);
    const auto sourceSize = ftell(sourceFile);
    auto *sourceBuffer = new char [sourceSize + 1];
    sourceBuffer[sourceSize] = '\0';
    rewind(sourceFile);
    fread(sourceBuffer, sizeof(char), sourceSize, sourceFile);
    fclose(sourceFile);
    auto program = cl::Program (context, cl::Program::Sources {std::make_pair (sourceBuffer, sourceSize + 1)});
    delete[] sourceBuffer;
    const auto devices = context.getInfo<CL_CONTEXT_DEVICES> ();
    program.build (devices);
    auto kernel = cl::Kernel (program, "MultiplyVectors");
    const size_t vectorSize = 1024;
    float coeffs[vectorSize] {};
    for (size_t i = 0; i < vectorSize; ++i)
    {
        coeffs[i] = 1.000001;
    }
    auto coeffsBuffer = cl::Buffer (context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof (coeffs), coeffs);
    float values[vectorSize] {};
    for (size_t i = 0; i < vectorSize; ++i)
    {
        values[i] = static_cast<float> (i);
    }
    auto valuesBuffer = cl::Buffer (context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, sizeof (values), values);
    kernel.setArg (0, coeffsBuffer);
    kernel.setArg (1, valuesBuffer);
    kernel.setArg (2, valuesBuffer);
    auto commandQueue = cl::CommandQueue (context, devices[0]);
    for (size_t i = 0; i < 1000000; ++i)
    {
        commandQueue.enqueueNDRangeKernel (kernel, cl::NDRange (0), cl::NDRange (vectorSize / 4), cl::NullRange);
    }
    printf ("All kernels enqueued. Waiting to read buffer after last kernel...");
    commandQueue.enqueueReadBuffer (valuesBuffer, CL_TRUE, 0, sizeof (values), values);
    return 0;
}
What I'm basically asking is for advice on how to best optimize this OpenCL program to run on a GPU. I have the following questions based on my limited OpenCL experience to get the conversation going:
Could I be handling the buffers better? I'd like to minimize any unnecessary ferrying of data between the host and the GPU.
What's the optimal work-group configuration (in general at least; I know this can vary by GPU)? I'm not actually sharing any data between work items, and it doesn't seem like I'd benefit much from work groups here, but just in case.
Should I be allocating and loading anything into local memory for a work group (if that would make sense at all)?
I'm currently enqueuing one kernel for each step, which creates one work item per 4 floats to take advantage of a hypothetical GPU with a SIMD width of 128 bits. I'm attempting to enqueue all of these asynchronously at once (although I'm noticing that the Nvidia implementation I have seems to block on each enqueue until the kernel is complete) and then wait on the final one to complete. Is there a better approach altogether that I'm missing?
Is there a design that would allow for only one call to enqueueNDRangeKernel (instead of one call per step) while maintaining the ability for each step to be efficiently processed in parallel?
Obviously I know that the example problem I'm solving can be done in much better ways, but I wanted as simple an example as possible that illustrates a vector of values being operated on in a series of steps, where each step has to complete fully before the next. Any help and pointers on how best to go about this would be greatly appreciated.
Thanks!