clBuildProgram gets stuck with nested loops - opencl

clBuildProgram seems to get stuck, without any error message, when trying to build this kind of .cl file:
__local int bar(int a, int b, int c, int d, int e)
{
return a*b*c*d; // 'e' not used
}
__kernel void foobar(__global int * notusedvariable)
{
int foo=1;
for (int a=1; a<=10; a++)
for (int b=1; b<=10; b++)
for (int c=1; c<=10; c++)
for (int d=1; d<=10; d++)
for (int e=1; e<=10; e++)
foo *= bar(a,b,c,d,e);
}
When I remove the innermost loop and change foo *= bar(a,b,c,d,e); to foo *= bar(a,b,c,d,1);, it compiles. So there is some kind of over-aggressive optimization or compile-time precalculation going on. This also happens when I have more loops and some of the variables are taken from get_global_id(...).
What can I do?
I use Fedora Linux 20, and have installed
opencl-utils-0-12.svn16.fc20.x86_64
opencl-1.2-intel-cpu-3.2.1.16712-1.x86_64
opencl-utils-devel-0-12.svn16.fc20.x86_64
opencl-1.2-base-3.2.1.16712-1.x86_64
The GPU is a GeForce 210, i.e. the cheapest one I could find.

It is not really "stuck". It is just trapped in a hell of attempts to optimize the kernel, primarily by unrolling the loops with fixed trip counts (and, by the way, by finding out that the foo variable is not used at all!).
For example, when the loops a...d are enabled (and e is switched off), the binaries that are created for the kernel look like this:
.entry foobar(
.param .u32 .ptr .global .align 4 foobar_param_0
)
{
.reg .pred %p<4>;
.reg .s32 %r<13>;
mov.u32 %r10, 0;
BB0_1:
add.s32 %r10, %r10, 1;
mov.u32 %r11, 0;
BB0_2:
mov.u32 %r12, 10;
BB0_3:
add.s32 %r12, %r12, -2;
setp.ne.s32 %p1, %r12, 0;
@%p1 bra BB0_3;
add.s32 %r11, %r11, 1;
setp.ne.s32 %p2, %r11, 10;
@%p2 bra BB0_2;
setp.ne.s32 %p3, %r10, 10;
@%p3 bra BB0_1;
ret;
}
You can see that it is not really computing anything - and the compiler already has a hard time finding out that there is actually nothing to do.
Compare this to the output that is generated when you add the line
notusedvariable[0]=foo;
as the last line of the kernel: Now the computations cannot be skipped and optimized away. After quite a while of compiling, it produces this result:
.entry foobar(
.param .u32 .ptr .global .align 4 foobar_param_0
)
{
.reg .pred %p<4>;
.reg .s32 %r<80>;
mov.u32 %r79, 1;
mov.u32 %r73, 0;
mov.u32 %r72, %r73;
BB0_1:
add.s32 %r7, %r73, 1;
add.s32 %r72, %r72, 2;
mov.u32 %r76, 0;
mov.u32 %r74, %r76;
mov.u32 %r73, %r7;
mov.u32 %r75, %r7;
BB0_2:
mov.u32 %r9, %r75;
add.s32 %r74, %r74, %r72;
mov.u32 %r78, 10;
mov.u32 %r77, 0;
BB0_3:
add.s32 %r40, %r9, %r77;
mul.lo.s32 %r41, %r40, %r79;
mul.lo.s32 %r42, %r40, %r41;
add.s32 %r43, %r74, %r77;
mul.lo.s32 %r53, %r42, %r40;
mul.lo.s32 %r54, %r53, %r40;
mul.lo.s32 %r55, %r54, %r40;
mul.lo.s32 %r56, %r55, %r40;
mul.lo.s32 %r57, %r56, %r40;
mul.lo.s32 %r58, %r57, %r40;
mul.lo.s32 %r59, %r58, %r40;
mul.lo.s32 %r60, %r59, %r40;
mul.lo.s32 %r61, %r60, %r43;
mul.lo.s32 %r62, %r61, %r43;
mul.lo.s32 %r63, %r62, %r43;
mul.lo.s32 %r64, %r63, %r43;
mul.lo.s32 %r65, %r64, %r43;
mul.lo.s32 %r66, %r65, %r43;
mul.lo.s32 %r67, %r66, %r43;
mul.lo.s32 %r68, %r67, %r43;
mul.lo.s32 %r69, %r68, %r43;
mul.lo.s32 %r70, %r69, %r43;
mul.lo.s32 %r79, %r70, -180289536;
add.s32 %r77, %r77, %r74;
add.s32 %r78, %r78, -2;
setp.ne.s32 %p1, %r78, 0;
@%p1 bra BB0_3;
add.s32 %r76, %r76, 1;
add.s32 %r30, %r9, %r7;
setp.ne.s32 %p2, %r76, 10;
mov.u32 %r75, %r30;
@%p2 bra BB0_2;
setp.ne.s32 %p3, %r7, 10;
@%p3 bra BB0_1;
ld.param.u32 %r71, [foobar_param_0];
st.global.u32 [%r71], %r79;
ret;
}
Obviously, it has unrolled some of the loops now that it could not optimize them away any more. I assume that when loop "e" is also activated, the time required for this sort of unrolling (or for optimizing away the unused loops) increases at least quadratically. So if you give the compiler a few hours, it might actually finish the compilation as well...
As Tom Fenech already said in https://stackoverflow.com/a/22011454 , this problem can be alleviated by passing -cl-opt-disable to clBuildProgram.
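For reference, a minimal sketch of passing this flag (assuming a program and device that have already been created the usual way):
// Build with all optimizations disabled; 'program' and 'device' are
// assumed to exist from the usual OpenCL setup code.
cl_int err = clBuildProgram(program, 1, &device, "-cl-opt-disable", NULL, NULL);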
Alternatively, you can selectively switch off the unrolling optimization for each loop: When you insert
#pragma unroll 1
directly before a for-loop, you are effectively disabling unrolling for this particular loop.
Important: Don't blindly insert the unroll pragma with arbitrary values. Using 1 is safe, but for other values you have to manually make sure that it does not affect the correctness of the program. See the CUDA programming guide, section "B.21. #pragma unroll".
In this case, it seems to be sufficient to insert #pragma unroll 1 before the two innermost loops (d and e); that leaves enough of the optimization enabled while letting the program build quickly.
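For the kernel above, that would look like this (a sketch):
for (int a=1; a<=10; a++)
for (int b=1; b<=10; b++)
for (int c=1; c<=10; c++)
#pragma unroll 1
for (int d=1; d<=10; d++)
#pragma unroll 1
for (int e=1; e<=10; e++)
foo *= bar(a,b,c,d,e);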
EDIT: sigh prunge was 4 minutes faster... :-(

The calculation that you are doing is going to produce a very large number indeed!
I get the same problem as you on my hardware (an NVIDIA GTX 480), but I don't think this is hardware-dependent. You are simply generating a number that is known at compile time and is too large to pre-compute for the int variable type. I changed int to long and the program now builds.
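Applied to the kernel above, that change might look like this (a sketch; the helper's return type has to be widened as well):
long bar(int a, int b, int c, int d, int e)
{
return (long)a*b*c*d; // 'e' not used
}
...
long foo=1;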
edit
I just tried this, using the Intel platform. It compiles fine. You can also make it work on NVIDIA by passing the switch -cl-opt-disable to clBuildProgram. This disables all optimisations - you may have some luck with some of the other compiler switches. See the clBuildProgram reference for details of those.

The Nvidia OpenCL compiler might be performing loop unrolling. If it does that for each of your nested loops, that will result in a lot of code being generated!
There is the cl_nv_pragma_unroll Nvidia specific OpenCL extension
that can be used to have greater control over loop unrolling. To
quote its documentation:
A user may specify that a loop in the source program be unrolled. This
is done via a pragma. The syntax of this pragma is as follows
#pragma unroll [unroll-factor]
The pragma unroll may optionally specify an unroll factor. The pragma
must be placed immediately before the loop and only applies to that
loop.
If unroll factor is not specified then the compiler will try to do
complete or full unrolling of the loop. If a loop unroll factor is
specified the compiler will perform partial loop unrolling. The loop
factor, if specified, must be a compile-time non-negative integer
constant.
A loop unroll factor of 1 means that the compiler should not unroll
the loop.
A complete unroll specification has no effect if the trip count of the
loop is not compile-time computable.
By default, it sounds like it will unroll loops with certain low maximum limits (e.g. 10 in your example), and it will likely still unroll them when they are nested (the unroll checking logic is probably not sophisticated enough to account for nested loops).
You could try #pragma unroll 1 to disable unrolling:
int foo=1;
#pragma unroll 1
for (int a=1; a<=10; a++)
#pragma unroll 1
for (int b=1; b<=10; b++)
#pragma unroll 1
for (int c=1; c<=10; c++)
#pragma unroll 1
for (int d=1; d<=10; d++)
#pragma unroll 1
for (int e=1; e<=10; e++)
foo *= bar(a,b,c,d,e);
You'll also want to enable the extension by putting:
#pragma OPENCL EXTENSION cl_nv_pragma_unroll : enable
at the top of your OpenCL source file as well.

Related

Can MPI_Bcast work correctly while using multiple threads?

When I use MPI, there are multiple threads doing MPI_Bcast, like:
#pragma omp parallel for
for(int i = 0; i < k; i++)
{
MPI_Bcast(&a[i], 1, MPI_INT32_T, TargetRank, MPI_COMM_WORLD);
}
If the data size and type are the same, it seems the broadcasts will be matched to the wrong places.
How could I fix it? (For now I use a custom my_bcast with a tag.)
(Problem schematic image)
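The thread stops here without an answer, but the underlying issue is that MPI_Bcast has no tag, so concurrent broadcasts on one communicator can match each other out of order. A common remedy - sketched below as an assumption, not taken from this thread - is to give each thread its own duplicated communicator (this requires MPI_THREAD_MULTIPLE and an identical loop schedule on every rank):
#include <mpi.h>
#include <omp.h>
// One duplicated communicator per thread, created serially up front;
// collectives on different communicators cannot be mismatched.
int nthreads = omp_get_max_threads();
MPI_Comm comms[nthreads];
for (int t = 0; t < nthreads; t++)
MPI_Comm_dup(MPI_COMM_WORLD, &comms[t]);
#pragma omp parallel for schedule(static) // static: same i->thread mapping on all ranks
for (int i = 0; i < k; i++)
MPI_Bcast(&a[i], 1, MPI_INT32_T, TargetRank, comms[omp_get_thread_num()]);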

AMD OpenCL C Compiler notes dead and deleted loops which shouldn't be dead and deleted

I have the following loop executed in my OpenCl kernel:
__kernel void kernelA(/* many parameters */)
{
/* Prefetching code and other stuff
* ...
* ...
*/
float2 valueA = 0.0f;
#pragma unroll //<----- line X
for(unsigned int i = 0; i < MAX_A; i++) // MAX_A > 0
{
#pragma unroll
for(unsigned int j = 0; j < MAX_B; j++) // MAX_B > 0
valueA += arrayA[(i * MAX_A) + j];
}
/*
* Code that uses the result saved to valueA
*/
}
As can be clearly seen, the loop is supposed to sum up the values contained in arrayA. Now I wanted to try #pragma unroll to see whether there is any performance difference between looped and unrolled execution.
But when I compile the kernel, the compiler notes: LOOP UNROLL: pragma unroll (line X) ignored because this loop is dead and deleted. I don't understand that message, because the code in the loop is surely executed. MAX_A and MAX_B are definitely greater than zero, and the sum saved to valueA is also used after the loop.
I have the same structure somewhere else in the code, and that position is flagged with the same note.
The compiler I use is the AMD OpenCL C compiler delivered by the APP SDK.
The comment by @DarkZeroes is the solution to this question: there was no instruction to put the result into an output array of the kernel, so the code above and everything that depended on it was optimized away by the compiler.
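In other words, giving the kernel an observable result keeps the loop alive. A hypothetical fix could look like this (the result parameter is an assumption, not from the thread):
__kernel void kernelA(/* many parameters */, __global float2* result)
{
/* ... loops as above ... */
result[get_global_id(0)] = valueA; // the sum is now observable, so the loop is not dead
}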

How to convince nvcc to use 128-bit wide loads?

I have a kernel that needs to apply a stencil operation on an array and store the result on another array. The stencil could be expressed in a function as:
float stencil(const float* data)
{
return *(data-1) + *(data+1);
}
I want every thread to produce 4 contiguous values of the output array by loading 6 contiguous values of the input array. By doing so I would be able to use the float4 type for loading and storing in chunks of 128 bits. This is my program (you can download and compile it, but please look at the kernel first):
#include<iostream>
#include<cstdlib>
#include<thrust/host_vector.h>
#include<thrust/device_vector.h>
__global__ void kernel(const float* input, float* output, int size)
{
int i = 4*(blockDim.x*blockIdx.x + threadIdx.x);
float values[6];
float res[4];
// Load values
values[0] = *(input+i-1);
*reinterpret_cast<float4*>(values+1) = *reinterpret_cast<const float4*>(input+i);
values[5] = *(input+i+4);
// Compute result
res[0] = values[0]+values[2];
res[1] = values[1]+values[3];
res[2] = values[2]+values[4];
res[3] = values[3]+values[5];
// Store result
*reinterpret_cast<float4*>(output+i) = *reinterpret_cast<const float4*>(res);
}
int main()
{
// Parameters
const int nBlocks = 8;
const int nThreads = 128;
const int nValues = 4 * nThreads * nBlocks;
// Allocate host and device memory
thrust::host_vector<float> input_host(nValues+64);
thrust::device_vector<float> input(nValues+64), output(nValues);
// Generate random input
srand48(42);
thrust::generate(input_host.begin(), input_host.end(), []{ return drand48()+1.; });
input = input_host;
// Run kernel
kernel<<<nBlocks, nThreads>>>(thrust::raw_pointer_cast(input.data()+32), thrust::raw_pointer_cast(output.data()), nValues);
// Check output
for (int i = 0; i < nValues; ++i)
{
float ref = input_host[31+i] + input_host[33+i];
if (ref != output[i])
{
std::cout << "Error at " << i << " : " << ref << " " << output[i] << "\n";
std::cout << "Abort with errors\n";
std::exit(1);
}
}
std::cout << "Success\n";
}
The program works perfectly.
I would expect the compiler to generate one LD.E.128 instruction for the central part of the local array values, and the registers for this central part to be contiguous (e.g. R4, R5, R6, R7); to have two LD.E instructions for both ends of values; to have one ST.E.128 for the output array.
What happens in reality is the following:
code for sm_21
Function : _Z6kernelPKfPfi
/*0000*/ MOV R1, c[0x1][0x100]; /* 0x2800440400005de4 */
/*0008*/ NOP; /* 0x4000000000001de4 */
/*0010*/ MOV32I R3, 0x4; /* 0x180000001000dde2 */
/*0018*/ S2R R0, SR_CTAID.X; /* 0x2c00000094001c04 */
/*0020*/ S2R R2, SR_TID.X; /* 0x2c00000084009c04 */
/*0028*/ IMAD R0, R0, c[0x0][0x8], R2; /* 0x2004400020001ca3 */
/*0030*/ SHL R6, R0, 0x2; /* 0x6000c00008019c03 */
/*0038*/ IMAD R10.CC, R6, R3, c[0x0][0x20]; /* 0x2007800080629ca3 */
/*0040*/ IMAD.HI.X R11, R6, R3, c[0x0][0x24]; /* 0x208680009062dce3 */
/*0048*/ IMAD R2.CC, R6, R3, c[0x0][0x28]; /* 0x20078000a0609ca3 */
/*0050*/ LD.E R4, [R10+0xc]; /* 0x8400000030a11c85 */
/*0058*/ IMAD.HI.X R3, R6, R3, c[0x0][0x2c]; /* 0x20868000b060dce3 */
/*0060*/ LD.E R7, [R10+0x4]; /* 0x8400000010a1dc85 */
/*0068*/ LD.E R9, [R10+-0x4]; /* 0x87fffffff0a25c85 */
/*0070*/ LD.E R5, [R10+0x8]; /* 0x8400000020a15c85 */
/*0078*/ LD.E R0, [R10+0x10]; /* 0x8400000040a01c85 */
/*0080*/ LD.E R8, [R10]; /* 0x8400000000a21c85 */
/*0088*/ FADD R6, R7, R4; /* 0x5000000010719c00 */
/*0090*/ FADD R4, R9, R7; /* 0x500000001c911c00 */
/*0098*/ FADD R7, R5, R0; /* 0x500000000051dc00 */
/*00a0*/ FADD R5, R8, R5; /* 0x5000000014815c00 */
/*00a8*/ ST.E.128 [R2], R4; /* 0x9400000000211cc5 */
/*00b0*/ EXIT; /* 0x8000000000001de7 */
................................
All loads are 32-bit wide (LD.E). On the other side, there is just one store instruction ST.E.128, as expected.
I don't show the whole code here again, but I did a test where the stencil does not need a value to the left, but only one to the right (e.g. *data + *(data+1)), in which case my values array contains just 5 values and the float4 load operation modifies the first 4 values of the array (I still have one extra load for the last value). In that case the compiler uses LD.E.128.
My question is: why doesn't the compiler understand that it can use the 128-bit wide read even when the target register is not the first one in the local array? After all, the local array values is just a programming way of saying that I need 6 floats kept in registers; there is no such thing as an array in the resulting ptx or SASS code. I thought I had given the compiler enough hints to understand that LD.E.128 was the right instruction here.
Second question: how can I make it use the 128-bit wide load here without having to manually write low-level code? (Although if a couple of asm instructions would help, I'm open to suggestions.)
Side note: the decision to use 32-bit loads for reading the input and a 128-bit store for writing the output is taken while producing the ptx code; the ptx already shows this pattern of multiple small loads and a single large store.
I am using CUDA 7.5 under linux.
Based on the suggestions given in the comments, I did some experiments.
Declaring either input or output as __restrict__ (or both) solves the problem: the compiler generates one LD.E.128 and two LD.E, which is what I wanted to achieve, when generating code for the sm_35 architecture. Strangely enough, when generating for sm_21 it still produces six LD.E, though it does produce one ST.E.128. It sounds like a compiler bug to me, because the LD.E.128 instruction should be just as usable on the older architecture as on the newer one.
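Concretely, the fix amounts to nothing more than this change in the kernel signature (a sketch):
__global__ void kernel(const float* __restrict__ input, float* __restrict__ output, int size)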
With just that small change of adding the __restrict__ keyword, as suggested by njuffa, the code presented above uses the 128-bit loads and works. I also followed the suggestion by m.s. and reproduced the results shown in the pastebin snippet (one LD.E.128 + one LD.E.64), but at runtime it crashes with the following error:
terminate called after throwing an instance of 'thrust::system::system_error'
what(): an illegal memory access was encountered
I'm pretty sure the misalignment is the cause of this problem.
Update: after using cuda-memcheck I'm sure the problem is misalignment:
========= Invalid __global__ read of size 16
========= at 0x00000060 in kernel(float const *, float*, int)
========= by thread (4,0,0) in block (7,0,0)
========= Address 0xb043638bc is misaligned
The problem is that the nvcc compiler is unable to resolve the base address for the vector load in your kernel. This may be a bug, or it may simply be an inadequacy of the compiler.
I modified your code a little bit:
__global__ void kernel2(const float* input, float* output, int size)
{
int i = (blockDim.x*blockIdx.x + threadIdx.x);
float values[6];
float res[4];
// Load values
values[0] = *(input+(i*4)-1);
float4 test =*(reinterpret_cast<const float4*>(input)+i);
values[5] = *(input+(i*4)+4);
values[1] = test.x;
values[2] = test.y;
values[3] = test.z;
values[4] = test.w;
// Compute result
res[0] = values[0]+values[2];
res[1] = values[1]+values[3];
res[2] = values[2]+values[4];
res[3] = values[3]+values[5];
// Store result
*(reinterpret_cast<float4*>(output)+i) = *reinterpret_cast<const float4*>(res);
}
The kernel code compiled to ptx:
.visible .entry _Z7kernel2PKfPfi(
.param .u64 _Z7kernel2PKfPfi_param_0,
.param .u64 _Z7kernel2PKfPfi_param_1,
.param .u32 _Z7kernel2PKfPfi_param_2
)
{
.reg .f32 %f<15>;
.reg .b32 %r<7>;
.reg .b64 %rd<10>;
ld.param.u64 %rd1, [_Z7kernel2PKfPfi_param_0];
ld.param.u64 %rd2, [_Z7kernel2PKfPfi_param_1];
mov.u32 %r1, %ntid.x;
mov.u32 %r2, %ctaid.x;
mov.u32 %r3, %tid.x;
mad.lo.s32 %r4, %r2, %r1, %r3;
shl.b32 %r5, %r4, 2;
add.s32 %r6, %r5, -1;
mul.wide.s32 %rd3, %r6, 4;
cvta.to.global.u64 %rd4, %rd1;
add.s64 %rd5, %rd4, %rd3;
ld.global.f32 %f1, [%rd5];
mul.wide.s32 %rd6, %r4, 16;
add.s64 %rd7, %rd4, %rd6;
ld.global.v4.f32 {%f2, %f3, %f4, %f5}, [%rd7];
ld.global.f32 %f10, [%rd5+20];
cvta.to.global.u64 %rd8, %rd2;
add.s64 %rd9, %rd8, %rd6;
add.f32 %f11, %f3, %f5;
add.f32 %f12, %f2, %f4;
add.f32 %f13, %f4, %f10;
add.f32 %f14, %f1, %f3;
st.global.v4.f32 [%rd9], {%f14, %f12, %f11, %f13};
ret;
}
You can see nicely how the addresses for the load are computed (%rd6 and %rd8).
While compiling your kernel to ptx results in:
.visible .entry _Z6kernelPKfPfi(
.param .u64 _Z6kernelPKfPfi_param_0,
.param .u64 _Z6kernelPKfPfi_param_1,
.param .u32 _Z6kernelPKfPfi_param_2
)
{
.reg .f32 %f<11>;
.reg .b32 %r<6>;
.reg .b64 %rd<8>;
ld.param.u64 %rd1, [_Z6kernelPKfPfi_param_0];
ld.param.u64 %rd2, [_Z6kernelPKfPfi_param_1];
cvta.to.global.u64 %rd3, %rd2;
cvta.to.global.u64 %rd4, %rd1;
mov.u32 %r1, %ntid.x;
mov.u32 %r2, %ctaid.x;
mov.u32 %r3, %tid.x;
mad.lo.s32 %r4, %r2, %r1, %r3;
shl.b32 %r5, %r4, 2;
mul.wide.s32 %rd5, %r5, 4;
add.s64 %rd6, %rd4, %rd5;
ld.global.f32 %f1, [%rd6+-4];
ld.global.f32 %f2, [%rd6];
ld.global.f32 %f3, [%rd6+12];
ld.global.f32 %f4, [%rd6+4];
ld.global.f32 %f5, [%rd6+8];
ld.global.f32 %f6, [%rd6+16];
add.s64 %rd7, %rd3, %rd5;
add.f32 %f7, %f5, %f6;
add.f32 %f8, %f4, %f3;
add.f32 %f9, %f2, %f5;
add.f32 %f10, %f1, %f4;
st.global.v4.f32 [%rd7], {%f10, %f9, %f8, %f7};
ret;
}
where the compiler only generates code to compute one address (%rd6) and uses static offsets. At this point the compiler failed to emit a vector load. Why? I honestly don't know, maybe two optimizations interfere here.
In SASS you see for kernel2:
.section .text._Z7kernel2PKfPfi,"ax",#progbits
.sectioninfo #"SHI_REGISTERS=18"
.align 64
.global _Z7kernel2PKfPfi
.type _Z7kernel2PKfPfi,#function
.size _Z7kernel2PKfPfi,(.L_39 - _Z7kernel2PKfPfi)
.other _Z7kernel2PKfPfi,#"STO_CUDA_ENTRY STV_DEFAULT"
_Z7kernel2PKfPfi:
.text._Z7kernel2PKfPfi:
/*0008*/ MOV R1, c[0x0][0x44];
/*0010*/ S2R R0, SR_CTAID.X;
/*0018*/ MOV R4, c[0x0][0x140];
/*0020*/ S2R R3, SR_TID.X;
/*0028*/ MOV R5, c[0x0][0x144];
/*0030*/ IMAD R3, R0, c[0x0][0x28], R3;
/*0038*/ MOV32I R8, 0x10;
/*0048*/ IMAD R16.CC, R3, 0x10, R4;
/*0050*/ ISCADD R0, R3, -0x1, 0x2;
/*0058*/ IMAD.HI.X R17, R3, 0x10, R5;
/*0060*/ IMAD R14.CC, R0, 0x4, R4;
/*0068*/ IMAD.HI.X R15, R0, 0x4, R5;
/*0070*/ LD.E.128 R4, [R16];
/*0078*/ LD.E R2, [R14];
/*0088*/ IMAD R12.CC, R3, R8, c[0x0][0x148];
/*0090*/ LD.E R0, [R14+0x14];
/*0098*/ IMAD.HI.X R13, R3, R8, c[0x0][0x14c];
/*00a0*/ FADD R9, R4, R6;
/*00a8*/ FADD R10, R5, R7;
/*00b0*/ FADD R8, R2, R5;
/*00b8*/ FADD R11, R6, R0;
/*00c8*/ ST.E.128 [R12], R8;
/*00d0*/ EXIT;
.L_1:
/*00d8*/ BRA `(.L_1);
.L_39:
Here you have your LD.E.128.
Compiled with nvcc release 7.5, V7.5.17.

OpenCL MultiGPU slower than single GPU

I am developing an application which performs some processing on video frame data. To accelerate it I use 2 graphic cards and process the data with OpenCL. My idea is to send one frame to the first card and another one to the second card. The devices use the same context, but different command queues, kernels and memory objects.
However, it seems to me that the computations are not executed in parallel, because the time required by the 2 cards is almost the same as the time required by only one graphic card.
Does anyone have a good example of using multiple devices on independent data pieces simultaneously?
Thanks in advance.
EDIT:
Here is the resulting code after switching to 2 separate contexts. However, the execution time with 2 graphic cards still remains the same as with 1 graphic card.
cl::NDRange globalws(imageSize);
cl::NDRange localws;
for (int i = 0; i < numDevices; i++){
// Copy the input data to the device
commandQueues[i].enqueueWriteBuffer(inputDataBuffer[i], CL_TRUE, 0, imageSize*sizeof(float), wt[i].data);
// Set kernel arguments
kernel[i].setArg(0, inputDataBuffer[i]);
kernel[i].setArg(1, modulusBuffer[i]);
kernel[i].setArg(2, imagewidth);
}
for (int i = 0; i < numDevices; i++){
// Run kernel
commandQueues[i].enqueueNDRangeKernel(kernel[i], cl::NullRange, globalws, localws);
}
for (int i = 0; i < numDevices; i++){
// Read the modulus back to the host
float* modulus = new float[imageSize/4];
commandQueues[i].enqueueReadBuffer(modulusBuffer[i], CL_TRUE, 0, imageSize/4*sizeof(float), modulus);
// Do something with the modulus;
}
Your main problem is that you are using blocking calls. It doesn't matter how many devices you have if you operate them that way: you perform an operation and wait for it to finish, so there is no parallelization at all (or very little). You are doing this at the moment:
Wr:-Copy1--Copy2--------------------
G1:---------------RUN1--------------
G2:---------------RUN2--------------
Re:-------------------Read1--Read2--
You should change your code to do it like this at least:
Wr:-Copy1-Copy2-----------
G1:------RUN1-------------
G2:------------RUN2-------
Re:----------Read1-Read2--
With this code:
cl::NDRange globalws(imageSize);
cl::NDRange localws;
for (int i = 0; i < numDevices; i++){
// Set kernel arguments //YOU SHOULD DO THIS AT INIT STAGE, IT IS SLOW TO DO IT IN A LOOP
kernel[i].setArg(0, inputDataBuffer[i]);
kernel[i].setArg(1, modulusBuffer[i]);
kernel[i].setArg(2, imagewidth);
// Copy the input data to the device
commandQueues[i].enqueueWriteBuffer(inputDataBuffer[i], CL_FALSE, 0, imageSize*sizeof(float), wt[i].data);
}
for (int i = 0; i < numDevices; i++){
// Run kernel
commandQueues[i].enqueueNDRangeKernel(kernel[i], cl::NullRange, globalws, localws);
}
float* modulus[numDevices];
for (int i = 0; i < numDevices; i++){
// Read the modulus back to the host
modulus[i] = new float[imageSize/4];
commandQueues[i].enqueueReadBuffer(modulusBuffer[i], CL_FALSE, 0, imageSize/4*sizeof(float), modulus[i]);
}
for (int i = 0; i < numDevices; i++)
commandQueues[i].finish(); // wait for all devices to finish
// Do something with the modulus;
Regarding the comments about having multiple contexts: it depends on whether you will ever need the two GPUs to communicate with each other. As long as each GPU only uses its own memory, there will be no copy overhead. But if you set/unset kernel args constantly, that will trigger copies to the other GPU. So, be careful with that.
The safer approach when the GPUs do not need to communicate is to use different contexts.
I suspect your main problem is the memory copy, not the kernel execution; quite likely a single GPU would fulfil your needs if you hid the memory latency:
Wr:-Copy1-Copy2-Copy3----------
G1:------RUN1--RUN2--RUN3------
Re:----------Read1-Read2-Read3-
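A minimal sketch of that kind of latency hiding on a single device, using two in-order queues and ping-pong buffer sets (all names here are illustrative; whether copies and kernels really overlap depends on the device's DMA engines):
cl::CommandQueue q[2] = { cl::CommandQueue(context, device), cl::CommandQueue(context, device) };
for (int f = 0; f < numFrames; f++){
int s = f % 2; // alternate between the two queue/buffer sets
q[s].enqueueWriteBuffer(inputBuf[s], CL_FALSE, 0, imageSize*sizeof(float), frames[f]);
q[s].enqueueNDRangeKernel(kern[s], cl::NullRange, globalws, localws);
q[s].enqueueReadBuffer(outputBuf[s], CL_FALSE, 0, imageSize*sizeof(float), results[f]);
}
q[0].finish(); q[1].finish();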

Simple Vector Geometric Progression Design in OpenCL

I'm new to OpenCL and in order to get a better grasp of a few concepts I contrived a simple example of a geometric progression as follows (emphasis on contrived):
An array of N values and N coefficients (whose values could be
anything, but in the example they all are the same) are allocated.
M steps are performed in sequence where each value in the values array
is multiplied by its corresponding coefficient in the coefficients
array and assigned as the new value in the values array. Each step needs to fully complete before the next step can begin. I know this part is a bit contrived, but this is a requirement I want to enforce to help my understanding of OpenCL.
I'm only interested in the values in the values array after the final step has completed.
Here is the very simple OpenCL kernel (MultiplyVectors.cl):
__kernel void MultiplyVectors (__global float4* x, __global float4* y, __global float4* result)
{
int i = get_global_id(0);
result[i] = x[i] * y[i];
}
And here is the host program (main.cpp):
#include <CL/cl.hpp>
#include <vector>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
int main ()
{
auto context = cl::Context (CL_DEVICE_TYPE_GPU);
auto *sourceFile = fopen("MultiplyVectors.cl", "r");
if (sourceFile == nullptr)
{
perror("Couldn't open the source file");
return 1;
}
fseek(sourceFile, 0, SEEK_END);
const auto sourceSize = ftell(sourceFile);
auto *sourceBuffer = new char [sourceSize + 1];
sourceBuffer[sourceSize] = '\0';
rewind(sourceFile);
fread(sourceBuffer, sizeof(char), sourceSize, sourceFile);
fclose(sourceFile);
auto program = cl::Program (context, cl::Program::Sources {std::make_pair (sourceBuffer, sourceSize + 1)});
delete[] sourceBuffer;
const auto devices = context.getInfo<CL_CONTEXT_DEVICES> ();
program.build (devices);
auto kernel = cl::Kernel (program, "MultiplyVectors");
const size_t vectorSize = 1024;
float coeffs[vectorSize] {};
for (size_t i = 0; i < vectorSize; ++i)
{
coeffs[i] = 1.000001;
}
auto coeffsBuffer = cl::Buffer (context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof (coeffs), coeffs);
float values[vectorSize] {};
for (size_t i = 0; i < vectorSize; ++i)
{
values[i] = static_cast<float> (i);
}
auto valuesBuffer = cl::Buffer (context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, sizeof (values), values);
kernel.setArg (0, coeffsBuffer);
kernel.setArg (1, valuesBuffer);
kernel.setArg (2, valuesBuffer);
auto commandQueue = cl::CommandQueue (context, devices[0]);
for (size_t i = 0; i < 1000000; ++i)
{
commandQueue.enqueueNDRangeKernel (kernel, cl::NDRange (0), cl::NDRange (vectorSize / 4), cl::NullRange);
}
printf ("All kernels enqueued. Waiting to read buffer after last kernel...");
commandQueue.enqueueReadBuffer (valuesBuffer, CL_TRUE, 0, sizeof (values), values);
return 0;
}
What I'm basically asking is for advice on how to best optimize this OpenCL program to run on a GPU. I have the following questions based on my limited OpenCL experience to get the conversation going:
Could I be handling the buffers better? I'd like to minimize any
unnecessary ferrying of data between the host and the GPU.
What's the optimal work group configuration (in general at least, I
know this can vary by GPU)? I'm not actually sharing any data
between work items and it doesn't seem like I'd benefit from work
groups much here, but just in case.
Should I be allocating and loading anything into local memory for a
work group (if that would make sense at all)?
I'm currently enqueuing one kernel for each step, which will create a
work item for each 4 floats to take advantage of a hypothetical GPU with a SIMD
width of 128 bits. I'm attempting to enqueue all of this
asynchronously (although I'm noticing the Nvidia implementation I have
seems to block each enqueue until the kernel is complete) at once
and then wait on the final one to complete. Is there a better
approach to this that I'm missing?
Is there a design that would allow for only one call to
enqueueNDRangeKernel (instead of one call per step) while
maintaining the ability for each step to be efficiently processed in
parallel?
Obviously I know that the example problem I'm solving can be done in much better ways, but I wanted to have as simple an example as possible that illustrated a vector of values being operated on in a series of steps, where each step has to be completed fully before the next one begins. Any help and pointers on how to best go about this would be greatly appreciated.
Thanks!
