OpenCL on Xeon Phi: 2D Convolution Experience - OpenCL vs OpenMP

The performance of the Xeon Phi, benchmarked with a 2D convolution in OpenCL, seems much better than that of an OpenMP implementation, even with compiler-enabled vectorization. The OpenMP version was run in Phi native mode, and the timing measured only the computation part (the for-loop). For the OpenCL implementation, the timing likewise covered only the kernel computation: no data transfer was included. The OpenMP version was tested with 2, 4, 60, 120, and 240 threads; 240 threads gave the best performance with a balanced thread affinity setting. But OpenCL was around 17x faster even against the 240-thread OpenMP baseline with pragma-enabled vectorization in the source code. Input image sizes range from 1024x1024 up to 16384x16384, and filter sizes from 3x3 up to 17x17. In all runs, OpenCL was better than OpenMP. Is this an expected speedup for OpenCL? It seems too good to be true.
EDIT:
Compilation (openmp)
icc Convolve.cpp -fopenmp -mmic -O3 -vec-report1 -o conv.mic
Convolve.cpp(71): (col. 17) remark: LOOP WAS VECTORIZED
Source (Convolve.cpp):
void Convolution_Threaded(float * pInput, float * pFilter, float * pOutput,
                          const int nInWidth, const int nWidth, const int nHeight,
                          const int nFilterWidth, const int nNumThreads)
{
    #pragma omp parallel for num_threads(nNumThreads)
    for (int yOut = 0; yOut < nHeight; yOut++)
    {
        const int yInTopLeft = yOut;
        for (int xOut = 0; xOut < nWidth; xOut++)
        {
            const int xInTopLeft = xOut;
            float sum = 0;
            for (int r = 0; r < nFilterWidth; r++)
            {
                const int idxFtmp = r * nFilterWidth;
                const int yIn = yInTopLeft + r;
                const int idxIntmp = yIn * nInWidth + xInTopLeft;
                #pragma ivdep           //discards any data dependencies assumed by compiler
                #pragma vector aligned  //all data accessed in the loop is properly aligned
                for (int c = 0; c < nFilterWidth; c++)
                {
                    const int idxF = idxFtmp + c;
                    const int idxIn = idxIntmp + c;
                    sum += pFilter[idxF]*pInput[idxIn];
                }
            }
            const int idxOut = yOut * nWidth + xOut;
            pOutput[idxOut] = sum;
        }
    }
}
Source 2 (convolve.cl)
__kernel void Convolve(const __global float * pInput,
                       __constant float * pFilter,
                       __global float * pOutput,
                       const int nInWidth,
                       const int nFilterWidth)
{
    const int nWidth = get_global_size(0);
    const int xOut = get_global_id(0);
    const int yOut = get_global_id(1);
    const int xInTopLeft = xOut;
    const int yInTopLeft = yOut;
    float sum = 0;
    for (int r = 0; r < nFilterWidth; r++)
    {
        const int idxFtmp = r * nFilterWidth;
        const int yIn = yInTopLeft + r;
        const int idxIntmp = yIn * nInWidth + xInTopLeft;
        for (int c = 0; c < nFilterWidth; c++)
        {
            const int idxF = idxFtmp + c;
            const int idxIn = idxIntmp + c;
            sum += pFilter[idxF]*pInput[idxIn];
        }
    }
    const int idxOut = yOut * nWidth + xOut;
    pOutput[idxOut] = sum;
}
Result of OpenMP (in comparison with OpenCL):
         image      filter  exec time (ms)
OpenMP   2048x2048  3x3     23.4
OpenCL   2048x2048  3x3     1.04*
*Raw kernel execution time. Data transfer time over PCI bus not included.

Previously (with #pragma ivdep and #pragma vector aligned on the innermost loop):
Compiler output:
Convolve.cpp(24): (col. 17) remark: LOOP WAS VECTORIZED
Program output:
120 Cores: 0.0087 ms
After advice from @jprice (with #pragma simd on the horizontal pixel loop):
Compiler output:
Convolve.cpp(24): (col. 9) remark: **SIMD** LOOP WAS VECTORIZED
Program output:
120 Cores: 0.00305
OpenMP now 2.8X faster compared to its previous execution. A fair comparison can now be made with OpenCL!
Thanks jprice and to everyone who contributed. Learnt huge lessons from you all.
EDIT:
Here are my results and comparison:
         image      filter  exec time (ms)
OpenMP   2048x2048  3x3     4.3
OpenCL   2048x2048  3x3     1.04
Speedup: 4.1X
Can OpenCL really be this much faster than OpenMP?

Intel's OpenCL implementation will use what they call "implicit vectorisation" in order to take advantage of vector floating point units. This involves mapping work-items onto SIMD lanes. In your example, each work-item is processing a single pixel, which means that each hardware thread will be processing 16 pixels at a time using the Xeon Phi's 512-bit vector units.
By contrast, your OpenMP code is parallelising across pixels, and then vectorising the computation within a pixel. This is almost certainly where the performance difference is coming from.
In order to get ICC to vectorize your OpenMP code in a manner that is similar to the implicitly vectorised OpenCL code, you should remove your #pragma ivdep and #pragma vector aligned statements from the innermost loop, and instead just place a #pragma simd in front of the horizontal pixel loop:
#pragma omp parallel for num_threads(nNumThreads)
for (int yOut = 0; yOut < nHeight; yOut++)
{
    const int yInTopLeft = yOut;
    #pragma simd
    for (int xOut = 0; xOut < nWidth; xOut++)
    {
        // ... loop body as before, without the ivdep / vector aligned pragmas ...
    }
}
When I compile this with ICC, it reports that it is successfully vectorising the desired loop.
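As a rough, self-contained sketch of the restructuring described above (not the asker's exact code), the function below vectorises across the output pixels of a row rather than across the filter taps. It assumes the standard OpenMP 4.0 `#pragma omp simd` as a portable stand-in for ICC's `#pragma simd`; the function name `convolve_rows_simd` and its parameter names are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: "valid" 2D convolution over a row-major image.
 * The input must hold at least (out_height + filter_width - 1) rows of
 * in_width floats; the output is out_height rows of out_width floats. */
void convolve_rows_simd(const float *in, const float *filter, float *out,
                        int in_width, int out_width, int out_height,
                        int filter_width)
{
    assert(in != NULL && filter != NULL && out != NULL);
    assert(out_width + filter_width - 1 <= in_width);

    /* Threads split the rows ... */
    #pragma omp parallel for
    for (int y = 0; y < out_height; y++) {
        /* ... and SIMD lanes split the pixels within each row. */
        #pragma omp simd
        for (int x = 0; x < out_width; x++) {
            float sum = 0.0f;
            for (int r = 0; r < filter_width; r++) {
                for (int c = 0; c < filter_width; c++) {
                    sum += filter[r * filter_width + c]
                         * in[(size_t)(y + r) * in_width + (x + c)];
                }
            }
            out[(size_t)y * out_width + x] = sum;
        }
    }
}
```

Built with an OpenMP flag such as -fopenmp, the pragmas take effect; without it they are ignored and the loop nest simply runs serially with the same result. Whether the compiler actually vectorises the pixel loop still has to be confirmed from its vectorisation report.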

Your OpenMP program uses one thread per image row, and the pixels within a row are vectorized. That is equivalent to a one-dimensional work-group in OpenCL, where each work-group processes one row of the image. Your OpenCL code, however, appears to use a two-dimensional work-group, so each work-group (mapped onto one thread on the Phi) processes a BLOCK of the image rather than a ROW. The cache hit rate will therefore be different.

Related

OpenCL Optimization

I'm new to OpenCL.
I wrote an OpenCL kernel to compute grayscale. How can I optimize this code, if that is possible? Why does the computation time fluctuate so much? Sometimes I get a speedup, other times not. Am I doing something wrong?
kernel code:
kernel void grayscale(__global unsigned char *input)
{
size_t i = get_global_id(0);
float grayscaleValue = (input[i*3] * 0.299F) + (input[i*3+1] * 0.587F) + (input[i*3+2] * 0.114F);
input[i*3] = grayscaleValue;
input[i*3+1] = grayscaleValue;
input[i*3+2] = grayscaleValue;
}
cpu code:
void GrayScaleCPU(struct PPMFile *ppmStruct)
{
for (int i = 0; i < ppmStruct->imageSize; i+=3)
{
float greyscaleValue = (ppmStruct->data[i] * 0.299F) + (ppmStruct->data[i+1] * 0.587F) + (ppmStruct->data[i+2] * 0.114F);
ppmStruct->out[i] = greyscaleValue;
ppmStruct->out[i+1] = greyscaleValue;
ppmStruct->out[i+2] = greyscaleValue;
}
}
int main(void)
{
struct timespec tS1, tS2;
tS1.tv_sec = 0;
tS1.tv_nsec = 0;
tS2.tv_sec = 0;
tS2.tv_nsec = 0;
...
clock_settime(CLOCK_REALTIME, &tS1);
GrayScaleCPU(ppmf);
clock_gettime(CLOCK_REALTIME, &tS1);
printf ("Timming took %.12lu seconds to run.\n", tS1.tv_nsec);
...
clock_settime(CLOCK_REALTIME, &tS2);
GrayScaleOpenCL(ppmf2);
clock_gettime(CLOCK_REALTIME, &tS2);
printf ("Timming took %.12lu seconds to run.\n", tS2.tv_nsec);
float time2 = tS2.tv_nsec;
float time1 = tS1.tv_nsec;
float speedup = time2/time1;
printf ("Speed UP OpenCL/CPU %.20f.\n", speedup);
return 0;
}
Try buffering your global memory into thread (private) memory:
unsigned char l_input0 = input[i*3];
unsigned char l_input1 = input[i*3 + 1];
unsigned char l_input2 = input[i*3 + 2];
// compute grayscaleValue from l_input0, l_input1 and l_input2
input[i*3] = grayscaleValue;
input[i*3 + 1] = grayscaleValue;
input[i*3 + 2] = grayscaleValue;
Also, if your data isn't spaced properly when you call your kernel, you may end up executing on each unsigned char, instead of every 3rd unsigned char as in your for loop example.
You can then go further using local memory and work groups and do your calculations in chunks, though that is more challenging, as local work sizes are very device specific and the global work size needs to be a multiple of the local work size. I've found local work sizes of 16, 32, and 64 work on most devices.
Finally, when benchmarking OpenCL, make sure you are measuring kernel performance and not kernel enqueue time. The easiest way to do this is to start a timer, enqueue your kernel, call clFinish on the queue, then stop the timer. Most OpenCL implementations also have timing and profiling built in, which is exposed through the command queue.
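A minimal host-side sketch of the start/stop timing described above. Note that the `main` in the question calls clock_settime, which sets the clock rather than reading it, and prints only the tv_nsec field, so its numbers are not elapsed times. The version below reads CLOCK_MONOTONIC before and after the work and combines both fields. The names `elapsed_ms` and `time_call_ms` are hypothetical; for an OpenCL run, the `work` callback is assumed to enqueue the kernel and then block (for example with clFinish) before returning.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Elapsed time between two timestamps, in milliseconds.
 * Uses both the seconds and the nanoseconds fields. */
double elapsed_ms(struct timespec start, struct timespec end)
{
    return (double)(end.tv_sec - start.tv_sec) * 1000.0
         + (double)(end.tv_nsec - start.tv_nsec) / 1.0e6;
}

/* Hypothetical wrapper: times a single call of work(arg).
 * work() must not return before the measured computation has finished. */
double time_call_ms(void (*work)(void *), void *arg)
{
    struct timespec t0, t1;
    assert(work != NULL);
    clock_gettime(CLOCK_MONOTONIC, &t0);   /* read the clock, do not set it */
    work(arg);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return elapsed_ms(t0, t1);
}
```

Running each variant several times and comparing the minimum or median, rather than a single run, should also make the fluctuation easier to judge.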

Copy portion of global array to local memory

I'm using PyOpenCL to let my GPU do some regression on a large data set. Right now the GPU is slower than the CPU, probably because there is a loop that requires access to the global memory during each increment (I think...). The data set is too large to store into the local memory, but each loop does not require the entire data set, so I want to copy a portion of this array to the local memory. My question is: how do I do this? In Python one can easily slice a portion, but I don't think that's possible in OpenCL.
Here's the OpenCL code I'm using, if you spot any more potential optimisations, please shout:
__kernel void gpu_slope(__global double * data, __global double * time, __global int * win_results, const unsigned int N, const unsigned int Nmax, const double e, __global double * result) {
__local unsigned int n, length, leftlim, rightlim, i;
__local double sumx, sumy, x, y, xx, xy, invlen, a, b;
n = get_global_id(0);
leftlim = win_results[n*2];
rightlim = win_results[n*2+1];
sumx = 0;
sumy = 0;
xy = 0;
xx = 0;
length = rightlim - leftlim;
for(i = leftlim; i <= rightlim; i++) {
x = time[i]; /* I think this is fetched from global memory */
y = data[i];
sumx += x;
sumy += y;
xy += x*y;
xx += x*x;
}
invlen = 1.0/length;
a = xy-(sumx*sumy)*invlen;
b = xx-(sumx*sumx)*invlen;
result[n] = a/b;
}
I'm new to OpenCL, so please bear with me. Thanks!
The main(ish) point in GPU computing is trying to utilize hardware parallelism as much as possible. Instead of using the loop, launch a kernel with a different thread for every one of the coordinates. Then, either use atomic operations (the quick-to-code, but slow-performance option), or parallel reduction, for the various sums.
AMD has a tutorial on this subject. (NVidia does too, but theirs would be CUDA-based...)
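Before restructuring the kernel around a parallel reduction, it may help to have a serial reference for one window to validate the GPU results against. The sketch below computes the same four sums and the same slope formula as the kernel in the question; `window_slope` is a hypothetical name. One deliberate difference, stated as an assumption: it divides by the number of points actually summed (right - left + 1, since the loop bounds are inclusive), whereas the kernel uses rightlim - leftlim.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical serial reference: least-squares slope of data against time
 * over the inclusive index window [left, right]. */
double window_slope(const double *time, const double *data,
                    size_t left, size_t right)
{
    assert(time != NULL && data != NULL);
    assert(right > left);               /* need at least two points */

    double sumx = 0.0, sumy = 0.0, xy = 0.0, xx = 0.0;
    for (size_t i = left; i <= right; i++) {
        const double x = time[i];
        const double y = data[i];
        sumx += x;
        sumy += y;
        xy   += x * y;
        xx   += x * x;
    }

    /* Assumption: normalise by the number of points in the window. */
    const double inv_n = 1.0 / (double)(right - left + 1);
    const double a = xy - (sumx * sumy) * inv_n;
    const double b = xx - (sumx * sumx) * inv_n;
    return a / b;
}
```

The four running sums are exactly the quantities a parallel reduction would have to produce per window.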
You will find examples copying to local memory in PyOpenCL's examples folder: https://github.com/inducer/pyopencl/tree/master/examples
I recommend you read, run, and customize several of these examples to learn.
I also recommend the Udacity parallel programming course: https://www.udacity.com/course/cs344 This course will help solidify your grasp of fundamental OpenCL concepts.

boosting parallel reduction OpenCL

I have an algorithm that performs a two-stage parallel reduction on the GPU to find the smallest element in a string. I know that there is a hint on how to make it work faster, but I don't know what it is. Any ideas on how I can tune this kernel to speed my program up? It is not necessary to actually change the algorithm; maybe there are other tricks. All ideas are welcome.
Thank you!
__kernel
void reduce(__global float* buffer,
__local float* scratch,
__const int length,
__global float* result) {
int global_index = get_global_id(0);
float accumulator = INFINITY;
while (global_index < length) {
float element = buffer[global_index];
accumulator = (accumulator < element) ? accumulator : element;
global_index += get_global_size(0);
}
int local_index = get_local_id(0);
scratch[local_index] = accumulator;
barrier(CLK_LOCAL_MEM_FENCE);
for(int offset = get_local_size(0) / 2;
offset > 0;
offset = offset / 2) {
if (local_index < offset) {
float other = scratch[local_index + offset];
float mine = scratch[local_index];
scratch[local_index] = (mine < other) ? mine : other;
}
barrier(CLK_LOCAL_MEM_FENCE);
}
if (local_index == 0) {
result[get_group_id(0)] = scratch[0];
}
}
accumulator = (accumulator < element) ? accumulator : element;
Use the fmin function: it is exactly what you need, and it may result in faster code (a call to a built-in instruction, if available, instead of costly branching).
global_index += get_global_size(0);
What is your typical get_global_size(0)?
Though your access pattern is not very bad (it is coalesced, in 128-byte chunks for a 32-thread warp), it is better to access memory sequentially whenever possible. For instance, sequential access may aid memory prefetching (note that OpenCL code can be executed on any device, including a CPU).
Consider the following scheme: each thread would process the range
[ get_global_id(0)*delta , (get_global_id(0)+1)*delta )
It will result in fully sequential access.
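To make the two index schemes concrete, here is a plain C sketch of what a single work-item would compute under each of them. The function names are hypothetical, `id` stands in for get_global_id(0), and `stride` for get_global_size(0). The small `min_f` helper marks the place where an OpenCL kernel would call the built-in fmin.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Stand-in for the OpenCL built-in fmin on floats. */
static float min_f(float a, float b)
{
    return (a < b) ? a : b;
}

/* Strided scheme from the question: visits id, id + stride, id + 2*stride, ... */
float min_strided(const float *buf, size_t length, size_t id, size_t stride)
{
    assert(stride > 0);
    float acc = INFINITY;
    for (size_t i = id; i < length; i += stride)
        acc = min_f(acc, buf[i]);
    return acc;
}

/* Blocked scheme: visits the contiguous range [id*delta, (id+1)*delta). */
float min_blocked(const float *buf, size_t length, size_t id, size_t delta)
{
    float acc = INFINITY;
    const size_t begin = id * delta;
    size_t end = begin + delta;
    if (end > length)
        end = length;               /* clamp the last, possibly short, block */
    for (size_t i = begin; i < end; i++)
        acc = min_f(acc, buf[i]);
    return acc;
}
```

Both schemes cover every element exactly once when delta is at least length divided by the number of work-items (rounded up), so the overall minimum is the same; only the memory access order differs. Which order is faster depends on the device and is worth measuring on the target hardware.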

OpenCL kernel work-group size restriction

So I keep running into strange errors when I call my kernels; the stated max kernel work-group size is one, while the work group size of my device (my Macbook) is decidedly higher than that. What possible causes could there be for the kernels restricting the code to a single work group? Here's one of my kernels:
__kernel
void termination_kernel(const int Elements,
__global float* c_I,
__global float* c_Ihat,
__global float* c_rI,
__local float* s_a)
{
const int bdim = 128;
int n = get_global_id(0);
const int tx = get_local_id(0); // thread index in thread-block (0-indexed)
const int bx = get_group_id(0); // block index (0-indexed)
const int gx = get_num_groups(0);
// is thread in range for the addition
float d = 0.f;
while(n < Elements){
d += pow(c_I[n] - c_Ihat[n], 2);
n += gx * bdim;
}
// assume bx power of 2
int alive = bdim / 2;
s_a[tx] = d;
barrier(CLK_LOCAL_MEM_FENCE);
while(alive > 1){
if(tx < alive)
s_a[tx] += s_a[tx + alive];
alive /= 2;
barrier(CLK_LOCAL_MEM_FENCE);
}
if(tx == 0)
c_rI[bx] = s_a[0] + s_a[1];
}
and the error returned is
OpenCL Error (via pfn_notify): [CL_INVALID_WORK_GROUP_SIZE] : OpenCL Error : clEnqueueNDRangeKernel
failed: total work group size (128) is greater than the device can support (1)
OpenCL Error: 'clEnqueueNDRangeKernel(queue, kernel_N, dim, NULL, global_N, local_N, 0, NULL, NULL)'
I know it says the restriction is on the device, but debugging shows that
CL_DEVICE_MAX_WORK_GROUP_SIZE = 1024
and
CL_KERNEL_WORK_GROUP_SIZE = 1
The kernel construction is called by
char *KernelSource_T = readSource("Includes/termination_kernel.cl");
cl_program program_T = clCreateProgramWithSource(context, 1, (const char **) &KernelSource_T, NULL, &err);
clBuildProgram(program_T, 1, &device, flags, NULL, NULL);
cl_kernel kernel_T = clCreateKernel(program_T, "termination_kernel", &err);
I'd include the calling function, but I'm not sure if it's relevant; my intuition is that it's something in the kernel code that's forcing the restriction. Any ideas? Thanks in advance for the help!
Apple OpenCL doesn't support work-groups larger than [1, 1, 1] on the CPU. I have no idea why, but that's how it's been at least up to OSX 10.9.2. Larger work-groups are fine on the GPU, though.
CL_KERNEL_WORK_GROUP_SIZE tells you how large the maximum work group size can be for this particular kernel. OpenCL's runtime determines that by inspecting the kernel code. CL_KERNEL_WORK_GROUP_SIZE will be a number less than or equal to CL_DEVICE_MAX_WORK_GROUP_SIZE.
Perhaps the amount of local memory available is too small for that work-group size. Could you show the kernel arguments? You can also try reducing the work-group size: go through 2, 4, 8, 16, 32, 64, 128 and so on, making sure it is a power of 2.
Time has passed since the answer of Tomi and it seems that Apple has become slightly more flexible on this aspect. On my OS X 10.12.3 (still OpenCL 1.2), it is possible to use up to CL_DEVICE_MAX_WORK_GROUP_SIZE in the first dimension.
According to the specification, it is also possible to get the maximum number of work-items in each dimension of a work-group through CL_DEVICE_MAX_WORK_ITEM_SIZES.
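One defensive pattern that follows from these answers is to derive the local size from the per-kernel limit instead of hard-coding it. The sketch below is plain C with hypothetical helper names. `kernel_max` is assumed to be the value obtained from clGetKernelWorkGroupInfo with CL_KERNEL_WORK_GROUP_SIZE, and `preferred` is assumed to be a power of two.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: halve the preferred local size until it fits
 * within the per-kernel limit. Returns at least 1. */
size_t pick_local_size(size_t kernel_max, size_t preferred)
{
    assert(kernel_max >= 1 && preferred >= 1);
    size_t local = preferred;
    while (local > kernel_max)
        local /= 2;
    return local;
}

/* Hypothetical helper: round the element count up to a multiple of the
 * local size, so the global size is evenly divisible by it. */
size_t round_up_global(size_t n, size_t local)
{
    assert(local >= 1);
    return ((n + local - 1) / local) * local;
}
```

With the values from the question (a limit of 1 and a preferred size of 128) this yields a local size of 1. The kernel shown hard-codes bdim = 128, so it would need to use get_local_size(0) instead for a smaller work-group to give correct results.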

OpenCL scalar vs vector

I have simple kernel:
__kernel void vecadd(__global const float *A,
__global const float *B,
__global float *C)
{
int idx = get_global_id(0);
C[idx] = A[idx] + B[idx];
}
Why does the kernel run more than 30% slower when I change float to float4?
All the tutorials say that using vector types speeds up computation...
On the host side, the memory allocated for the float4 arguments is 16-byte aligned, and global_work_size for clEnqueueNDRangeKernel is 4 times smaller.
Kernel runs on AMD HD5770 GPU, AMD-APP-SDK-v2.6.
Device info for CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT returns 4.
EDIT:
global_work_size = 1024*1024 (and greater)
local_work_size = 256
Time measured using CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END.
For a smaller global_work_size (8196 for float / 2048 for float4), the vectorized version is faster, but I would like to know why.
I don't know which tutorials you are referring to, but they must be old.
Both ATI and NVIDIA have used scalar GPU architectures for at least half a decade now.
Nowadays, using vectors in your code is only a syntactic convenience; it brings no performance benefit over plain scalar code.
It turns out that a scalar architecture is better for GPUs than a vector one: it is better at utilizing the hardware resources.
I am not sure why the vectors would be that much slower for you without knowing more about the work-group and global sizes. I would expect at least the same performance.
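A quick back-of-the-envelope check, assuming the kernel is limited by memory bandwidth (typical for a plain element-wise add, but an assumption here): both variants move exactly the same number of bytes, so the float4 version has no less work to do on the memory side. The helper name `vecadd_bytes_moved` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: total bytes read and written by the vecadd kernel.
 * Each element costs two loads (A and B) and one store (C). */
size_t vecadd_bytes_moved(size_t work_items, size_t floats_per_item)
{
    assert(floats_per_item >= 1);
    return work_items * floats_per_item * 3 * sizeof(float);
}
```

For example, 1024*1024 work-items on float and 1024*1024/4 work-items on float4 both give the same total, so any timing difference has to come from how the accesses are issued and scheduled rather than from the amount of data.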
If it is suitable for your kernel, can you start with C already holding the values in A? This would reduce the number of buffers from three to two, although each element still costs two reads and one write. Maybe this applies to your situation?
__kernel void vecadd(__global const float4 *B,
__global float4 *C)
{
int idx = get_global_id(0);
C[idx] += B[idx];
}
Also, have you tried reading the values into a private vector and then adding? Or maybe try both strategies.
__kernel void vecadd(__global const float4 *A,
__global const float4 *B,
__global float4 *C)
{
int idx = get_global_id(0);
float4 tmp = A[idx] + B[idx];
C[idx] = tmp;
}

Resources