Parallel reduction using local memory in OpenCL - opencl

I implemented a reduce kernel in OpenCL to sum up all entries in the input vector of size N. For a easier testing I initialize the input vector with 1.0f. So the result should be N. But it is not!
Here is my reduce-kernel:
kernel void reduce(global float* input, global float* output, const unsigned int N, local float* cache)
{
const uint local_id = get_local_id(0);
const uint global_id = get_global_id(0);
const uint local_size = get_local_size(0);
cache[local_id] = (global_id < N) ? input[global_id] : 0.0f;
barrier(CLK_LOCAL_MEM_FENCE);
for (unsigned int s = local_size >> 1; s > 0; s >>= 1) {
if (local_id < s) {
cache[local_id] += cache[local_id + s];
}
barrier(CLK_LOCAL_MEM_FENCE);
}
if (local_id == 0) output[local_size] = cache[0];
}
And here is the setting for OpenCL:
const uint N = 8196;
cl_float a[N];
cl_float b[N];
for (uint i=0; i<N; i++) {
a[i] = 1.0f;
b[i] = 0.0f;
}
cl::Buffer inputBuffer(context, CL_MEM_WRITE_ONLY, sizeof(cl_float)*N);
cl::Buffer resultBuffer(context, CL_MEM_READ_ONLY, sizeof(cl_float)*N);
queue.enqueueWriteBuffer(inputBuffer, CL_TRUE, 0, sizeof(cl_float)*N, a);
queue.enqueueWriteBuffer(resultBuffer, CL_TRUE, 0, sizeof(cl_float)*N, b);
cl::Kernel addVectorKernel = cl::Kernel(program, "reduce");
size_t localSize = addVectorKernel.getWorkGroupInfo<CL_KERNEL_WORK_GROUP_SIZE>(device); // e.g. => 512
size_t globalSize = roundUp(localSize, N); // rounds up to a multiple of localSize
addVectorKernel.setArg(0, inputBuffer);
addVectorKernel.setArg(1, resultBuffer);
addVectorKernel.setArg(2, N);
addVectorKernel.setArg(3, (sizeof(cl_float) * localSize), NULL);
queue.enqueueNDRangeKernel(
addVectorKernel,
cl::NullRange,
cl::NDRange(globalSize),
cl::NDRange(localSize)
);
queue.finish(); // wait for ending
queue.enqueueReadBuffer(resultBuffer, CL_TRUE, 0, sizeof(cl_float)*N, b); // e.g. => 1024
The result depends on the workgroup size. What am I doing wrong? Is it the kernel itself or is it the settings for OpenCL?

You should be using the group's id when writing the sum back to global memory.
if (local_id == 0) output[local_size] = cache[0];
That line will write to output[512] repeatedly. You need each work group to write to a dedicated location in the output.
kernel void reduce(global float* input, global float* output, const unsigned int N, local float* cache)
{
const uint local_id = get_local_id(0);
const uint global_id = get_global_id(0);
const uint group_id = get_group_id(0);
const uint local_size = get_local_size(0);
cache[local_id] = (global_id < N) ? input[global_id] : 0.0f;
barrier(CLK_LOCAL_MEM_FENCE);
for (unsigned int s = local_size >> 1; s > 0; s >>= 1) {
if (local_id < s) {
cache[local_id] += cache[local_id + s];
}
barrier(CLK_LOCAL_MEM_FENCE);
}
if (local_id == 0) output[group_id] = cache[0];
}
Then you need to sum the values from the output on the host. Note that 'b' in the host code does not need to hold N elements. Only one element for each work group will be used.
//replace (globalSize/localSize) with the pre-calculated/known number of work groups
for (i=1; i<(globalSize/localSize); i++) {
b[0] += b[i];
}
Now b[0] is your grand total.

In the reduction for loop, you need this:
for(unsigned int s = localSize >> 1; s > 0; s >>= 1)
You are shifting one more bit than you should when initializing s.
After that's fixed, let's look at what your kernel is doing. The host code executes it with globalSize of 8192 and localSize of 512, which results in 16 work groups. Inside the kernel you first sum the data from the two consecutive memory locations at index 2*global_id. For work group with id 15, work item 0, that will be at index 15*512*2 = 15,360 and 15,361, which is outside the boundaries of your input array. I am surprised you don't get a crash. At the same time, this explains why you have double the values that you expect.
To fix it, you can do this:
cache[localID] = input[globalID];
Or specify a global size that's half of the number of the current one.

Related

vectorization and parallelization Xeon Phi

I am looking for an simple example where using vectorization and parallelization on Xeon Phi this has better perfomance than only-Xeon. Could you help me please?
I am trying with the next example. I comment the lines 14, 18 and 19 for run on only-Xeon and uncoment these for Xeon-Phi, but only-Xeon has better performance than Xeon-phi
1.void main(){
2.double *a, *b, *c;
3.int i,j,k, ok, n=100;
4.int nPadded = ( n%8 == 0 ? n : n + (8-n%8) );
5.ok = posix_memalign((void**)&a, 64, n*nPadded*sizeof(double));
6.ok = posix_memalign((void**)&b, 64, n*nPadded*sizeof(double));
7.ok = posix_memalign((void**)&c, 64, n*nPadded*sizeof(double));
8.for(i=0; i<n; i++)
9.{
10. a[i] = (int) rand();
11. b[i] = (int) rand();
12. c[i] = 0.0;
13.}
14.#pragma offload target(mic) in(a,b:length(n*nPadded)) inout(c:length(n*nPadded))
15.#pragma omp parallel for
16.for( i = 0; i < n; i++ )
17. for( k = 0; k < n; k++ )
18. #pragma vector aligned
19. #pragma ivdep
20. for( j = 0; j < n; j++ ){
21. c[i*nPadded+j] = c[i*nPadded+j] + a[i*nPadded+k]*b[k*nPadded+j]
22.}
First couple words about autovectorization. Advantage of autovectorization is simplicity. You need to set some keywords than magic happens and compiler make fast code for you. If you want to go this way try this manual.
The disadvantage of this approach is that there is no easy way to understand how compiler make his work. In vectorization report you will see "LOOP WAS VECTORIZED" or "LOOP WAS NOT VECTORIZED". But if you want truly understand how your code works the only way is look in your program assembly. This is not a problem to get assembly. You need to compile program with -fcode-asm. But I think if you need to read assembly to check how "simple autovectorization" method works it is not so simple.
Alternative to autovectorization are intrinsics (actually, this is not single alternative). Think about intrinsics like assembly wrapped with C functions. Many intrinsics internally wrap single assembly command.
I recommend to use this intrinsics guide.
So my simple way steps:
Make single thread reference implementation. You will use it to check correctness of intrinsics version.
Implement SSE intrinsics version. SSE intrinsics are much simpler and can be tested on Xeon.
Implement AVX-512 version for Xeon Phi.
Measure your speed.
Let's do it with your program.
There are many differences with your program:
I use float instead double.
I use _mm_malloc instead posix_memalign.
I suppose n is divided by 16 without remainder (16 floats in AVX-512 vector register). I don't work with loop peeling in this example.
I use native mode instead of offload mode. KNL is bootable so it is not necessary to use offload mode anymore.
Also I think your program is not correct because it modifies c array from several threads in one moment of time. But lets think it is not important and we just need some calculation job.
My code work time:
Intel Xeon 5680
reference calc time: 97.677505 seconds
Intrinsics calc time: 6.189296 seconds
Intel Xeon Phi (KNC) SE10X
reference calc time: 199.0 seconds
Intrinsics calc time: 2.78 seconds
Code:
#include <stdio.h>
#include <omp.h>
#include <math.h>
#include "immintrin.h"
#include <assert.h>
#define F_E_Q(X,Y,N) (round((X) * pow(10, N)-(Y) * pow(10, N)) == 0)
void reference(float* a, float* b, float* c, int n, int nPadded);
void intrinsics(float* a, float* b, float* c, int n, int nPadded);
char *test(){
int n=4800;
int nPadded = n;
assert(n%16 == 0);
float* a = (float*) _mm_malloc(sizeof(float)*n*nPadded, 64);
float* b = (float*) _mm_malloc(sizeof(float)*n*nPadded, 64);
float* cRef = (float*) _mm_malloc(sizeof(float)*n*nPadded, 64);
float* c = (float*) _mm_malloc(sizeof(float)*n*nPadded, 64);
assert(a != NULL);
assert(b != NULL);
assert(cRef != NULL);
assert(c != NULL);
for(int i=0, max = n*nPadded; i<max; i++){
a[i] = (int) rand() / 1804289408.0;
b[i] = (int) rand() / 1804289408.0;
cRef[i] = 0.0;
c[i] = 0.0;
}
debug_arr("a", "%f", a, 0, 9, 1);
debug_arr("b", "%f", b, 0, 9, 1);
debug_arr("cRef", "%f", cRef, 0, 9, 1);
debug_arr("c", "%f", c, 0, 9, 1);
double t1 = omp_get_wtime();
reference(a, b, cRef, n, nPadded);
double t2 = omp_get_wtime();
debug("reference calc time: %f", t2-t1);
t1 = omp_get_wtime();
intrinsics(a, b, c, n, nPadded);
t2 = omp_get_wtime();
debug("Intrinsics calc time: %f", t2-t1);
debug_arr("cRef", "%f", cRef, 0, 9, 1);
debug_arr("c", "%f", c, 0, 9, 1);
for(int i=0, max = n*nPadded; i<max; i++){
assert(F_E_Q(cRef[i], c[i], 2));
}
_mm_free(a);
_mm_free(b);
_mm_free(cRef);
_mm_free(c);
return NULL;
}
void reference(float* a, float* b, float* c, int n, int nPadded){
for(int i = 0; i < n; i++ )
for(int k = 0; k < n; k++ )
for(int j = 0; j < n; j++ )
c[i*nPadded+j] = c[i*nPadded+j] + a[i*nPadded+k]*b[k*nPadded+j];
}
#if __MIC__
void intrinsics(float* a, float* b, float* c, int n, int nPadded){
#pragma omp parallel for
for(int i = 0; i < n; i++ )
for(int k = 0; k < n; k++ )
for(int j = 0; j < n; j+=16 ){
__m512 aPart = _mm512_extload_ps(a + i*nPadded+k, _MM_UPCONV_PS_NONE, _MM_BROADCAST_1X16, _MM_HINT_NONE);
__m512 bPart = _mm512_load_ps(b + k*nPadded+j);
__m512 cPart = _mm512_load_ps(c + i*nPadded+j);
cPart = _mm512_add_ps(cPart, _mm512_mul_ps(aPart, bPart));
_mm512_store_ps(c + i*nPadded+j, cPart);
}
}
#else
void intrinsics(float* a, float* b, float* c, int n, int nPadded){
#pragma omp parallel for
for(int i = 0; i < n; i++ )
for(int k = 0; k < n; k++ )
for(int j = 0; j < n; j+=4 ){
__m128 aPart = _mm_load_ps1(a + i*nPadded+k);
__m128 bPart = _mm_load_ps(b + k*nPadded+j);
__m128 cPart = _mm_load_ps(c + i*nPadded+j);
cPart = _mm_add_ps(cPart, _mm_mul_ps(aPart, bPart));
_mm_store_ps(c + i*nPadded+j, cPart);
}
}
#endif

OpenCL: Inserting local atomic_inc to reduction kernel

I am trying to include a local atomic similar to that described by DarkZeros here within a working reduction kernel. The kernel finds a largest value within a set of points; the aim of the local atomic is to allow me to filter selected point_ids into an output array without any gaps.
At present when I use the local atomic to increment the addition to a local array the kernel runs but produces a wrong overall highest point. If the atomic line is commented out then a correct result returns.
What is going on here and how do I fix it?
Simplified kernel code:
__kernel void reduce(__global const float4* dataSet, __global const int* input, const unsigned int items, //points and index
__global int* output, __local float4* shared, const unsigned int n, //finding highest
__global int* filtered, __global const float2* tri_input, const unsigned int pass, //finding filtered
__global int* global_count //global count
){
//set everything up
const unsigned int group_id = get_global_id(0) / get_local_size(0);
const unsigned int local_id = get_local_id(0);
const unsigned int group_size = items;
const unsigned int group_stride = 2 * group_size;
const int local_stride = group_stride * group_size;
__local float4 *zeroIt = &shared[local_id];
zeroIt->x = 0; zeroIt->y = 0; zeroIt->z = 0; zeroIt->w = 0;
volatile __local int local_count_set_1;
volatile __local int global_val_set_1;
volatile __local int filter_local[64];
if(local_id==0){
local_count_set_1 = 0;
global_val_set_1 = -1;
}
barrier(CLK_LOCAL_MEM_FENCE);
int i = group_id * group_stride + local_id;
while (i < n){
//load up a pair of points using the index to locate them within a massive dataSet
int ia = input[i];
float4 a = dataSet[ia-1];
int ib = input[i + group_size];
float4 b = dataSet[ib-1];
//on the first pass kernel increment a local count
if(pass == 0){
filter_local[atomic_inc(&local_count_set_1)] = 1; //including this line causes an erroneous highest point result
//filter_local[local_id] = 1; //but including this line does not
//atomic_inc(&local_count_set_1); //and neither does this one
}
//find the highest of the pair
float4 result;
if(a.z>b.z) result = a;
else result = b;
//load up the previous highest result locally
float4 s = shared[local_id];
//if the previous highest beat this, stick, else twist
if(s.z>result.z){ result = s; }
shared[local_id] = result;
i += local_stride;
}
barrier(CLK_LOCAL_MEM_FENCE);
if (group_size >= 512){
if (local_id < 256) {
__local float4 *a = &shared[local_id];
__local float4 *b = &shared[local_id+256];
if(b->z>a->z){ shared[local_id] = shared[local_id+256]; }
}}
//repeat barrier ops in increments down to group_size>=2 - this filters the highest result in shared
//finally, return the filtered highest result of shared to the global level
barrier(CLK_LOCAL_MEM_FENCE);
if(local_id == 0){
__local float4 *v = &shared[0];
int send = v->w ;
output[group_id] = send+1;
}}
[UPDATE]: When the atomic_inc line is included the 'wrong' highest point result is always a point near the end of the test dataset. I'm guessing that this means that the atomic_inc is affecting a latter comparison, but I'm not sure exactly what or where yet.
[UPDATE]: Edited code to simplify/clarify/update with debugging tweaks. Still not working and it is driving me loopy.
Total face-palm moment. In the setup phase of the kernel there are the lines:
if(local_id==0){
local_count_set_1 = 0;
global_val_set_1 = -1;
}
barrier(CLK_LOCAL_MEM_FENCE);
When these are split and the local_count_set_1 is included within the while loop, the error does not occur. i.e:
if(local_id==0) global_val_set_1 = -1;
barrier(CLK_LOCAL_MEM_FENCE);
while (i < n){
if(local_id==0) local_count_set_1 = 0;
barrier(CLK_LOCAL_MEM_FENCE);
....
if(pass = 0){
filter_local[atomic_inc(&local_count_set_1)] = 1;
}
....
I'm hoping this fixes the issue // will update if not.
Aaaand that's a weekend I'll never get back.

Optimal workgroup size for sum reduction in OpenCL

I am using the following kernel for sum reduciton.
__kernel void reduce(__global float* input, __global float* output, __local float* sdata)
{
// load shared mem
unsigned int tid = get_local_id(0);
unsigned int bid = get_group_id(0);
unsigned int gid = get_global_id(0);
unsigned int localSize = get_local_size(0);
unsigned int stride = gid * 2;
sdata[tid] = input[stride] + input[stride + 1];
barrier(CLK_LOCAL_MEM_FENCE);
// do reduction in shared mem
for(unsigned int s = localSize >> 2; s > 0; s >>= 1)
{
if(tid < s)
{
sdata[tid] += sdata[tid + s];
}
barrier(CLK_LOCAL_MEM_FENCE);
}
// write result for this block to global mem
if(tid == 0) output[bid] = sdata[0];
}
It works fine, but I don't know how to choose the optimal workgroup size or number of workgroups if I need more than one workgroup (for example if I want to calculate the sum of 1048576 elements). As far as I understand, the more workgroups I use, the more subresults I will get, which also means that I will need more global reductions at the end.
I've seen the answers to the general workgroup size question here. Are there any recommendations that concern reduction operations specifically?
This question is a possible duplicate of one I answered a while back:
What is the algorithm to determine optimal work group size and number of workgroup.
Experimentation will be the best way to know for sure for any given device.
Update:
I think you can safely stick to 1-dimensional work groups, as you have done in your sample code. On the host, you can try out the best values.
For each device:
1) query for CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE.
2) loop over a few multiples and run the kernel with that group size. save the execution time for each test.
3) when you think you have an optimal value, hard code it into a new kernel for use with that specific device. This will give a further boost to performance. You can also eliminate your sdata parameter in the device-specific kernel.
//define your own context, kernel, queue here
int err;
size_t global_size; //set this somewhere to match your test data size
size_t preferred_size;
size_t max_group_size;
err = clGetKernelWorkGroupInfo(kernel, device_id, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, sizeof(size_t), preferred_size, NULL);
//check err
err = clGetKernelWorkGroupInfo(kernel, device_id, CL_KERNEL_WORK_GROUP_SIZE, sizeof(size_t), max_group_size, NULL);
//check err
size_t test_size;
//your vars for hi-res timer go here
for (unsigned int i=preferred_size ; i<=max_group_size ; i+=preferred_size){
//reset timer
test_size = (size_t)i;
err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &test_size, 0, NULL, NULL);
if(err){
fail("Unable to enqueue kernel"); //implement your own fail function somewhere..
}else{
clfinish(queue);
//stop timer, save value
//output timer value and test_size
}
}
The device-specific kernel can look like this, except the first line should have your optimal value substituted:
#define LOCAL_SIZE 32
__kernel void reduce(__global float* input, __global float* output)
{
unsigned int tid = get_local_id(0);
unsigned int stride = get_global_id(0) * 2;
__local float sdata[LOCAL_SIZE];
sdata[tid] = input[stride] + input[stride + 1];
barrier(CLK_LOCAL_MEM_FENCE);
for(unsigned int s = LOCAL_SIZE >> 2; s > 0; s >>= 1){
if(tid < s){
sdata[tid] += sdata[tid + s];
}
barrier(CLK_LOCAL_MEM_FENCE);
}
if(tid == 0) output[get_group_id(0)] = sdata[0];
}

OpenCL clEnqueueNDRangeKernel how to set work group size correctly

In OpenCL, if I want to add two N-dimension vectors, the global work group size (globalSize) should satisfy globalSize = ceil(N/localSize) * localSize, where localSize is the local work group size. Is this correct? If N = 1000, and localSize = 128, globalSize should be 1024? Can we always set globalSize some multiple of localSize and larger than needed?
I tried many times and it worked well for 1-dimension problems.
However, when it comes to 2d problems, for example, multiply two matrices of dimension m*n and n*p, the result matrix is of order m*p, things get more complicated.
The max work group size on my device is 128, so I set localSize [2] = {16,8} and
globalSize [2] = {ceil(m/16)*16,ceil(p/8)*8}.
It is similar to the 1-dimension case but the result is wrong!
If I set localSize [2] = {1,128} and change the globalSize accordingly, I can get the correct result. So where is the problem? Can anyone tell me why?
In addition, I find out the indices where the matrix element is wrong.
It seems that the result is wrong at (i,j) where i*p + j = n * some constant (n = 1,2,3...)
Why?
Here is my kernel function:
kernel void mmult(const int Mdim, const int Ndim, const int Pdim,
global float *A, global float *B, global float *C)
{
int i = get_global_id(1);
int j = get_global_id(0);
if(i < 0 || j < 0 || i > Mdim || j > Pdim) return;
else
{
float tmp = 0;
for(int k = 0; k < Ndim; k++)
tmp += A[i*Ndim+k] * B[k*Pdim+j];
C[i*Pdim + j] = tmp;
}
}
And then it is the host program:
#define __NO_STD_VECTOR // Use cl::vector instead of STL version
#define __CL_ENABLE_EXCEPTIONS
#include <CL/cl.hpp>
#include <utility>
#include <iostream>
#include <fstream>
#include <string>
#include <cmath>
using namespace cl;
int main()
{
// Create the two input matrices
int m = 1000;
int n = 1000;
int p = 1000;
float *A = new float[m*n];
float *B = new float[n*p];
for(int i = 0; i < m*n; i++)
{
A[i] = i;
}
for(int i = 0; i < n*p; i++)
{
B[i] = i;
}
try
{
// Get available platforms
vector<Platform> platforms;
Platform::get(&platforms);
// Select the default platform and create a context using this platform and the GPU
cl_context_properties cps[3] =
{
CL_CONTEXT_PLATFORM,
(cl_context_properties)(platforms[0])(),
0
};
Context context( CL_DEVICE_TYPE_GPU, cps);
// Get a list of devices on this platform
vector<Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();
// Create a command queue and use the first device
CommandQueue queue = CommandQueue(context, devices[0]);
// Read source file
std::ifstream sourceFile("mmul.cl");
std::string sourceCode(
std::istreambuf_iterator<char>(sourceFile),
(std::istreambuf_iterator<char>()));
Program::Sources source(1, std::make_pair(sourceCode.c_str(), sourceCode.length()+1));
// Make program of the source code in the context
Program program = Program(context, source);
// Build program for these specific devices
program.build(devices);
// Make kernel
Kernel kernel(program, "mmult");
// Create memory buffers
Buffer bufferA = Buffer(context, CL_MEM_READ_ONLY, m*n * sizeof(float));
Buffer bufferB = Buffer(context, CL_MEM_READ_ONLY, p*n * sizeof(float));
Buffer bufferC = Buffer(context, CL_MEM_WRITE_ONLY, m*p * sizeof(float));
// Copy lists A and B to the memory buffers
queue.enqueueWriteBuffer(bufferA, CL_TRUE, 0, m * n * sizeof(float), A);
queue.enqueueWriteBuffer(bufferB, CL_TRUE, 0, p * n * sizeof(float), B);
// Set arguments to kernel
kernel.setArg(0, m);
kernel.setArg(1, n);
kernel.setArg(2, p);
kernel.setArg(3, bufferA);
kernel.setArg(4, bufferB);
kernel.setArg(5, bufferC);
// Run the kernel on specific ND range
NDRange global((ceil((float)(p)/16))*16,(ceil((float)(m)/8))*8);
NDRange local(16,8);
queue.enqueueNDRangeKernel(kernel, NullRange, global, local);
// Read buffer C into a local list
float *C = new float[m*p];
queue.enqueueReadBuffer(bufferC, CL_TRUE, 0, m*p * sizeof(float), C);
// check the correctness of the result
float *c = new float[m*p];
for(int i = 0; i < m; i++)
for(int j = 0; j < p; j++)
{
float z = 0.0;
for(int k = 0; k < n; k++)
{
z += A[i*n+k] * B[k*p+j];
}
c[i*p+j] = z;
}
for(int i = 0; i < m*p; i++)
{
if(fabs(c[i]-C[i])>0.001)
std::cout<<i<<" "<<c[i]<<" "<<C[i]<<std::endl;
}
delete []A;
delete []B;
delete []C;
}
catch(Error error)
{
std::cout << error.what() << "(" << error.err() << ")" << std::endl;
}
return 0;
}
Your bounds checking code inside your OpenCL kernel is incorrect. Instead of this:
if(i < 0 || j < 0 || i > Mdim || j > Pdim) return;
You should have this:
if(i < 0 || j < 0 || i >= Mdim || j >= Pdim) return;
Let's assume, that you have float matrix of size 1000x1000:
const int size = 1000;
// Whatever
float* myMatrix = (float*)calloc(size * size, sizeof(*myMatrix));
Determine size of Local Group first:
size_t localSize[] = {16, 8};
Then determine, how many Local Groups do you need:
size_t numLocalGroups[] = {ceil(size/localSize[0]), ceil(size/localSize[1])};
Finally, determine NDRange size:
size_t globalSize[] = {localSize[0] * numLocalGroups[0], localSize[1] * numLocalGroups[1]};
Don't forget to handle out-of-bounds access in right-most Local Groups.

OpenCL Matrix multiplication: inner product versus outer product

I'm hoping everyone is familiar with the standard "naive" method of multiplying two (n x n square for simplicity) matrices. In C this is:
for(int i = 0; i < n; ++i)
for(int j = 0; j < n; ++j)
for(int k = 0; k < n; ++k)
C[i*n + j] += A[i*n + k] * B[k*n + j];
The above method computes the dot (inner) product of a row of A with a column of B and is easy to implement in OpenCL as follows:
__kernel void matmul_ocl(
__global const float *A,
__global const float *B,
__global float *C,
const int n
)
{
const int row = get_global_id(1); // row
const int col = get_global_id(0); // col
for(int i = 0; i < n; i++)
C[row*n + col] += A[row*n + i]*B[i*n + col];
}
Interchanging the two inner-most loops of the original C implementation results in a method that computes outer products, i.e., it computes rank-1 updates of the rows of the C matrix:
for(int i = 0; i < n; ++i)
for(int k = 0; k < n; ++k)
for(int j = 0; j < n; ++j)
C[i*n + j] += A[i*n + k] * B[k*n + j];
Does anybody know how to properly implement the above outer-product method in OpenCL? I have two of my attempts pasted below but I just can't seem to nail it
Attempt 1
__kernel void matmul_ocl(
__global const float *A,
__global const float *B,
__global float *C,
const int n
)
{
const int row = get_global_id(1); // row
const int col = get_global_id(0); // col
__local float r;
r = A[row*n + col];
barrier(CLK_LOCAL_MEM_FENCE);
for(int i = 0; i < n; ++i)
C[row*n + i] += r * B[col*n + i];
}
Attempt 2
#define TS 1
__kernel void matmul_ocl(
__global const float *A,
__global const float *B,
__global float *C,
int n)
{
// Thread coordinates
const int row = get_local_id(1); // row
const int col = get_local_id(0); // col
// Group tile coordinates
const int by = get_group_id(1); // row
const int bx = get_group_id(0); // col
A += TS*by + TS*bx*n + n*row + (col);
B += TS*by*n + n*row + (col);
C += TS*bx*n + n*(row) + col;
__global const float *Blast = B + n;
float c[2] = {0.0f,0.0f};
float* cptr = &c[0];
__local float bs[2];
do
{
bs[0] = B[0];
bs[1] = B[n];
barrier(CLK_LOCAL_MEM_FENCE);
*cptr += A[0] * bs[0];
*cptr++ += A[0] * bs[1];
B++;
barrier(CLK_LOCAL_MEM_FENCE);
} while( B < Blast );
C[0] += c[0];
C[1] += c[1];
}
The OpenCL implementation of the common algorithm maps the outer two loops to the OpenCL NDRange implicit loops. This works because the outer two loops can be safely run in parallel.
There are a few problems with Attempt 1:
The __local variable r is assigned different values from multiple work-items simultaneously. There is a race condition here, the value of r is undefined. This could be fixed by just making r a private variable instead.
The more serious problem is that there is a race condition in the assignment of C. Every value of col (NDRange dimension 0) will be running its own loop over i in parallel.
There isn't a simple way around the second issue. The loop over k (in the transposed version) cannot be run in parallel. You can only map either the outer loop or the inner loop to a single dimensional NDRange in OpenCL.

Resources