Good day,
I think I've tried everything to figure out where the problem is, but I couldn't find it. I have the following host code:
cl_mem cl_distances = clCreateBuffer(context, CL_MEM_READ_WRITE, 2 * sizeof(cl_uint), NULL, NULL);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &cl_distances);
cl_event event;
clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global_workers, &local_workers, 0, NULL, &event);
clWaitForEvents(1, &event);
And for a device:
__kernel void walk(__global uint *distance_results)
{
    uint global_size = get_global_size(0);
    uint local_size = get_local_size(0);
    uint global_id = get_global_id(0);
    uint group_id = get_group_id(0);
    uint local_id = get_local_id(0);

    for (uint step = 0; step < 500; step++) {
        if (local_id == 0) {
            distance_results[group_id] = 0;
        }
        barrier(CLK_LOCAL_MEM_FENCE);

        for (uint n = global_id; n < 1000; n += global_size) {
            if (local_id == 0) {
                atomic_add(&distance_results[group_id], 1);
            }
        }
        barrier(CLK_GLOBAL_MEM_FENCE);

        if (global_id == 0) {
            for (uint i = 0; i < (global_size / local_size); i++) {
                printf("step: %d; group: %d; data: %d\n", step, i, distance_results[i]);
            }
        }
        barrier(CLK_GLOBAL_MEM_FENCE);
    }
}
So at each "step" I just add one 1 to distance[group_id] 1000 times from each group. And then I just read the result from thread with global_id == 1.
At each step I should have the following text:
step: 59; group: 0; data: 500
step: 59; group: 1; data: 500
But actually there are a lot of strings with wrong data:
step: 4; group: 0; data: 500
step: 4; group: 1; data: 210
step: 5; group: 0; data: 500
step: 5; group: 1; data: 214
If I set global_workers to 1 and local_workers to 1 then everything is okay. But if I set global_workers to 2 and local_workers to 1 then I have this strange behavior.
Do you have any ideas why this can happen?
There are a couple of things going on here, but I think the core problem comes from a very common misunderstanding in OpenCL. This call:
barrier(CLK_GLOBAL_MEM_FENCE);
This is not a global barrier. It is a local barrier with a global memory fence. In other words, it still only synchronizes between work items in a single work group, not between work items in other work groups.
The loop in the code that prints the results will only have correct values for work group 0, since it is only run in work group 0. If you really want this code to work, the loop that prints the results would have to be in a separate NDRange, with proper synchronization between the NDRanges.
The memory fence flag only controls which kinds of memory writes are made visible at the barrier. In this case you want global fences on both barriers, since you are fencing global memory writes, not local memory writes.
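If you really need the per-step printout, one way to restructure this is to split the work into two kernels and enqueue them alternately; with an in-order command queue, the enqueue order then provides the cross-group ordering that barrier() cannot. This is only a minimal sketch: the kernel names, the num_groups/step arguments, and the host loop are assumptions, not your original code.
// Sketch only: accumulation and printing separated into two NDRanges.
__kernel void accumulate(__global uint *distance_results)
{
    uint global_size = get_global_size(0);
    uint global_id = get_global_id(0);
    uint group_id = get_group_id(0);

    if (get_local_id(0) == 0)
        distance_results[group_id] = 0;
    barrier(CLK_GLOBAL_MEM_FENCE);

    for (uint n = global_id; n < 1000; n += global_size)
        if (get_local_id(0) == 0)
            atomic_add(&distance_results[group_id], 1);
}

__kernel void print_results(__global uint *distance_results, uint num_groups, uint step)
{
    if (get_global_id(0) == 0)
        for (uint i = 0; i < num_groups; i++)
            printf("step: %d; group: %d; data: %d\n", step, i, distance_results[i]);
}

/* Host side: an in-order queue executes each enqueued NDRange only after the
   previous one has finished, which is the "proper synchronization" above. */
cl_uint num_groups = (cl_uint)(global_workers / local_workers);
clSetKernelArg(accumulate_kernel, 0, sizeof(cl_mem), &cl_distances);
clSetKernelArg(print_kernel, 0, sizeof(cl_mem), &cl_distances);
clSetKernelArg(print_kernel, 1, sizeof(cl_uint), &num_groups);
for (cl_uint step = 0; step < 500; step++) {
    clSetKernelArg(print_kernel, 2, sizeof(cl_uint), &step);
    clEnqueueNDRangeKernel(command_queue, accumulate_kernel, 1, NULL,
                           &global_workers, &local_workers, 0, NULL, NULL);
    clEnqueueNDRangeKernel(command_queue, print_kernel, 1, NULL,
                           &global_workers, &local_workers, 0, NULL, NULL);
}
clFinish(command_queue);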
Related
If I have more than one work-item to execute some kernel code, do I need to have more events to track the execution time for each work-item?
I have some strange results, 1 work-item takes about 4 seconds to execute and 100 work-items also take about 4 seconds to execute. I can't see how this could be possible since my Nvidia GeForce GT 525M only has 2 compute units, each with 48 processing elements. This leads me to believe the event I listed as an argument in clEnqueueNDRangeKernel tracks only one work-item. Is that true and if so, how can I get it to track all the work-items?
This is what the Khronos user guide says about the event argument in clEnqueueNDRangeKernel:
event returns an event object that identifies this particular kernel execution instance
What is the meaning of "this particular kernel execution instance"? Isn't that a single work-item?
EDIT:
Relevant host code:
static const size_t numberOfWorkItems = 48;
const size_t globalWorkSize[] = { numberOfWorkItems, 0, 0 };
cl_event events;
ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, globalWorkSize, NULL, 0, NULL, &events);
ret = clEnqueueReadBuffer(command_queue, memobj, CL_TRUE, 0, sizeof(cl_mem), val, 0, NULL, NULL);
clWaitForEvents(1, &events);
cl_ulong time_start;
cl_ulong time_end;
clGetEventProfilingInfo(events, CL_PROFILING_COMMAND_QUEUED, sizeof(cl_ulong), &time_start, NULL);
clGetEventProfilingInfo(events, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &time_end, NULL);
double nanoSeconds = (double) (time_end - time_start);
printf("OpenCl Execution time is: %f milliseconds \n",nanoSeconds / 1000000.0);
printf("Result: %lu\n", val[0]);
Kernel code:
kernel void parallel_operation(__global ulong *val) {
    size_t i = get_global_id(0);
    int n = 48;
    local unsigned int result[48];

    for (int z = 0; z < n; z++) {
        result[z] = 0;
    }

    // here comes the long operation
    for (ulong k = 0; k < 2000; k++) {
        for (ulong j = 0; j < 10000; j++) {
            result[i] += (j * 3) % 5;
        }
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    if (i == 0) {
        for (int z = 1; z < n; z++) {
            result[0] += result[z];
        }
        *val = result[0];
    }
}
You are measuring the execution time of your entire kernel launch, in other words the time between when the first work-item starts and the last work-item finishes. To my knowledge there is no way to measure the execution time of a single work-item in OpenCL.
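As a side note, the profiling timestamps on that event are only valid if the command queue was created with profiling enabled. A minimal sketch of measuring the whole launch (OpenCL 1.x API; the context and device_id variables are assumed to exist in your host code):
// Sketch: the event returned by clEnqueueNDRangeKernel describes the whole
// kernel launch (all 48 work-items), not an individual work-item.
cl_int err;
cl_command_queue queue = clCreateCommandQueue(context, device_id,
                                              CL_QUEUE_PROFILING_ENABLE, &err);
cl_event ev;
err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, globalWorkSize, NULL,
                             0, NULL, &ev);
clWaitForEvents(1, &ev);

cl_ulong queued, ended;
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_QUEUED, sizeof(queued), &queued, NULL);
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END, sizeof(ended), &ended, NULL);
printf("whole launch: %f ms\n", (ended - queued) / 1e6);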
I am trying to run this fast Fourier transform implementation. It compiles fine but gives this error at runtime. I have no idea what the error means. Can anyone help me out?
I compiled and ran the program with:
mpicc -o exec test.c
./exec
CODE:
This is code I found on GitHub. It is a parallel version of the fast Fourier transform algorithm.
#include <stdio.h>
#include <mpi.h> //To use MPI
#include <complex.h> //to use complex numbers
#include <math.h> //for cos() and sin()
#include "timer.h" //to use timer
#define PI 3.14159265
#define bigN 16384 //Problem Size
#define howmanytimesavg 3
int main()
{
int my_rank,comm_sz;
MPI_Init(NULL,NULL); //start MPI
MPI_Comm_size(MPI_COMM_WORLD,&comm_sz); //how many processes are we using?
MPI_Comm_rank(MPI_COMM_WORLD,&my_rank); //which process is this?
double start,finish;
double avgtime = 0;
FILE *outfile;
int h;
if(my_rank == 0) //if process 0 open outfile
{
outfile = fopen("ParallelVersionOutput.txt", "w"); //open from current directory
}
for(h = 0; h < howmanytimesavg; h++) //loop to run multiple times for AVG time.
{
if(my_rank == 0) //If it's process 0 starts timer
{
start = MPI_Wtime();
}
int i,k,n,j; //Basic loop variables
double complex evenpart[(bigN / comm_sz / 2)]; //array to save the data for EVENHALF
double complex oddpart[(bigN / comm_sz / 2)]; //array to save the data for ODDHALF
double complex evenpartmaster[ (bigN / comm_sz / 2) * comm_sz]; //array to save the data for EVENHALF
double complex oddpartmaster[ (bigN / comm_sz / 2) * comm_sz]; //array to save the data for ODDHALF
double storeKsumreal[bigN]; //store the K real variable so we can abuse symmetry
double storeKsumimag[bigN]; //store the K imaginary variable so we can abuse symmetry
double subtable[(bigN / comm_sz)][3]; //Each process owns a subtable from the table below
double table[bigN][3] = //TABLE of numbers to use
{
0,3.6,2.6, //n, Real,Imaginary CREATES TABLE
1,2.9,6.3,
2,5.6,4.0,
3,4.8,9.1,
4,3.3,0.4,
5,5.9,4.8,
6,5.0,2.6,
7,4.3,4.1,
};
if(bigN > 8) //Everything after row 8 is all 0's
{
for(i = 8; i < bigN; i++)
{
table[i][0] = i;
for(j = 1; j < 3;j++)
{
table[i][j] = 0.0; //set to 0.0
}
}
}
int sendandrecvct = (bigN / comm_sz) * 3; //how much to send and receive?
MPI_Scatter(table,sendandrecvct,MPI_DOUBLE,subtable,sendandrecvct,MPI_DOUBLE,0,MPI_COMM_WORLD); //scatter the table to subtables
for (k = 0; k < bigN / 2; k++) //K coeffiencet Loop
{
/* Variables used for the computation */
double sumrealeven = 0.0; //sum of real numbers for even
double sumimageven = 0.0; //sum of imaginary numbers for even
double sumrealodd = 0.0; //sum of real numbers for odd
double sumimagodd = 0.0; //sum of imaginary numbers for odd
for(i = 0; i < (bigN/comm_sz)/2; i++) //Sigma loop EVEN and ODD
{
double factoreven , factorodd = 0.0;
int shiftevenonnonzeroP = my_rank * subtable[2*i][0]; //used to shift index numbers for correct results for EVEN.
int shiftoddonnonzeroP = my_rank * subtable[2*i + 1][0]; //used to shift index numbers for correct results for ODD.
/* -------- EVEN PART -------- */
double realeven = subtable[2*i][1]; //Access table for real number at spot 2i
double complex imaginaryeven = subtable[2*i][2]; //Access table for imaginary number at spot 2i
double complex componeeven = (realeven + imaginaryeven * I); //Create the first component from table
if(my_rank == 0) //if proc 0, dont use shiftevenonnonzeroP
{
factoreven = ((2*PI)*((2*i)*k))/bigN; //Calculates the even factor for Cos() and Sin()
// *********Reduces computational time*********
}
else //use shiftevenonnonzeroP
{
factoreven = ((2*PI)*((shiftevenonnonzeroP)*k))/bigN; //Calculates the even factor for Cos() and Sin()
// *********Reduces computational time*********
}
double complex comptwoeven = (cos(factoreven) - (sin(factoreven)*I)); //Create the second component
evenpart[i] = (componeeven * comptwoeven); //store in the evenpart array
/* -------- ODD PART -------- */
double realodd = subtable[2*i + 1][1]; //Access table for real number at spot 2i+1
double complex imaginaryodd = subtable[2*i + 1][2]; //Access table for imaginary number at spot 2i+1
double complex componeodd = (realodd + imaginaryodd * I); //Create the first component from table
if (my_rank == 0)//if proc 0, dont use shiftoddonnonzeroP
{
factorodd = ((2*PI)*((2*i+1)*k))/bigN;//Calculates the odd factor for Cos() and Sin()
// *********Reduces computational time*********
}
else //use shiftoddonnonzeroP
{
factorodd = ((2*PI)*((shiftoddonnonzeroP)*k))/bigN;//Calculates the odd factor for Cos() and Sin()
// *********Reduces computational time*********
}
double complex comptwoodd = (cos(factorodd) - (sin(factorodd)*I));//Create the second component
oddpart[i] = (componeodd * comptwoodd); //store in the oddpart array
}
/*Process ZERO gathers the even and odd part arrays and creates a evenpartmaster and oddpartmaster array*/
MPI_Gather(evenpart,(bigN / comm_sz / 2),MPI_DOUBLE_COMPLEX,evenpartmaster,(bigN / comm_sz / 2), MPI_DOUBLE_COMPLEX,0,MPI_COMM_WORLD);
MPI_Gather(oddpart,(bigN / comm_sz / 2),MPI_DOUBLE_COMPLEX,oddpartmaster,(bigN / comm_sz / 2), MPI_DOUBLE_COMPLEX,0,MPI_COMM_WORLD);
if(my_rank == 0)
{
for(i = 0; i < (bigN / comm_sz / 2) * comm_sz; i++) //loop to sum the EVEN and ODD parts
{
sumrealeven += creal(evenpartmaster[i]); //sums the realpart of the even half
sumimageven += cimag(evenpartmaster[i]); //sums the imaginarypart of the even half
sumrealodd += creal(oddpartmaster[i]); //sums the realpart of the odd half
sumimagodd += cimag(oddpartmaster[i]); //sums the imaginary part of the odd half
}
storeKsumreal[k] = sumrealeven + sumrealodd; //add the calculated reals from even and odd
storeKsumimag[k] = sumimageven + sumimagodd; //add the calculated imaginary from even and odd
storeKsumreal[k + bigN/2] = sumrealeven - sumrealodd; //ABUSE symmetry Xkreal + N/2 = Evenk - OddK
storeKsumimag[k + bigN/2] = sumimageven - sumimagodd; //ABUSE symmetry Xkimag + N/2 = Evenk - OddK
if(k <= 10) //Do the first 10 K's
{
if(k == 0)
{
fprintf(outfile," \n\n TOTAL PROCESSED SAMPLES : %d\n",bigN);
}
fprintf(outfile,"================================\n");
fprintf(outfile,"XR[%d]: %.4f XI[%d]: %.4f \n",k,storeKsumreal[k],k,storeKsumimag[k]);
fprintf(outfile,"================================\n");
}
}
}
if(my_rank == 0)
{
GET_TIME(finish); //stop timer
double timeElapsed = finish-start; //Time for that iteration
avgtime = avgtime + timeElapsed; //AVG the time
fprintf(outfile,"Time Elaspsed on Iteration %d: %f Seconds\n", (h+1),timeElapsed);
}
}
if(my_rank == 0)
{
avgtime = avgtime / howmanytimesavg; //get avg time
fprintf(outfile,"\nAverage Time Elaspsed: %f Seconds", avgtime);
fclose(outfile); //CLOSE file ONLY proc 0 can.
}
MPI_Barrier(MPI_COMM_WORLD); //wait to all proccesses to catch up before finalize
MPI_Finalize(); //End MPI
return 0;
}
ERROR:
Fatal error in PMPI_Gather: Invalid datatype, error stack:
PMPI_Gather(904): MPI_Gather(sbuf=0x7fffb62799a0, scount=8192,
MPI_DATATYPE_NULL, rbuf=0x7fffb6239980, rcount=8192, MPI_DATATYPE_NULL,
root=0, MPI_COMM_WORLD) failed
PMPI_Gather(815): Datatype for argument sendtype is a null datatype
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=537490947
:
system msg for write_line failure : Bad file descriptor
There is no MPI_DATATYPE_NULL in your code; the only complex datatype you use is MPI_DOUBLE_COMPLEX. Note that the latter is a Fortran datatype, and strictly speaking, using it in C is not correct.
My guess is that MPI_DOUBLE_COMPLEX is causing the issue (the type is not defined or not initialized because you invoked the C version of MPI_Init()).
You can obviously rewrite your code in Fortran, or use your own derived datatype for a C double complex number.
Meanwhile, I suggest you write simple C and Fortran hello-world programs that use MPI_DOUBLE_COMPLEX (an MPI_Bcast() of one element, for example) to confirm whether the issue is with MPI_DOUBLE_COMPLEX and whether it is restricted to C.
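If you want to stay in C, a minimal sketch of the derived-datatype route suggested above could look like this (two contiguous doubles match the layout of a C double complex; MPI 2.2 and later also define MPI_C_DOUBLE_COMPLEX if your implementation provides it):
// Sketch only: build a portable datatype for C "double complex"
// out of two MPI_DOUBLEs and use it in the gather calls.
MPI_Datatype mpi_double_complex_c;
MPI_Type_contiguous(2, MPI_DOUBLE, &mpi_double_complex_c);
MPI_Type_commit(&mpi_double_complex_c);

MPI_Gather(evenpart, (bigN / comm_sz / 2), mpi_double_complex_c,
           evenpartmaster, (bigN / comm_sz / 2), mpi_double_complex_c,
           0, MPI_COMM_WORLD);

// ...when done:
MPI_Type_free(&mpi_double_complex_c);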
I have a kernel function that only writes a number to a __global int* c.
To be specific it looks like this:
__kernel void Add1(__global int* c)
{
    *c = 3;
}
and in host code I have allocated memory for C value:
cl_mem bufferC[deviceNumber]; // deviceNumber = 8
for (int i = 0; i < deviceNumber; i++) {
    bufferC[i] = clCreateBuffer(context[i], CL_MEM_WRITE_ONLY, sizeof(cl_int) * global_size, NULL, &error);
}
for (int i = 0; i < deviceNumber; i++) {
    error = clSetKernelArg(kernel[i], 0, sizeof(cl_mem), (void*)&bufferC[i]);
}
for (int i = 0; i < deviceNumber; i++) {
    error = clEnqueueReadBuffer(commandQueue[i], bufferC[i], CL_TRUE, 0, sizeof(cl_int) * global_size, &c[i], 0, NULL, NULL);
}
and I print it like:
for (size_t i = 0; i < deviceNumber; ++i)
{
    std::cout << "delta = " << c[i] << std::endl;
}
and output:
delta = 3
delta = 11165
delta = -1329524360
delta = 11165
delta = 0
delta = 0
delta = -1329520352
delta = 11165
So the first value is OK, but the rest is garbage - do you know what mistake I made?
Of course this is only partial code, but I think I pasted all the lines involving that 'c' value. The global size is set to 1.
Well, my mistake was that I created a number of contexts but passed a single device as the argument instead of an array of them. I found it by printing the error codes in the program - try that if you run into problems! Cheers
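For reference, a minimal sketch of what the fix amounts to (the devices array and context array names are assumed): pass the right device to each clCreateContext call and check every returned error code.
// Sketch: one context per device, with the error code checked each time.
for (int i = 0; i < deviceNumber; i++) {
    cl_int error;
    context[i] = clCreateContext(NULL, 1, &devices[i], NULL, NULL, &error);
    if (error != CL_SUCCESS) {
        printf("clCreateContext failed for device %d: %d\n", i, error);
        return 1;
    }
}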
I am using the following kernel for sum reduciton.
__kernel void reduce(__global float* input, __global float* output, __local float* sdata)
{
    // load shared mem
    unsigned int tid = get_local_id(0);
    unsigned int bid = get_group_id(0);
    unsigned int gid = get_global_id(0);
    unsigned int localSize = get_local_size(0);
    unsigned int stride = gid * 2;

    sdata[tid] = input[stride] + input[stride + 1];
    barrier(CLK_LOCAL_MEM_FENCE);

    // do reduction in shared mem
    for (unsigned int s = localSize >> 2; s > 0; s >>= 1)
    {
        if (tid < s)
        {
            sdata[tid] += sdata[tid + s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // write result for this block to global mem
    if (tid == 0) output[bid] = sdata[0];
}
It works fine, but I don't know how to choose the optimal workgroup size or number of workgroups if I need more than one workgroup (for example if I want to calculate the sum of 1048576 elements). As far as I understand, the more workgroups I use, the more subresults I will get, which also means that I will need more global reductions at the end.
I've seen the answers to the general workgroup size question here. Are there any recommendations that concern reduction operations specifically?
This question is a possible duplicate of one I answered a while back:
What is the algorithm to determine optimal work group size and number of workgroup.
Experimentation will be the best way to know for sure for any given device.
Update:
I think you can safely stick to 1-dimensional work groups, as you have done in your sample code. On the host, you can try out the best values.
For each device:
1) Query CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE.
2) Loop over a few multiples of it and run the kernel with each group size, saving the execution time for each test.
3) When you think you have an optimal value, hard-code it into a new kernel for use with that specific device. This will give a further boost to performance, and you can also eliminate the sdata parameter in the device-specific kernel.
//define your own context, kernel, queue here
cl_int err;
size_t global_size; //set this somewhere to match your test data size
size_t preferred_size;
size_t max_group_size;

err = clGetKernelWorkGroupInfo(kernel, device_id, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, sizeof(size_t), &preferred_size, NULL);
//check err
err = clGetKernelWorkGroupInfo(kernel, device_id, CL_KERNEL_WORK_GROUP_SIZE, sizeof(size_t), &max_group_size, NULL);
//check err

size_t test_size;
//your vars for hi-res timer go here
for (size_t i = preferred_size; i <= max_group_size; i += preferred_size) {
    //reset timer
    test_size = i;
    err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &test_size, 0, NULL, NULL);
    if (err) {
        fail("Unable to enqueue kernel"); //implement your own fail function somewhere..
    } else {
        clFinish(queue);
        //stop timer, save value
        //output timer value and test_size
    }
}
The device-specific kernel can look like this, except the first line should have your optimal value substituted:
#define LOCAL_SIZE 32
__kernel void reduce(__global float* input, __global float* output)
{
    unsigned int tid = get_local_id(0);
    unsigned int stride = get_global_id(0) * 2;
    __local float sdata[LOCAL_SIZE];

    sdata[tid] = input[stride] + input[stride + 1];
    barrier(CLK_LOCAL_MEM_FENCE);

    for (unsigned int s = LOCAL_SIZE >> 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (tid == 0) output[get_group_id(0)] = sdata[0];
}
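As for the sub-results: a common pattern is to launch the reduction repeatedly, feeding the per-group outputs back in as the next pass's input, and to finish the last handful of values on the host. A minimal host-side sketch using the original three-argument kernel (buffer and variable names are assumptions):
// Sketch only: each pass turns n elements into n / (2 * local_size) partial sums.
size_t n = 1048576;            // number of input elements
size_t local_size = 128;       // whatever your testing showed to be best
cl_mem in = bufA, out = bufB;  // two device buffers, ping-ponged between passes

while (n >= 2 * local_size) {
    size_t groups = n / (2 * local_size);   // assumes n divides evenly
    size_t global_size = groups * local_size;
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &in);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &out);
    clSetKernelArg(kernel, 2, local_size * sizeof(cl_float), NULL);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size, 0, NULL, NULL);
    cl_mem tmp = in; in = out; out = tmp;   // the partial sums become the next input
    n = groups;
}

// read back the remaining n partial sums and add them up on the CPU
float partial[2 * 128];        // worst case: fewer than 2 * local_size values remain
clEnqueueReadBuffer(queue, in, CL_TRUE, 0, n * sizeof(float), partial, 0, NULL, NULL);
float sum = 0.0f;
for (size_t i = 0; i < n; i++) sum += partial[i];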
Ok, this may seem trivial to some, but I'm stuck.
Here's the algorithm I'm supposed to use:
Here’s a recursive algorithm. Suppose we have n integers in a non-increasing sequence, of which the first is the number k. Subtract one from each of the first k numbers after the first. (If there are fewer than k such numbers, the sequence is not graphical.) If necessary, sort the resulting sequence of n-1 numbers (ignoring the first one) into a non-increasing sequence. The original sequence is graphical if and only if the second one is. For the stopping conditions, note that a sequence of all zeroes is graphical, and a sequence containing a negative number is not. (The proof of this is not difficult, but we won’t deal with it here.)
Example:
Original sequence: 5, 4, 3, 3, 2, 1, 1
Subtract 1 five times: 3, 2, 2, 1, 0, 1
Sort: 3, 2, 2, 1, 1, 0
Subtract 1 three times: 1, 1, 0, 1, 0
Sort: 1, 1, 1, 0, 0
Subtract 1 once: 0, 1, 0, 0
Sort: 1, 0, 0, 0
Subtract 1 once: -1, 0, 0
We have a negative number, so the original sequence is not graphical.
This seems simple enough to me, but when I try to execute the algorithm I get stuck.
Here's the function I've written so far:
#include <iostream>
#include <fstream>
#include <cstdlib>
using namespace std;

bool isGraphical(int degrees[], int start, int end);
void inSort(int x[], int length);

//main
int main()
{
    //local variables
    const int MAX = 30;
    ifstream in;
    ofstream out;
    int graph[MAX], size;
    bool isGraph;

    //open and test file
    in.open("input3.txt");
    if (!in) {
        cout << "Error reading file. Exiting program." << endl;
        exit(1);
    }
    out.open("output3.txt");

    while (in >> size) {
        for (int i = 0; i < size; i++) {
            in >> graph[i];
        }
        isGraph = isGraphical(graph, 0, size);
        if (isGraph) {
            out << "Yes\n";
        } else {
            out << "No\n";
        }
    }

    //close all files
    in.close();
    out.close();
    cin.get();
    return 0;
}//end main
bool isGraphical(int degrees[], int start, int end) {
    bool isIt = false;
    int ender;

    inSort(degrees, end);
    ender = degrees[start] + start + 1;

    for (int i = 0; i < end; i++)
        cout << degrees[i];
    cout << endl;

    if (degrees[start] == 0) {
        if (degrees[end-1] < 0)
            return false;
        else
            return true;
    }
    else {
        for (int i = start + 1; i < ender; i++) {
            degrees[i]--;
        }
        isIt = isGraphical(degrees, start+1, end);
    }
    return isIt;
}
void inSort(int x[], int length)
{
    for (int i = 0; i < length; ++i)
    {
        int current = x[i];
        int j;
        for (j = i-1; j >= 0 && current > x[j]; --j)
        {
            x[j+1] = x[j];
        }
        x[j+1] = current;
    }
}
I seem to get what the sort function is doing, but when I debug, the values keep jumping around, which I assume is coming from my recursive function.
Any help?
EDIT:
Code is functional. Please see the history if needed.
With help from @RMartinhoFernandes I updated my code. It now includes a working insertion sort.
I updated the inSort function boundaries.
I added an additional ending condition from the comments. But the algorithm still isn't working, which makes me think my base cases are off. Would anyone be able to help further? What am I missing here?
Ok, I helped you out in chat, and I'll post a summary of the issues you had here.
The insertion sort inner loop should go backwards, not forwards. Make it for(i = (j - 1); (i >= 0) && (key > x[i]); i--);
There's an out-of-bounds access in the recursion base case: degrees[end] should be degrees[end-1];
while (!in.eof()) will not read until the end-of-file. while(in >> size) is a superior alternative.
Are you sure ender does not go beyond end? The value of ender is derived from degrees[start], which could exceed end.
Then you are using ender in the for loop:
for(int i = start+1; i < ender; i++){ //i guess it should be end here
I think your insertion sort algorithm isn't right. Try this one (note that this sorts it in the opposite order from what you want though). Also, you want
for(int i = start + 1; i < ender + start + 1; i++) {
instead of
for(int i = start+1; i < ender; i++)
Also, as mentioned in the comments, you want to check if degrees[end - 1] < 0 instead of degrees[end].
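For comparison, here is a minimal standalone sketch of the recursive check described in the algorithm above (a hypothetical helper, using std::vector and std::sort rather than the in-place inSort, so it sidesteps the boundary issues discussed here):
#include <algorithm>
#include <functional>
#include <vector>

// Returns true if the degree sequence is graphical, following the
// recursive procedure quoted in the question.
bool isGraphicalSeq(std::vector<int> deg)
{
    std::sort(deg.begin(), deg.end(), std::greater<int>()); // non-increasing order
    if (deg.empty() || deg[0] == 0) return true;   // all zeroes -> graphical
    int k = deg[0];
    if (k > (int)deg.size() - 1) return false;     // fewer than k numbers after the first
    for (int i = 1; i <= k; ++i) {
        if (--deg[i] < 0) return false;            // a negative number -> not graphical
    }
    deg.erase(deg.begin());                        // drop the first element
    return isGraphicalSeq(deg);                    // recurse on the remaining n-1 numbers
}
Running this on the example sequence 5, 4, 3, 3, 2, 1, 1 returns false, matching the worked example above.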