Writing to Global Memory Causing Crash in OpenCL in For Loop - opencl

One of my OpenCL helper functions writes to global memory. With the write in one place it runs just fine and the kernel executes normally. When the same write is placed directly after the next line, the kernel freezes or crashes and my program can't function.
The values in this function change (different values for an NDRange of 2^16), and therefore the loops change as well, and not all threads can execute the same code because of the conditionals.
Why exactly is this an issue? Am I missing some kind of memory blocking or something?
void add_world_seeds(yada yada yada...., const uint global_id, __global long* world_seeds) {
    for (; indexer < (1 << 16); indexer += increment) {
        long k = (indexer << 16) + c;
        long target2 = (k ^ e) >> 16;
        long second_addend = get_partial_addend(k, x, z) & MASK_16;
        if (ctz(target2 - second_addend) < mult_trailing_zeroes) { continue; }
        long a = (((first_mult_inv * (target2 - second_addend)) >> mult_trailing_zeroes) ^ (J1_MUL >> 32)) & mask;
        for (; a < (1 << 16); a += increment) {
            world_seeds[global_id] = (a << 32) + k; //WORKS HERE
            if (get_population_seed((a << 32) + k, x, z) != population_seed_state) { continue; }
            world_seeds[global_id] = (a << 32) + k; //DOES NOT WORK HERE
        }
    }
}

There was in fact a bug causing undefined behavior in the code: the main reversal kernel took an argument called "increment", and inside that same kernel I defined another variable also called increment. It compiled fine but led to completely wrong results and memory crashes.
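To illustrate the kind of name clash described above, here is a minimal sketch in plain C (not the original kernel). The function names are hypothetical, and it assumes the second declaration sat in a nested block, since that is where C lets a local silently hide a parameter.

```c
#include <assert.h>

/* Hypothetical stand-in for the kernel: `increment` is a parameter, and the
   nested block declares a second `increment` that hides it. */
static int count_steps_shadowed(int limit, int increment)
{
    int steps = 0;
    (void)increment;            /* the caller's value is never consulted below */
    {
        int increment = 1;      /* shadows the parameter inside this block */
        for (int i = 0; i < limit; i += increment)
            steps++;
    }
    return steps;
}

/* Same loop without the inner declaration: the parameter is what gets used. */
static int count_steps(int limit, int increment)
{
    int steps = 0;
    for (int i = 0; i < limit; i += increment)
        steps++;
    return steps;
}
```

Calling both with a limit of 16 and an increment of 4 gives different step counts, because the shadowed version always advances by 1. With GCC or Clang on host code, the -Wshadow warning flags this pattern; whether a given OpenCL compiler reports it is something to check.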


Weird behavior of dpc++ code after running it on FPGA device

I am using DPC++ to accelerate a k-NN algorithm on an FPGA device. The following is the code I wrote for the Euclidean distance. The problem is that fpga_emulation works with no problems, while running on FPGA hardware (Intel Arria 10 OneAPI) gives -nan for all values in the resulting buffer, which means something went wrong in the parallel_for loop. But I can't find anything wrong with it, and the emulation worked.
I am using Intel Devcloud platform.
std::vector<double> distance_calculation_FPGA(queue& q, const std::vector<std::vector<double>>& dataset, const std::vector<double>& curr_test) {
std::cout<<"convert 2D to 1D"<<std::endl;
for (int i = 0; i < dataset.size(); ++i) {
for (int j = 0; j < dataset[i].size(); ++j) {
range<1> num_items{dataset.size()};
//std::cout << "im in" << std::endl;
buffer dataset_buf(linear_dataset);
buffer curr_test_buf(curr_test);
buffer res_buf(res.data(), num_items);
std::cout<<"submit a job"<<std::endl;
auto start = std::chrono::high_resolution_clock::now();
q.submit([&](handler& h) {
accessor a(dataset_buf, h, read_only);
accessor b(curr_test_buf, h, read_only);
accessor dif(res_buf, h, write_only, no_init);
h.parallel_for(num_items, [=](auto i) {
for (int j = 0; j < 5; ++j) {
dif[i] += (b[j] - a[i * 5 + j]) * (b[j] - a[i * 5 + j]);
// out << "i : " << i << " i[0]: " << i[0] << " b: " << b[0] << cl::sycl::endl;
auto finish = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> elapsed = finish - start;
std::cout << "Elapsed time: " << elapsed.count() << " s\n";
/* Iterative distance calculation
for (int i = 0; i < dataset.size(); ++i) {
double dis = 0;
for (int j = 0; j < dataset[i].size(); ++j) {
dis += (curr_test[j] - dataset[i][j]) * (curr_test[j] - dataset[i][j]);
return res;
results with fpga_emulation: ./knn.fpga_emu
results for fpga hardware: ./knn.fpga
A question about your usage: with a NaN we are usually looking at uninitialized memory (or a divide by 0, which you don't have). Is it possible the ranges are somehow off on the FPGA, and/or the values aren't properly initialized for the array indices?
Sorry I know that's pretty basic, but without your dataset I'm not 100% sure I can reproduce it.
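One place where uninitialized memory could plausibly come in: the posted kernel creates the dif accessor as write_only with no_init, yet updates it with +=, which reads each element before anything has been written to it. The sketch below is plain C rather than SYCL, and only shows the pattern of accumulating in a local variable and storing once. The names squared_distances and DIMS are hypothetical; the value 5 is taken from the loop bound in the post.

```c
#include <assert.h>
#include <stddef.h>

#define DIMS 5  /* feature count; matches the hard-coded 5 in the posted kernel */

/* Writes out[i] = squared Euclidean distance between row i of `data`
   (row-major, DIMS values per row) and `query`.
   out[] may hold garbage on entry; it is never read. */
static void squared_distances(const double *data, size_t rows,
                              const double *query, double *out)
{
    for (size_t i = 0; i < rows; ++i) {
        double sum = 0.0;                 /* explicit starting value */
        for (size_t j = 0; j < DIMS; ++j) {
            double d = query[j] - data[i * DIMS + j];
            sum += d * d;
        }
        out[i] = sum;                     /* single write, no read of out[i] */
    }
}
```

In the SYCL kernel the equivalent change would be a local double inside the parallel_for body, assigned to dif[i] after the inner loop. Whether that alone removes the NaNs on the Arria 10 is not confirmed here.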

MPI program runtime error MPI_GATHER, qsub mpijobparallel

I am trying to run this fast Fourier transform implementation. It compiles fine but gives this error at runtime. I have no idea what the error means. Can anyone help me out?
I compiled and ran the program with:
mpicc -o exec test.c
This is code that I found on GitHub. It's the parallel version of the fast Fourier transform algorithm.
#include <stdio.h>
#include <mpi.h> //To use MPI
#include <complex.h> //to use complex numbers
#include <math.h> //for cos() and sin()
#include "timer.h" //to use timer
#define PI 3.14159265
#define bigN 16384 //Problem Size
#define howmanytimesavg 3
int main()
int my_rank,comm_sz;
MPI_Init(NULL,NULL); //start MPI
MPI_Comm_size(MPI_COMM_WORLD,&comm_sz); ///how many processes are we
MPI_Comm_rank(MPI_COMM_WORLD,&my_rank); //which process is this?
double start,finish;
double avgtime = 0;
FILE *outfile;
int h;
if(my_rank == 0) //if process 0 open outfile
outfile = fopen("ParallelVersionOutput.txt", "w"); //open from current
for(h = 0; h < howmanytimesavg; h++) //loop to run multiple times for AVG
if(my_rank == 0) //If it's process 0 starts timer
start = MPI_Wtime();
int i,k,n,j; //Basic loop variables
double complex evenpart[(bigN / comm_sz / 2)]; //array to save the data
double complex oddpart[(bigN / comm_sz / 2)]; //array to save the data
double complex evenpartmaster[ (bigN / comm_sz / 2) * comm_sz]; //array to save the data for EVENHALF
double complex oddpartmaster[ (bigN / comm_sz / 2) * comm_sz]; //array to save the data for ODDHALF
double storeKsumreal[bigN]; //store the K real variable so we can abuse symmetry
double storeKsumimag[bigN]; //store the K imaginary variable so we can abuse symmetry
double subtable[(bigN / comm_sz)][3]; //Each process owns a subtable from the table below
double table[bigN][3] = //TABLE of numbers to use
0,3.6,2.6, //n, Real,Imaginary CREATES TABLE
if(bigN > 8) //Everything after row 8 is all 0's
for(i = 8; i < bigN; i++)
table[i][0] = i;
for(j = 1; j < 3;j++)
table[i][j] = 0.0; //set to 0.0
int sendandrecvct = (bigN / comm_sz) * 3; //how much to send and receive
MPI_Scatter(table,sendandrecvct,MPI_DOUBLE,subtable,sendandrecvct,MPI_DOUBLE,0,MPI_COMM_WORLD); //scatter the table to subtables
for (k = 0; k < bigN / 2; k++) //K coeffiencet Loop
/* Variables used for the computation */
double sumrealeven = 0.0; //sum of real numbers for even
double sumimageven = 0.0; //sum of imaginary numbers for even
double sumrealodd = 0.0; //sum of real numbers for odd
double sumimagodd = 0.0; //sum of imaginary numbers for odd
for(i = 0; i < (bigN/comm_sz)/2; i++) //Sigma loop EVEN and ODD
double factoreven , factorodd = 0.0;
int shiftevenonnonzeroP = my_rank * subtable[2*i][0]; //used to shift index numbers for correct results for EVEN.
int shiftoddonnonzeroP = my_rank * subtable[2*i + 1][0]; //used to shift index numbers for correct results for ODD.
/* -------- EVEN PART -------- */
double realeven = subtable[2*i][1]; //Access table for real number at spot 2i
double complex imaginaryeven = subtable[2*i][2]; //Access table for imaginary number at spot 2i
double complex componeeven = (realeven + imaginaryeven * I); //Create the first component from table
if(my_rank == 0) //if proc 0, dont use shiftevenonnonzeroP
factoreven = ((2*PI)*((2*i)*k))/bigN; //Calculates the even factor for Cos() and Sin()
// *********Reduces computational time*********
else //use shiftevenonnonzeroP
factoreven = ((2*PI)*((shiftevenonnonzeroP)*k))/bigN; //Calculates the even factor for Cos() and Sin()
// *********Reduces computational time*********
double complex comptwoeven = (cos(factoreven) - (sin(factoreven)*I)); //Create the second component
evenpart[i] = (componeeven * comptwoeven); //store in the evenpart array
/* -------- ODD PART -------- */
double realodd = subtable[2*i + 1][1]; //Access table for real number at spot 2i+1
double complex imaginaryodd = subtable[2*i + 1][2]; //Access table for imaginary number at spot 2i+1
double complex componeodd = (realodd + imaginaryodd * I); //Create the first component from table
if (my_rank == 0)//if proc 0, dont use shiftoddonnonzeroP
factorodd = ((2*PI)*((2*i+1)*k))/bigN;//Calculates the odd factor for Cos() and Sin()
// *********Reduces computational time*********
else //use shiftoddonnonzeroP
factorodd = ((2*PI)*((shiftoddonnonzeroP)*k))/bigN;//Calculates the odd factor for Cos() and Sin()
// *********Reduces computational time*********
double complex comptwoodd = (cos(factorodd) - (sin(factorodd)*I));//Create the second component
oddpart[i] = (componeodd * comptwoodd); //store in the oddpart array
/*Process ZERO gathers the even and odd part arrays and creates a evenpartmaster and oddpartmaster array*/
MPI_Gather(evenpart,(bigN / comm_sz / 2),MPI_DOUBLE_COMPLEX,evenpartmaster,(bigN / comm_sz / 2), MPI_DOUBLE_COMPLEX,0,MPI_COMM_WORLD);
MPI_Gather(oddpart,(bigN / comm_sz / 2),MPI_DOUBLE_COMPLEX,oddpartmaster,(bigN / comm_sz / 2), MPI_DOUBLE_COMPLEX,0,MPI_COMM_WORLD);
if(my_rank == 0)
for(i = 0; i < (bigN / comm_sz / 2) * comm_sz; i++) //loop to sum the EVEN and ODD parts
sumrealeven += creal(evenpartmaster[i]); //sums the realpart of the even half
sumimageven += cimag(evenpartmaster[i]); //sums the imaginarypart of the even half
sumrealodd += creal(oddpartmaster[i]); //sums the realpart of the odd half
sumimagodd += cimag(oddpartmaster[i]); //sums the imaginary part of the odd half
storeKsumreal[k] = sumrealeven + sumrealodd; //add the calculated reals from even and odd
storeKsumimag[k] = sumimageven + sumimagodd; //add the calculated imaginary from even and odd
storeKsumreal[k + bigN/2] = sumrealeven - sumrealodd; //ABUSE symmetry Xkreal + N/2 = Evenk - OddK
storeKsumimag[k + bigN/2] = sumimageven - sumimagodd; //ABUSE symmetry Xkimag + N/2 = Evenk - OddK
if(k <= 10) //Do the first 10 K's
if(k == 0)
fprintf(outfile," \n\n TOTAL PROCESSED SAMPLES : %d\n",bigN);
fprintf(outfile,"XR[%d]: %.4f XI[%d]: %.4f \n",k,storeKsumreal[k],k,storeKsumimag[k]);
if(my_rank == 0)
GET_TIME(finish); //stop timer
double timeElapsed = finish-start; //Time for that iteration
avgtime = avgtime + timeElapsed; //AVG the time
fprintf(outfile,"Time Elaspsed on Iteration %d: %f Seconds\n", (h+1),timeElapsed);
if(my_rank == 0)
avgtime = avgtime / howmanytimesavg; //get avg time
fprintf(outfile,"\nAverage Time Elaspsed: %f Seconds", avgtime);
fclose(outfile); //CLOSE file ONLY proc 0 can.
MPI_Barrier(MPI_COMM_WORLD); //wait to all proccesses to catch up before finalize
MPI_Finalize(); //End MPI
return 0;
Fatal error in PMPI_Gather: Invalid datatype, error stack:
PMPI_Gather(904): MPI_Gather(sbuf=0x7fffb62799a0, scount=8192,
MPI_DATATYPE_NULL, rbuf=0x7fffb6239980, rcount=8192, MPI_DATATYPE_NULL,
root=0, MPI_COMM_WORLD) failed
PMPI_Gather(815): Datatype for argument sendtype is a null datatype
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=537490947
system msg for write_line failure : Bad file descriptor
There is no MPI_DATATYPE_NULL in your code, but you only use MPI_DOUBLE_COMPLEX. Note the latter type is a Fortran datatype, and using it in C is not correct strictly speaking.
My guess is that MPI_DOUBLE_COMPLEX is causing the issue (type not defined or not initialized because you invoked the C version of MPI_Init()).
You can obviously rewrite your code in Fortran, or use your own derived datatype for a C double complex number.
Meanwhile, I suggest you write simple C and Fortran helloworld programs that use MPI_DOUBLE_COMPLEX (MPI_Bcast() of one element for example) to confirm the issue is with MPI_DOUBLE_COMPLEX and is restricted to C or not.
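As a sketch of one way around the Fortran datatype: MPI 2.2 and later define MPI_C_DOUBLE_COMPLEX as the C counterpart, and where that is unavailable each complex value can be sent as two MPI_DOUBLE elements, because C99 stores a double complex as a real part followed by an imaginary part. The helpers below are hypothetical names and contain no MPI calls; they only show the layout and the doubled count.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* C99 lays out each complex value as two adjacent doubles (real, imaginary),
   so an array of n double complex can be viewed as 2*n doubles. */
static const double *complex_as_doubles(const double complex *z)
{
    return (const double *)z;
}

/* Number of MPI_DOUBLE elements needed to carry n_complex double complex values. */
static int doubles_for(int n_complex)
{
    return 2 * n_complex;
}

/* Assumed usage in the posted program, with count = bigN / comm_sz / 2:
 *
 *   MPI_Gather(evenpart,       doubles_for(count), MPI_DOUBLE,
 *              evenpartmaster, doubles_for(count), MPI_DOUBLE,
 *              0, MPI_COMM_WORLD);
 */
```

If the installed MPI library does provide MPI_C_DOUBLE_COMPLEX, swapping it in for MPI_DOUBLE_COMPLEX and leaving the counts unchanged is the smaller edit. Which of the two applies depends on the MPI version in use.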

Locking not working with OpenCL

I'm stuck with an issue in my OpenCL code where I try to synch inside a kernel:
__kernel void pdiffs (__global const long2 *inData, __global const long2 *inData2, __global long2 *outData) {
    long2 diffSum = 0;
    uint idx0 = get_local_size(0)*get_group_id(0);
    for (uint idx=idx0; idx<idx0+get_local_size(0); idx += 1) {
        diffSum += inData[idx] - inData2[idx];
    }
    outData[get_group_id(0)] = diffSum;
    printf("%d %d %d %d/%d\n", get_group_id(0), get_num_groups(0), get_local_size(0), diffSum.x, diffSum.y);
    if (get_group_id(0) == 0) {
        for (size_t i = 1; i < get_num_groups(0); i++){
            outData[0] += outData[i];
            printf("v(%d): %d/%d\n", i, outData[i].x, outData[i].y);
        }
    }
}
(I know that this piece of code is simply bad...)
I just thought that the barrier will synch the single groups so the values in outData are defined. But my trace shows that some differences are not calculated and contain zero (I setup my data so all shall return the value 1, but some display as 0). Further it makes a difference whether I have printf statements or not. Without printf even more differences seem to be incorrect.
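For context, barrier() in OpenCL only synchronizes the work-items of a single work-group, so group 0 has no guarantee that the other groups have written their outData entries yet. A common alternative is a two-stage reduction: the kernel only writes one partial per group, and the final sum happens after the kernel has completed. Below is a minimal host-side sketch of that second stage in plain C. The type long2_t and the function sum_partials are hypothetical names; real host code would typically use cl_long2 and fill the array with a blocking clEnqueueReadBuffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-side mirror of the kernel's long2. */
typedef struct { int64_t x, y; } long2_t;

/* Second stage of the reduction. It runs on the host after the kernel has
   finished, so every per-group partial in `partials` has been written. */
static long2_t sum_partials(const long2_t *partials, size_t num_groups)
{
    long2_t total = {0, 0};
    for (size_t g = 0; g < num_groups; ++g) {
        total.x += partials[g].x;
        total.y += partials[g].y;
    }
    return total;
}
```

With this split, the kernel would end right after outData[get_group_id(0)] = diffSum, and the if (get_group_id(0) == 0) block would no longer be needed. A second small kernel launch could do the same job on the device.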

OpenCL: Expected identifier in kernel

I am running the following kernel on Windows 7 (64-bit) with an Intel CPU and HD Graphics.
I get a very strange error reported by clGetProgramBuildInfo for the following code:
#define BLOCK_SIZE 256
__kernel void reduce4(__global uint* input, __global uint* output, __local uint* sdata)
{
    unsigned int tid = get_local_id(0);
    unsigned int bid = get_group_id(0);
    unsigned int gid = get_global_id(0);
    unsigned int blockSize = get_local_size(0);
    unsigned int index = bid*(BLOCK_SIZE*2) + tid;
    sdata[tid] = input[index] + input[index+BLOCK_SIZE];
    for(unsigned int s = BLOCK_SIZE/2; s > 64 ; s >>= 1) {
        // Unrolling the last wavefront and we cut 7 iterations of this
        // for-loop while we practice wavefront-programming
        if(tid < s)
            sdata[tid] += sdata[tid + s];
    }
    if (tid < 64) {
        if (blockSize >= 128) sdata[tid] += sdata[tid + 64];
        if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
        if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
        if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
        if (blockSize >= 8) sdata[tid] += sdata[tid + 4];
        if (blockSize >= 4) sdata[tid] += sdata[tid + 2];
        if (blockSize >= 2) sdata[tid] += sdata[tid + 1];
    }
    // write result for this block to global mem
    if(tid == 0)
        output[bid] = sdata[0];
}
It always says:
Compilation started
:38:2: error: expected identifier or '('
Compilation failed
This is for the last line, where I have put the closing }. What is wrong here?
This is how I am reading the kernel file:
int offset = 0;
for(int i = 0; i < numOfDevices; ++i, ++offset ) {
/* Load the two source files into temporary datastores */
const char *file_names[] = {"SimpleOptimizations.cl"};
const int NUMBER_OF_FILES = 1;
char* buffer[NUMBER_OF_FILES];
size_t sizes[NUMBER_OF_FILES];
loadProgramSource(file_names, NUMBER_OF_FILES, buffer, sizes);
/* Create the OpenCL program object */
program = clCreateProgramWithSource(context, NUMBER_OF_FILES, (const char**)buffer, sizes, &error);
if(error != CL_SUCCESS) {
perror("Can't create the OpenCL program object");
Definition of loadProgramSource
void loadProgramSource(const char** files,
                       size_t length,
                       char** buffer,
                       size_t* sizes) {
    /* Read each source file (*.cl) and store the contents into a temporary datastore */
    for(size_t i=0; i < length; i++) {
        FILE* file = fopen(files[i], "r");
        if(file == NULL) {
            perror("Couldn't read the program file");
        }
        fseek(file, 0, SEEK_END);
        sizes[i] = ftell(file);
        rewind(file); // reset the file pointer so that 'fread' reads from the front
        buffer[i] = (char*)malloc(sizes[i]+1);
        buffer[i][sizes[i]] = '\0';
        fread(buffer[i], sizeof(char), sizes[i], file);
    }
}
I believe this is an issue with the way Windows deals with text files opened with fopen(). If you take a look at the MSDN page for fopen(), it indicates that if you open a file with just "r" as the mode string, line endings are translated as the file is read. This means that the file size you query may not match the amount of data actually read by fread().
To solve this, simply change the mode string to indicate that you wish to read the file as binary data (i.e. without any pesky translations):
FILE* file = fopen(files[i], "rb");
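Beyond switching to "rb", it can help to trust the byte count returned by fread() and not the size computed from ftell(). The following is a minimal sketch of a single-file loader along those lines; read_source is a hypothetical helper, not the asker's loadProgramSource.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Reads a whole file in binary mode. On success returns a NUL-terminated
   buffer (caller frees) and stores the number of bytes actually read in
   *out_size. Returns NULL on failure. */
static char *read_source(const char *path, size_t *out_size)
{
    FILE *file = fopen(path, "rb");   /* binary: no line-ending translation */
    if (file == NULL)
        return NULL;

    if (fseek(file, 0, SEEK_END) != 0) { fclose(file); return NULL; }
    long end = ftell(file);
    if (end < 0) { fclose(file); return NULL; }
    rewind(file);

    char *buf = malloc((size_t)end + 1);
    if (buf == NULL) { fclose(file); return NULL; }

    size_t got = fread(buf, 1, (size_t)end, file);
    fclose(file);

    buf[got] = '\0';                  /* terminate at what was actually read */
    *out_size = got;
    return buf;
}
```

The size stored in *out_size is what would then be passed to clCreateProgramWithSource, so the compiler never sees bytes past the end of the real source text.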

Valgrind error: Invalid read of size 1

I can't find the error in this code; I've been looking at it for hours. Valgrind says:
==23114== Invalid read of size 1
==23114== Invalid write of size 1
I tried debugging with some printfs, and I think that the error is in this function.
void rdm_hide(char *name, Byte* img, Byte* bits, int msg, int n, int size)
FILE *fp;
int r;
Byte* used;
int i = 0, j = 0;
int p;
fp = fopen(name, "wb");
used = malloc(sizeof(Byte) * msg);
for(i = 0; i < msg; i++)
used[i] = -1;
while(i < 3)
if(img[j] == '\n')
for(i = 0; i < msg; i++)
r = genrand_int32();
p = r % n;
if(!search(p, used, msg))
used[i] = (Byte)p;
if(bits[i] == (Byte)0)
img[j + p] = img[j + p] & (~1);
else if(bits[i] == (Byte)1)
img[j + p] = img[j + p] | 1;
i --;
for(i = 0; i < size; i++)
fputc( (char) img[i], fp);
Thanks for help!
==23114== Invalid read of size 1
==23114== Invalid write of size 1
I am pretty sure that's not all valgrind says.
You should
Build your program with debug info (most likely -g flag). This will let valgrind tell you exactly which line triggers invalid read and write
If the problem doesn't become obvious, edit your question and include entire valgrind output.
Re-running valgrind --track-origins=yes your-exe may provide additional useful info.
Lastly, your algorithm appears to be totally bogus. As far as I can tell, the j becomes 3 after the first while loop and never changes after that (in which case you should just use const int j = 3; and do away with j++). Also, you reference img[j + p], where p is between 0 and n. If n is indeed the size of img, then it's little surprise that j + p indexes outside of the img limits, and triggers both errors.
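To make the out-of-bounds point concrete, here is a minimal sketch of a bounds-checked version of the img[j + p] update. The helper set_lsb and its parameters are hypothetical, and Byte is assumed to be unsigned char, since the post does not show its definition.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char Byte;  /* assumed definition; the post does not show it */

/* Sets the least significant bit of img[header_len + offset] to `bit`.
   Returns 0 on success, -1 if the index would fall outside img[0..size). */
static int set_lsb(Byte *img, size_t size, size_t header_len, size_t offset, Byte bit)
{
    if (header_len >= size || offset >= size - header_len)
        return -1;                    /* would read or write past the buffer */

    size_t idx = header_len + offset;
    if (bit)
        img[idx] |= 1u;
    else
        img[idx] &= (Byte)~1u;
    return 0;
}
```

In the posted function this would correspond to drawing p from the range below size - j, not below n. Also, if genrand_int32() is the Mersenne Twister reference function, it returns an unsigned 32-bit value, so storing it in an int can make r and therefore p negative. Keeping r unsigned avoids indexing before the start of img.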
