Explicit FDM with CUDA

I am trying to port the following code to CUDA. The first version is serial, and I am confident its results are correct. The second version adds CUDA, and I expect it to give the same result, but the kernel does not seem to do anything: I get back the initial values of u and v. I am inexperienced with CUDA, so the bug may be obvious, but I cannot find it. Also, please do not recommend flattening the arrays, because that makes the indexing harder for me to follow.
First version:
#include <fstream>
#include <iostream>
#include <math.h>
#include <vector>
#include <chrono>
#include <omp.h>
using namespace std;
const int M = 1024;
const int N = 1024;
const double A = 1;
const double B = 3;
const double Du = 5 * pow(10, -5);
const double Dv = 5 * pow(10, -6);
const int Max_Itr = 1000;
const double h = 1.0 / static_cast<double>(M - 1);
const double delta_t = 0.0025;
const double s1 = (Du * delta_t) / pow(h, 2);
const double s2 = (Dv * delta_t) / pow(h, 2);
int main() {
    // u and v are arrays of M row pointers; each row is a separate allocation of N doubles.
    double** u = new double*[M];
    double** v = new double*[M];
    for (int i = 0; i < M; i++) {
        u[i] = new double[N];
        v[i] = new double[N];
    }
    for (int j = 0; j < M; j++) {
        for (int i = 0; i < N; i++) {
            // (the initial values of u and v are not shown in the post)
        }
    }
    // Explicit time stepping over the interior points, updating u and v in place.
    for (int k = 1; k < Max_Itr; k++) {
        for (int i = 1; i < N - 1; i++) {
            for (int j = 1; j < M - 1; j++) {
                u[i][j] = ((1 - (4 * s1)) * u[i][j]) + (s1 * (u[i + 1][j] + u[i - 1][j] + u[i][j + 1] + u[i][j - 1])) +
                    (A * delta_t) + (delta_t * pow(u[i][j], 2) * v[i][j]) - (delta_t * (B + 1) * u[i][j]);
                v[i][j] = ((1 - (4 * s2)) * v[i][j]) + (s2 * (v[i + 1][j] + v[i - 1][j] + v[i][j + 1] + v[i][j - 1])) + (B * delta_t * u[i][j])
                    - (delta_t * pow(u[i][j], 2) * v[i][j]);
            }
        }
    }
    cout << "u: " << u[512][512] << " v: " << v[512][512] << endl;
    return 0;
}
Second version:
#include <fstream>
#include <iostream>
#include <math.h>
#include <vector>
using namespace std;
#define M 1024
#define N 1024
__global__ void my_kernel(double** v, double** u){
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    double A = 1;
    double B = 3;
    int Max_Itr = 1000;
    double delta_t = 0.0025;
    double Du = 5 * powf(10, -5);
    double Dv = 5 * powf(10, -6);
    double h = 1.0 / (M - 1);
    double s1 = (Du * delta_t) / powf(h, 2);
    double s2 = (Dv * delta_t) / powf(h, 2);
    for (int k = 1; k < Max_Itr; k++) {
        u[i][j] = ((1 - (4 * s1))
            * u[i][j]) + (s1 * (u[i + 1][j] + u[i - 1][j] + u[i][j + 1] + u[i][j - 1])) +
            (A * delta_t) + (delta_t * pow(u[i][j], 2) * v[i][j]) - (delta_t * (B + 1) * u[i][j]);
        v[i][j] = ((1 - (4 * s2))
            * v[i][j]) + (s2 * (v[i + 1][j] + v[i - 1][j] + v[i][j + 1] + v[i][j - 1])) + (B * delta_t * u[i][j])
            - (delta_t * pow(u[i][j], 2) * v[i][j]);
    }
}
int main() {
    double** u = new double*[M];
    double** v = new double*[M];
    for (int i = 0; i < M; i++) {
        u[i] = new double[N];
        v[i] = new double[N];
    }
    dim3 blocks(32, 32);
    dim3 grids(M/32 + 1, N/32 + 1);
    for (int j = 0; j < M; j++) {
        for (int i = 0; i < N; i++) {
            // (the initial values of u and v are not shown in the post)
        }
    }
    double **u_d, **v_d;
    int d_size = N * M * sizeof(double);
    cudaMalloc(&u_d, d_size);
    cudaMalloc(&v_d, d_size);
    cudaMemcpy(u_d, u, d_size, cudaMemcpyHostToDevice);
    cudaMemcpy(v_d, v, d_size, cudaMemcpyHostToDevice);
    my_kernel<<<grids, blocks>>> (v_d, u_d);
    cudaMemcpy(v, v_d, d_size, cudaMemcpyDeviceToHost);
    cudaMemcpy(u, u_d, d_size, cudaMemcpyDeviceToHost);
    cout << "u: " << u[512][512] << " v: " << v[512][512] << endl;
    return 0;
}
What I expect from the second version is :
u: 0.2815 v: 1.7581

In the first version of your program, the two-dimensional array is implemented as an array of pointers, each pointing to a separately allocated array of double values.
In the second version you use the same pointer-to-pointer-to-double type on the device, but you never set up device-side rows for the actual data. What gets copied to the GPU is the array of pointers, not the values, and those pointers are useless on the device anyway, since they point to host-side memory.
What most likely happens is that your kernel tries to access memory at an invalid address, and its execution is aborted.
If you checked the CUDA API calls for errors, as @njuffa noted, you would see that this is what happened.
You could avoid the multiple memory allocations by using a single data area instead of a separate allocation for each row; that applies to both versions of your program. That would not quite be array flattening, because you can keep an array of row pointers into that area and still write u[i][j]. See an explanation of how to do this (C-language style) on this page.
Note, however, that the double dereferencing you insist on in your kernel is likely to slow it down significantly.
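As a minimal host-side sketch of the single-data-area idea, assuming plain C++: the `Grid2D` helper below is hypothetical and not part of the question's code. It keeps all values in one contiguous block and layers a table of row pointers on top, so the `u[i][j]` syntax survives.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the post): an nrows x ncols grid of doubles
// stored in ONE contiguous block, with a table of row pointers on top.
struct Grid2D {
    std::vector<double> data;   // the single data area, nrows * ncols values
    std::vector<double*> rows;  // rows[i] points at the start of row i inside data

    Grid2D(std::size_t nrows, std::size_t ncols, double init = 0.0)
        : data(nrows * ncols, init), rows(nrows) {
        assert(nrows > 0 && ncols > 0);
        for (std::size_t i = 0; i < nrows; ++i) {
            rows[i] = data.data() + i * ncols;
        }
    }

    // Copying would leave `rows` pointing into the source object's data.
    Grid2D(const Grid2D&) = delete;
    Grid2D& operator=(const Grid2D&) = delete;

    // Same pointer-to-pointer view as `double** u` in the question.
    double** view() { return rows.data(); }
};
```

With `Grid2D g(M, N); double** u = g.view();` the serial loops can stay as they are. On the GPU the same layout would need only one allocation and one copy for the data block, but the row-pointer table would have to be rebuilt from device addresses; that part is not shown here.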


Solving DDE system

I am trying to solve a system of delay differential equations in C++. I am new to coding, so any recommendations are welcome; I would like to improve my code. What I want to do: initialize the history array, then start solving the differential equation while overwriting the history array. But I get this error message:
terminate called after throwing an instance of 'std::out_of_range'
what(): vector::_M_range_check: __n (which is 9999) >= this->size() (which is 9999)
It seems that the history arrays are accessed out of range. I put a std::cout after the second if-condition to check whether the code reaches the second for-loop, but it does not. Since I am learning C++ by doing, the problem isn't really clear to me. I hope someone can spot the error. Please don't hesitate to suggest improvements to my code; I would really appreciate it!
Thanks for your help!
#include <iostream>
#include <vector>
#include <cmath>
#include <iomanip>
#include <fstream>
const double pi = 3.14159265358979323846;
int tau = 1;
//initial values
double x = 1.0;
double y = 1.0;
double t = 0.0;
//constants and parameters
double K = 0.25;
double lam = 0.5;
double omega = pi;
double dx, dy;
double dt = pow(10.0, -4.0);
//number of steps
int Delta = static_cast<int>(tau/dt);
std::vector<double> hist_x((static_cast<int>(tau/dt) - 1), 0.0);
std::vector<double> hist_y((static_cast<int>(tau/dt) - 1), 0.0);
std::vector<double> t_val;
std::vector<double> x_val;
std::vector<double> y_val;
double euler(double f, double di, double time_step){
f = f + time_step * di;
return f;
int main()
std::ofstream file_x;
std::ofstream file_y;
std::ofstream file_t;
for(int n = 0; n < 2; n++){
for(int j; j < Delta; j++){
dx = lam * x + omega * x;
dy = lam * y - omega * x;
x = euler(x, dx, dt);
y = euler(y, dy, dt);
t = t + dt;
hist_x.at(j) = x;
hist_y.at(j) = y;
for(int k = 0; k < Delta; k++){
dx = lam * x + omega * x - K * ( x - hist_x.at(k) );
dy = lam * y - omega * x - K * ( y - hist_y.at(k) );
x = euler(x, dx, dt);
y = euler(y, dy, dt);
t = t + dt;
hist_x.at(k) = x;
hist_y.at(k) = y;
file_x<<x_val.at(k + n * Delta)<<std::endl;
file_y<<y_val.at(k + n * Delta)<<std::endl;
file_t<<t_val.at(k + n * Delta)<<std::endl;
for(int j; j < Delta; j++){
You forgot to initialize j; you meant:
for (int j = 0; j < Delta; j++)
int Delta = static_cast<int>(tau/dt);
std::vector<double> hist_x((static_cast<int>(tau/dt) - 1), 0.0);
std::vector<double> hist_y((static_cast<int>(tau/dt) - 1), 0.0);
You index from 0 to Delta − 1, so the vectors need Delta elements, but you allocate one fewer. The corrected version:
std::vector<double> hist_x(Delta, 0.0);
std::vector<double> hist_y(Delta, 0.0);
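Both fixes can be seen together in a small sketch. The helper names `steps_per_delay` and `fill_history` and the simplified right-hand side `rate * x` are assumptions made for illustration; they are not taken from the question's code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Number of Euler steps that span one delay interval tau.
// std::lround is this sketch's choice: it avoids truncating a quotient
// such as 9999.999... down to 9999.
std::size_t steps_per_delay(double tau, double dt) {
    assert(tau > 0.0 && dt > 0.0);
    return static_cast<std::size_t>(std::lround(tau / dt));
}

// Hypothetical helper: fill a history buffer of exactly `delta` entries
// by integrating dx/dt = rate * x with explicit Euler.
std::vector<double> fill_history(double x0, double rate, double dt, std::size_t delta) {
    std::vector<double> hist(delta, 0.0);  // delta elements, valid indexes 0 .. delta-1
    double x = x0;
    for (std::size_t j = 0; j < delta; ++j) {  // j starts at 0 explicitly
        x = x + dt * rate * x;
        hist.at(j) = x;  // .at() throws std::out_of_range on a bad index
    }
    return hist;
}
```

Keeping the size and the loop bound tied to the same `delta` value means the two cannot drift apart again.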

How to remove floats or reduce actual file size of code function? (Arduino)

I am trying to shrink my Arduino code for a Gemma (which has 5310 bytes of memory) driving NeoPixels, so that I can fit more into the program.
Currently I am trying to remove the floats from, or otherwise reduce the size of, the code snippet below:
void gradient(Color c1, Color c2, float time) {
for (float i = 0; i < time; i += 0.001) {
Color result(0, 0, 0);
result.Red = c1.Red * (1 - (i / time)) + c2.Red * (i / time);
result.Green = c1.Green * (1 - (i / time)) + c2.Green * (i / time);
result.Blue = c1.Blue * (1 - (i / time)) + c2.Blue * (i / time);
for (uint8_t x = 0; x < 20; x++)pixels.setPixelColor(x, result.Red, result.Green, result.Blue);
I managed to reduce it by 30 bytes to:
void gradient(Color c1, Color c2, float time) {
float stepsize = 0.01; // Stepsize in seconds
float lambda;
int maxiter = (int) (time/ stepsize);
Color result(0, 0, 0);
for (int i = 0; i <= maxiter; i++) {
lambda = (float) i / maxiter;
result.Red = c1.Red * (1 - lambda) + c2.Red * (lambda);
result.Green = c1.Green * (1 - lambda) + c2.Green * (lambda);
result.Blue = c1.Blue * (1 - lambda) + c2.Blue * (lambda);
for (uint8_t x = 0; x < 20; x++)pixels.setPixelColor(x, result.Red, result.Green, result.Blue);
delay(stepsize * 1000); // delay in milliseconds
But am trying still to make it smaller.
For those wondering the Color object is just an object with 3 ints called Red, Green and Blue. An example usage of this code would be:
gradient(Color(255, 0, 0), Color(0, 255, 0), 2);
Which would be a gradient from Red to Green over 2 seconds.
Thanks in advance!
If you can remove delay() from all of your code, that seems to avoid pulling in about 100 bytes of library code. I am not certain why, but here is my suggested modification, which saved 100 bytes of memory in my testing:
void gradient(Color c1, Color c2, float time) {
float stepsize = 0.01; // Stepsize in seconds
float lambda;
int maxiter = (int) (time/ stepsize);
Color result(0, 0, 0);
for (int i = 0; i <= maxiter; i++) {
lambda = (float) i / maxiter;
result.Red = c1.Red * (1 - lambda) + c2.Red * (lambda);
result.Green = c1.Green * (1 - lambda) + c2.Green * (lambda);
result.Blue = c1.Blue * (1 - lambda) + c2.Blue * (lambda);
for (uint8_t x = 0; x < 20; x++)pixels.setPixelColor(x, result.Red, result.Green, result.Blue);
//delay(stepsize * 1000); // delay in milliseconds
long lastTime=millis();
long delayTime = stepsize * 1000;
- First, your Color object should hold three unsigned chars (0-255); there is no reason to put ints in there. (That is the byte type in Arduino.)
- Second, I am not sure how you implement time, but on Arduino you generally work in milliseconds. Without seeing the rest of your code, I am guessing that time is a duration, and based on your delay I am guessing that you could pass it as a short in milliseconds (multiply by 1000 if necessary). That would hold up to about 32 seconds.
void gradient(Color c1, Color c2, short time) {
    short maxiter = (short) (time / 10);
    Color result(0, 0, 0);
    for (int i = 0; i <= maxiter; i++) {
        result.Red = (c1.Red * (maxiter - i) + c2.Red * i) / maxiter;
        result.Green = (c1.Green * (maxiter - i) + c2.Green * i) / maxiter;
        result.Blue = (c1.Blue * (maxiter - i) + c2.Blue * i) / maxiter;
        for (uint8_t x = 0; x < 20; x++) pixels.setPixelColor(x, result.Red, result.Green, result.Blue);
        delay(10); // delay in milliseconds
    }
}
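One caveat about the integer version: a product such as c1.Red * (maxiter - i) is evaluated in int, which is 16 bits on AVR-based boards, so it would overflow once maxiter exceeds 128 (a fade longer than about 1.3 seconds). A hedged sketch of a per-channel blend that forms the products in 32 bits follows. The name `lerp8` is hypothetical, and the assert is only a desktop-side sanity check that can be dropped on the device.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: blend one 8-bit colour channel from `from` to `to`
// at step `i` of `steps`, using integer arithmetic only.
// The products are formed in 32 bits: 255 * steps does not fit in a
// 16-bit int once steps exceeds 128.
uint8_t lerp8(uint8_t from, uint8_t to, uint16_t i, uint16_t steps) {
    assert(steps > 0 && i <= steps);
    uint32_t mixed = static_cast<uint32_t>(from) * (steps - i)
                   + static_cast<uint32_t>(to) * i;
    return static_cast<uint8_t>(mixed / steps);  // truncating division
}
```

Inside the loop this would be used as `result.Red = lerp8(c1.Red, c2.Red, i, maxiter);`, and likewise for Green and Blue. Whether the 32-bit arithmetic costs more flash than it saves would have to be measured on the actual board.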


I'm trying to convert code written in CUDA to OpenCL and have run into some trouble. My final goal is to run the code on an Odroid XU3 board with a Mali T628 GPU.
To simplify the transition and save time debugging OpenCL kernels, I took the following steps:
1. Implement the code in CUDA and test it on an Nvidia GeForce 760
2. Implement the code in OpenCL and test it on an Nvidia GeForce 760
3. Test the OpenCL code on an Odroid XU3 board with a Mali T628 GPU
I know that different architectures may need different optimizations, but that isn't my main concern for now. I managed to run the OpenCL code on my Nvidia GPU with no apparent issues, but I keep getting strange errors when I run it on the Odroid board. I know that different architectures handle exceptions and the like differently, but I'm not sure how to solve this.
Since the OpenCL code works on my Nvidia GPU, I assume that I translated threads/blocks to work-items/work-groups correctly.
I already fixed several issues related to CL_DEVICE_MAX_WORK_GROUP_SIZE, so that can't be the cause.
When running the code I get a "CL_OUT_OF_RESOURCES" error. I've narrowed the cause down to two lines in the code, but I'm not sure how to fix them.
The error is caused by the following lines:
1. lowestDist[pixelNum] = partialDiffSumTemp; Both variables are private to the kernel, so I don't see any potential issue.
2. d_disparityLeft[globalMemIdx + TILE_BOUNDARY_WIDTH - WINDOW_RADIUS + 0] = bestDisparity[0];
Here I guess the cause is an out-of-bounds access, but I'm not sure how to debug it, since the original code doesn't have any issue.
My kernel code is:
#define MAX_DISPARITY 55
#define WINDOW_SIZE 19
//TODO fix input arguments
__kernel void hello_kernel( __global unsigned char* d_leftImage,
__global unsigned char* d_rightImage,
__global float* d_disparityLeft) {
int blockX = get_group_id(0);
int blockY = get_group_id(1);
int threadX = get_local_id(0);
int threadY = get_local_id(1);
__local unsigned char leftImage [TILE_SHARED_MEM_WIDTH * TILE_SHARED_MEM_HEIGHT];
__local unsigned char rightImage [TILE_SHARED_MEM_WIDTH * TILE_SHARED_MEM_HEIGHT];
__local unsigned int partialDiffSum [BLOCK_WIDTH * TILE_SHARED_MEM_HEIGHT];
int alignedImageWidth = 640;
int partialDiffSumTemp;
float bestDisparity[4] = {0,0,0,0};
int lowestDist[4];
lowestDist[0] = 214748364;
lowestDist[1] = 214748364;
lowestDist[2] = 214748364;
lowestDist[3] = 214748364;
// Read image blocks into shared memory. read is done at 32bit integers on a uchar array. each thread reads 3 integers(12byte) 96/12=8threads
int sharedMemIdx = threadY * TILE_SHARED_MEM_WIDTH + 4 * threadX;
int globalMemIdx = (blockY * BLOCK_HEIGHT + threadY) * alignedImageWidth + blockX * BLOCK_WIDTH + 4 * threadX;
for (int i = 0; i < 4; i++) {
leftImage [sharedMemIdx + i ] = d_leftImage [globalMemIdx + i];
leftImage [sharedMemIdx + 4 * THREAD_NUM_WIDTH + i ] = d_leftImage [globalMemIdx + 4 * THREAD_NUM_WIDTH + i];
leftImage [sharedMemIdx + 8 * THREAD_NUM_WIDTH + i ] = d_leftImage [globalMemIdx + 8 * THREAD_NUM_WIDTH + i];
rightImage[sharedMemIdx + i ] = d_rightImage[globalMemIdx + i];
rightImage[sharedMemIdx + 4 * THREAD_NUM_WIDTH + i ] = d_rightImage[globalMemIdx + 4 * THREAD_NUM_WIDTH + i];
rightImage[sharedMemIdx + 8 * THREAD_NUM_WIDTH + i ] = d_rightImage[globalMemIdx + 8 * THREAD_NUM_WIDTH + i];
int imageIdx = sharedMemIdx + TILE_BOUNDARY_WIDTH - WINDOW_RADIUS;
int partialSumIdx = threadY * BLOCK_WIDTH + 4 * threadX;
for(int dispLevel = MIN_DISPARITY; dispLevel <= MAX_DISPARITY; dispLevel++) {
// horizontal partial sum
partialDiffSumTemp = 0;
#pragma unroll
for(int i = imageIdx - WINDOW_RADIUS; i <= imageIdx + WINDOW_RADIUS; i++) {
//partialDiffSumTemp += calcDiff(leftImage [i], rightImage[i - dispLevel]);
partialDiffSumTemp += abs(leftImage[i] - rightImage[i - dispLevel]);
partialDiffSum[partialSumIdx] = partialDiffSumTemp;
for (int pixelNum = 1, i = imageIdx - WINDOW_RADIUS; pixelNum < NUM_PIXEL_PER_THREAD; pixelNum++, i++) {
partialDiffSum[partialSumIdx + pixelNum] = partialDiffSum[partialSumIdx + pixelNum - 1] +
abs(leftImage[i + WINDOW_SIZE] - rightImage[i - dispLevel + WINDOW_SIZE]) -
abs(leftImage[i] - rightImage[i - dispLevel]);
// vertical sum
for (int pixelNum = 0; pixelNum < NUM_PIXEL_PER_THREAD; pixelNum++) {
int rowIdx = partialSumIdx - WINDOW_RADIUS * BLOCK_WIDTH;
partialDiffSumTemp = 0;
for(int i = -WINDOW_RADIUS; i <= WINDOW_RADIUS; i++,rowIdx += BLOCK_WIDTH) {
partialDiffSumTemp += partialDiffSum[rowIdx + pixelNum];
if (partialDiffSumTemp < lowestDist[pixelNum]) {
lowestDist[pixelNum] = partialDiffSumTemp;
bestDisparity[pixelNum] = dispLevel - 1;
if (threadY >= WINDOW_RADIUS && threadY < TILE_SHARED_MEM_HEIGHT - WINDOW_RADIUS && blockY < 32) {
d_disparityLeft[globalMemIdx + TILE_BOUNDARY_WIDTH - WINDOW_RADIUS + 0] = bestDisparity[0];
d_disparityLeft[globalMemIdx + TILE_BOUNDARY_WIDTH - WINDOW_RADIUS + 1] = bestDisparity[1];
d_disparityLeft[globalMemIdx + TILE_BOUNDARY_WIDTH - WINDOW_RADIUS + 2] = bestDisparity[2];
d_disparityLeft[globalMemIdx + TILE_BOUNDARY_WIDTH - WINDOW_RADIUS + 3] = bestDisparity[3];
Thanks for all the help
In my experience, Nvidia GPUs do not always crash on an out-of-bounds access, and the kernel often still returns the expected results.
Use printf to check the indexes. If you have the Nvidia OpenCL 1.2 driver installed, printf should be available as a core function. As far as I can tell, the Mali-T628 uses OpenCL 1.1, so check whether printf is available there as a vendor extension. You can also run your kernel on an AMD or Intel CPU, where printf is available (OpenCL 1.2 / 2.0).
An alternative way of checking the indexes is to pass a __global int* debug array in which the kernel stores the indexes, and then check them on the host. Make sure to allocate it large enough that an out-of-bounds index still gets recorded.
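The host-side half of the debug-array approach could look roughly like this. It is a sketch that assumes the kernel has written one index per work-item into the debug buffer and that the host has already read the buffer back into a `std::vector<int>`. The function name `findOutOfBounds` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side check: `recorded` holds the indexes the kernel
// stored in its debug array (one slot per work-item). `bufferLen` is the
// element count of the buffer those indexes were used on.
// Returns the slots whose recorded index falls outside [0, bufferLen).
std::vector<std::size_t> findOutOfBounds(const std::vector<int>& recorded,
                                         std::size_t bufferLen) {
    assert(bufferLen > 0);  // a zero-length buffer would flag every slot
    std::vector<std::size_t> bad;
    for (std::size_t slot = 0; slot < recorded.size(); ++slot) {
        const int idx = recorded[slot];
        if (idx < 0 || static_cast<std::size_t>(idx) >= bufferLen) {
            bad.push_back(slot);
        }
    }
    return bad;
}
```

Each returned slot identifies a work-item whose index needs a closer look, which narrows down which block and thread coordinates produce the bad access.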

Dijkstra's algorithm in CUDA

I am having troubles with this piece of CUDA code I have written. This is supposed to be the CUDA implementation of the Dijkstra's algorithm. The code is as follows:
__global__ void cuda_dijkstra_kernel_1(float* Va, int* Ea, int* Sa, float* Ca, float* Ua, char* Ma, unsigned int* lock){
int tid = blockIdx.x;
Ma[tid] = '0';
int ind_Ea = Sa[tid * 2];
int num_edges = Sa[(tid * 2) + 1];
int v;
float wt = 0;
unsigned int leaveloop;
leaveloop = 0u;
if(atomicExch(lock, 1u) == 0u){
for(v = 0; v < num_edges; v++){
wt = (Va[tid * 3] - Va[Ea[ind_Ea + v] * 3]) * (Va[tid * 3] - Va[Ea[ind_Ea + v] * 3]) +
(Va[(tid * 3) + 1] - Va[(Ea[ind_Ea + v] * 3) + 1]) * (Va[(tid * 3) + 1] - Va[(Ea[ind_Ea + v] * 3) + 1]) +
(Va[(tid * 3) + 2] - Va[(Ea[ind_Ea + v] * 3) + 2]) * (Va[(tid * 3) + 2] - Va[(Ea[ind_Ea + v] * 3) + 2]) ;
wt = sqrt(wt);
if(Ca[Ea[ind_Ea + v]] > (Ca[tid] + wt)){
Ca[Ea[ind_Ea + v]] = Ca[tid] + wt;
Ma[Ea[ind_Ea + v]] = '1';
leaveloop = 1u;
atomicExch(lock, 0u);
The problem is in the relaxation phase of Dijkstra's algorithm, which I have implemented as a critical section. If a vertex (say a) is a neighbor of more than one vertex, then the threads for all of those vertices will try to write to a's entry in the cost array Ca. My goal is to have the smallest of those values end up in that location. To achieve that, I am trying to serialize the updates, and I also apply __threadfence() so that a value written by one thread is visible to the others and the smallest value is eventually retained. But this logic is not working: a's entry does not end up with the smallest of the values the threads try to write, and I don't understand why. Any help will be highly appreciated.
There is a "classical" (or at least the most frequently referenced) GPU implementation of Dijkstra's Single-Source Shortest Path (SSSP) algorithm, contained in the paper
Accelerating large graph algorithms on the GPU using CUDA by Pawan Harish and P.J. Narayanan
However, the implementation in that paper has been recognized to contain a bug; see
CUDA Solutions for the SSSP Problem by Pedro J. Martín, Roberto Torres, and Antonio Gavilanes
Below I report the implementation suggested in the first paper, fixed according to the remark in the second. The code also contains a C++ version.
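#include <sstream>
#include <vector>
#include <iostream>
#include <stdio.h>
#include <stdlib.h> // malloc, rand
#include <limits.h> // INT_MAX
#include <float.h>
#include "Utilities.cuh" // gpuErrchk and iDivUp are not defined in this listing and presumably come from here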
#define NUM_ASYNCHRONOUS_ITERATIONS 20 // Number of async loop iterations before attempting to read results back
#define BLOCK_SIZE 16
// --- The graph data structure is an adjacency list.
typedef struct {
// --- Contains the integer offset to point to the edge list for each vertex
int *vertexArray;
// --- Overall number of vertices
int numVertices;
// --- Contains the "destination" vertices each edge is attached to
int *edgeArray;
// --- Overall number of edges
int numEdges;
// --- Contains the weight of each edge
float *weightArray;
} GraphData;
void generateRandomGraph(GraphData *graph, int numVertices, int neighborsPerVertex) {
graph -> numVertices = numVertices;
graph -> vertexArray = (int *)malloc(graph -> numVertices * sizeof(int));
graph -> numEdges = numVertices * neighborsPerVertex;
graph -> edgeArray = (int *)malloc(graph -> numEdges * sizeof(int));
graph -> weightArray = (float *)malloc(graph -> numEdges * sizeof(float));
for (int i = 0; i < graph -> numVertices; i++) graph -> vertexArray[i] = i * neighborsPerVertex;
int *tempArray = (int *)malloc(neighborsPerVertex * sizeof(int));
for (int k = 0; k < numVertices; k++) {
for (int l = 0; l < neighborsPerVertex; l++) tempArray[l] = INT_MAX;
for (int l = 0; l < neighborsPerVertex; l++) {
bool goOn = false;
int temp;
while (goOn == false) {
goOn = true;
temp = (rand() % graph->numVertices);
for (int t = 0; t < neighborsPerVertex; t++)
if (temp == tempArray[t]) goOn = false;
if (temp == k) goOn = false;
if (goOn == true) tempArray[l] = temp;
graph -> edgeArray [k * neighborsPerVertex + l] = temp;
graph -> weightArray[k * neighborsPerVertex + l] = (float)(rand() % 1000) / 1000.0f;
/* minDistance FUNCTION */
// --- Finds the vertex with minimum distance value, from the set of vertices not yet included in shortest path tree
int minDistance(float *shortestDistances, bool *finalizedVertices, const int sourceVertex, const int N) {
// --- Initialize minimum value
int minIndex = sourceVertex;
float min = FLT_MAX;
for (int v = 0; v < N; v++)
if (finalizedVertices[v] == false && shortestDistances[v] <= min) min = shortestDistances[v], minIndex = v;
return minIndex;
/* dijkstraCPU FUNCTION */
void dijkstraCPU(float *graph, float *h_shortestDistances, int sourceVertex, const int N) {
// --- h_finalizedVertices[i] is true if vertex i is included in the shortest path tree
// or the shortest distance from the source node to i is finalized
bool *h_finalizedVertices = (bool *)malloc(N * sizeof(bool));
// --- Initialize h_shortestDistances as infinite and h_finalizedVertices as false
for (int i = 0; i < N; i++) h_shortestDistances[i] = FLT_MAX, h_finalizedVertices[i] = false;
// --- The distance of the source vertex from itself is always 0
h_shortestDistances[sourceVertex] = 0.f;
// --- Dijkstra iterations
for (int iterCount = 0; iterCount < N - 1; iterCount++) {
// --- Selecting the minimum distance vertex from the set of vertices not yet
// processed. currentVertex is always equal to sourceVertex in the first iteration.
int currentVertex = minDistance(h_shortestDistances, h_finalizedVertices, sourceVertex, N);
// --- Mark the current vertex as processed
h_finalizedVertices[currentVertex] = true;
// --- Relaxation loop
for (int v = 0; v < N; v++) {
// --- Update dist[v] only if it is not in h_finalizedVertices, there is an edge
// from u to v, and the cost of the path from the source vertex to v through
// currentVertex is smaller than the current value of h_shortestDistances[v]
if (!h_finalizedVertices[v] &&
graph[currentVertex * N + v] &&
h_shortestDistances[currentVertex] != FLT_MAX &&
h_shortestDistances[currentVertex] + graph[currentVertex * N + v] < h_shortestDistances[v])
h_shortestDistances[v] = h_shortestDistances[currentVertex] + graph[currentVertex * N + v];
// --- Check whether all the vertices have been finalized. This tells the algorithm whether it needs to continue running or not.
bool allFinalizedVertices(bool *finalizedVertices, int numVertices) {
for (int i = 0; i < numVertices; i++) if (finalizedVertices[i] == true) { return false; }
return true;
__global__ void initializeArrays(bool * __restrict__ d_finalizedVertices, float* __restrict__ d_shortestDistances, float* __restrict__ d_updatingShortestDistances,
const int sourceVertex, const int numVertices) {
int tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid < numVertices) {
if (sourceVertex == tid) {
d_finalizedVertices[tid] = true;
d_shortestDistances[tid] = 0.f;
d_updatingShortestDistances[tid] = 0.f; }
else {
d_finalizedVertices[tid] = false;
d_shortestDistances[tid] = FLT_MAX;
d_updatingShortestDistances[tid] = FLT_MAX;
__global__ void Kernel1(const int * __restrict__ vertexArray, const int* __restrict__ edgeArray,
const float * __restrict__ weightArray, bool * __restrict__ finalizedVertices, float* __restrict__ shortestDistances,
float * __restrict__ updatingShortestDistances, const int numVertices, const int numEdges) {
int tid = blockIdx.x*blockDim.x + threadIdx.x;
if (tid < numVertices) {
if (finalizedVertices[tid] == true) {
finalizedVertices[tid] = false;
int edgeStart = vertexArray[tid], edgeEnd;
if (tid + 1 < (numVertices)) edgeEnd = vertexArray[tid + 1];
else edgeEnd = numEdges;
for (int edge = edgeStart; edge < edgeEnd; edge++) {
int nid = edgeArray[edge];
atomicMin(&updatingShortestDistances[nid], shortestDistances[tid] + weightArray[edge]);
__global__ void Kernel2(const int * __restrict__ vertexArray, const int * __restrict__ edgeArray, const float* __restrict__ weightArray,
bool * __restrict__ finalizedVertices, float* __restrict__ shortestDistances, float* __restrict__ updatingShortestDistances,
const int numVertices) {
int tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid < numVertices) {
if (shortestDistances[tid] > updatingShortestDistances[tid]) {
shortestDistances[tid] = updatingShortestDistances[tid];
finalizedVertices[tid] = true; }
updatingShortestDistances[tid] = shortestDistances[tid];
/* dijkstraGPU FUNCTION */
void dijkstraGPU(GraphData *graph, const int sourceVertex, float * __restrict__ h_shortestDistances) {
// --- Create device-side adjacency-list, namely, vertex array Va, edge array Ea and weight array Wa from G(V,E,W)
int *d_vertexArray; gpuErrchk(cudaMalloc(&d_vertexArray, sizeof(int) * graph -> numVertices));
int *d_edgeArray; gpuErrchk(cudaMalloc(&d_edgeArray, sizeof(int) * graph -> numEdges));
float *d_weightArray; gpuErrchk(cudaMalloc(&d_weightArray, sizeof(float) * graph -> numEdges));
// --- Copy adjacency-list to the device
gpuErrchk(cudaMemcpy(d_vertexArray, graph -> vertexArray, sizeof(int) * graph -> numVertices, cudaMemcpyHostToDevice));
gpuErrchk(cudaMemcpy(d_edgeArray, graph -> edgeArray, sizeof(int) * graph -> numEdges, cudaMemcpyHostToDevice));
gpuErrchk(cudaMemcpy(d_weightArray, graph -> weightArray, sizeof(float) * graph -> numEdges, cudaMemcpyHostToDevice));
// --- Create mask array Ma, cost array Ca and updating cost array Ua of size V
bool *d_finalizedVertices; gpuErrchk(cudaMalloc(&d_finalizedVertices, sizeof(bool) * graph->numVertices));
float *d_shortestDistances; gpuErrchk(cudaMalloc(&d_shortestDistances, sizeof(float) * graph->numVertices));
float *d_updatingShortestDistances; gpuErrchk(cudaMalloc(&d_updatingShortestDistances, sizeof(float) * graph->numVertices));
bool *h_finalizedVertices = (bool *)malloc(sizeof(bool) * graph->numVertices);
// --- Initialize mask Ma to false, cost array Ca and updating cost array Ua to ∞
initializeArrays <<<iDivUp(graph->numVertices, BLOCK_SIZE), BLOCK_SIZE >>>(d_finalizedVertices, d_shortestDistances,
d_updatingShortestDistances, sourceVertex, graph -> numVertices);
// --- Read mask array from device -> host
gpuErrchk(cudaMemcpy(h_finalizedVertices, d_finalizedVertices, sizeof(bool) * graph->numVertices, cudaMemcpyDeviceToHost));
while (!allFinalizedVertices(h_finalizedVertices, graph->numVertices)) {
    // --- In order to improve performance, we run some number of iterations without reading the results. This might result
    // in running more iterations than necessary at times, but it will in most cases be faster because we are doing less
    // stalling of the GPU waiting for results.
    for (int asyncIter = 0; asyncIter < NUM_ASYNCHRONOUS_ITERATIONS; asyncIter++) {
        Kernel1 <<<iDivUp(graph->numVertices, BLOCK_SIZE), BLOCK_SIZE >>>(d_vertexArray, d_edgeArray, d_weightArray, d_finalizedVertices, d_shortestDistances,
            d_updatingShortestDistances, graph->numVertices, graph->numEdges);
        Kernel2 <<<iDivUp(graph->numVertices, BLOCK_SIZE), BLOCK_SIZE >>>(d_vertexArray, d_edgeArray, d_weightArray, d_finalizedVertices, d_shortestDistances, d_updatingShortestDistances,
            graph->numVertices);
    }
    // --- Read the mask array back so the host can decide whether to continue
    gpuErrchk(cudaMemcpy(h_finalizedVertices, d_finalizedVertices, sizeof(bool) * graph->numVertices, cudaMemcpyDeviceToHost));
}
// --- Copy the result to host
gpuErrchk(cudaMemcpy(h_shortestDistances, d_shortestDistances, sizeof(float) * graph->numVertices, cudaMemcpyDeviceToHost));
int main() {
// --- Number of graph vertices
int numVertices = 8;
// --- Number of edges per graph vertex
int neighborsPerVertex = 6;
// --- Source vertex
int sourceVertex = 0;
// --- Allocate memory for arrays
GraphData graph;
generateRandomGraph(&graph, numVertices, neighborsPerVertex);
// --- From adjacency list to adjacency matrix.
// Initializing the adjacency matrix
float *weightMatrix = (float *)malloc(numVertices * numVertices * sizeof(float));
for (int k = 0; k < numVertices * numVertices; k++) weightMatrix[k] = FLT_MAX;
// --- Displaying the adjacency list and constructing the adjacency matrix
printf("Adjacency list\n");
for (int k = 0; k < numVertices; k++) weightMatrix[k * numVertices + k] = 0.f;
for (int k = 0; k < numVertices; k++)
for (int l = 0; l < neighborsPerVertex; l++) {
weightMatrix[k * numVertices + graph.edgeArray[graph.vertexArray[k] + l]] = graph.weightArray[graph.vertexArray[k] + l];
printf("Vertex nr. %i; Edge nr. %i; Weight = %f\n", k, graph.edgeArray[graph.vertexArray[k] + l],
graph.weightArray[graph.vertexArray[k] + l]);
for (int k = 0; k < numVertices * neighborsPerVertex; k++)
printf("%i %i %f\n", k, graph.edgeArray[k], graph.weightArray[k]);
// --- Displaying the adjacency matrix
printf("\nAdjacency matrix\n");
for (int k = 0; k < numVertices; k++) {
for (int l = 0; l < numVertices; l++)
if (weightMatrix[k * numVertices + l] < FLT_MAX)
printf("%1.3f\t", weightMatrix[k * numVertices + l]);
// --- Running Dijkstra on the CPU
float *h_shortestDistancesCPU = (float *)malloc(numVertices * sizeof(float));
dijkstraCPU(weightMatrix, h_shortestDistancesCPU, sourceVertex, numVertices);
printf("\nCPU results\n");
for (int k = 0; k < numVertices; k++) printf("From vertex %i to vertex %i = %f\n", sourceVertex, k, h_shortestDistancesCPU[k]);
// --- Allocate space for the h_shortestDistancesGPU
float *h_shortestDistancesGPU = (float*)malloc(sizeof(float) * graph.numVertices);
dijkstraGPU(&graph, sourceVertex, h_shortestDistancesGPU);
printf("\nGPU results\n");
for (int k = 0; k < numVertices; k++) printf("From vertex %i to vertex %i = %f\n", sourceVertex, k, h_shortestDistancesGPU[k]);
return 0;
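The relaxation in Kernel1 replaces the question's global lock with a single atomic minimum per neighbor. CUDA's built-in atomicMin overloads cover integer types, so the float version called in the listing presumably comes from Utilities.cuh, which is not shown. As a hedged illustration of how such an operation is commonly built, here is a host-side C++ analogue using a compare-and-swap loop. The name `atomicMinFloat` is hypothetical, and this is not the code from that header.

```cpp
#include <atomic>
#include <cassert>

// Atomically lower `target` to `candidate` if candidate is smaller.
// Returns the value observed in `target` before any update by this call.
// Sketch only: a compare-and-swap loop is the usual way to build an atomic
// minimum when no native one exists for the type.
float atomicMinFloat(std::atomic<float>& target, float candidate) {
    assert(candidate == candidate);  // NaN would never compare smaller
    float observed = target.load();
    // Retry while our candidate would still improve the stored value.
    // On failure, compare_exchange_weak refreshes `observed` for us.
    while (candidate < observed &&
           !target.compare_exchange_weak(observed, candidate)) {
    }
    return observed;
}
```

Because every writer only ever lowers the stored value, the final content is the smallest candidate regardless of the order in which the threads run, which is the property the question was trying to obtain with the lock.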

opencl kernel implementing a simple mathematical formula

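What are the best practices to consider when implementing an error function defined as
$$\mathrm{err} = \sum_{i,j,k,l,m,n} \Big( \sum_{r} A_{r,ij}\, B_{r,kl}\, C_{r,mn} - \delta_{ni}\,\delta_{jk}\,\delta_{lm} \Big)^2$$
using an OpenCL kernel?
A, B and C are 3D float arrays and \delta is the Kronecker delta. The indices i, j, k, l, m, n each take N values and r takes M values.
Typical values are (N, M) = (2, 7) or (N, M) = (3, 23).
The naive implementation (given below) is several orders of magnitude slower than the CPU version.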
__kernel void cl_bilinear_alg(
__global float * A,
__global float * B,
__global float * C,
__global const int M,
__global const int N,
__global float * R)
int index = get_global_id(0);
int N2 = N * N;
int mat_offset = index * N2 * M;
float s1, s2, err = 0.0f;
for (int i = 0; i < N; ++i)
for (int j = 0; j < N; ++j)
for (int k = 0; k < N; ++k)
for (int l = 0; l < N; ++l)
for (int m = 0; m < N; ++m)
for (int n = 0; n < N; ++n)
s1 = (n == i) * (j == k) * (l == m);
s2 = 0;
for (int r = 0; r < M; ++r)
s2 += A[mat_offset + r * N2 + i * N + j] *
B[mat_offset + r * N2 + k * N + l] *
C[mat_offset + r * N2 + m * N + n];
err += (s2 - s1) * (s2 - s1);
R[index] = err;
The primary target is a Geforce GTX 570, though this could change in the future.
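For checking kernel output, a plain CPU reference of what one work-item computes can be handy. The sketch below mirrors the index arithmetic of the naive kernel. The function name `bilinearError` and the `offset` parameter (standing in for `mat_offset`) are assumptions of this example, not the asker's CPU version.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for one work-item of the naive kernel (sketch).
// A, B and C each hold M matrices of size N x N, laid out as [r][row][col]
// starting at `offset`, matching the kernel's index arithmetic.
float bilinearError(const std::vector<float>& A,
                    const std::vector<float>& B,
                    const std::vector<float>& C,
                    int N, int M, std::size_t offset = 0) {
    const std::size_t dim = static_cast<std::size_t>(N);
    const std::size_t dim2 = dim * dim;
    const std::size_t needed = offset + dim2 * static_cast<std::size_t>(M);
    assert(A.size() >= needed && B.size() >= needed && C.size() >= needed);

    float err = 0.0f;
    for (std::size_t i = 0; i < dim; ++i)
    for (std::size_t j = 0; j < dim; ++j)
    for (std::size_t k = 0; k < dim; ++k)
    for (std::size_t l = 0; l < dim; ++l)
    for (std::size_t m = 0; m < dim; ++m)
    for (std::size_t n = 0; n < dim; ++n) {
        // Product of the three Kronecker deltas.
        const float s1 = (n == i && j == k && l == m) ? 1.0f : 0.0f;
        // Sum over the M matrices.
        float s2 = 0.0f;
        for (std::size_t r = 0; r < static_cast<std::size_t>(M); ++r) {
            const std::size_t base = offset + r * dim2;
            s2 += A[base + i * dim + j] * B[base + k * dim + l] * C[base + m * dim + n];
        }
        err += (s2 - s1) * (s2 - s1);
    }
    return err;
}
```

Comparing `R[index]` from the device against `bilinearError(A, B, C, N, M, index * N * N * M)` on a few work-items gives a quick correctness check before any tuning. Small differences are to be expected from floating-point summation order.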
After vectorizing the code, moving bits to local memory, unrolling some loops and passing precomputed Kronecker products explicitly to the kernel the code looks as follows:
__kernel void cl_bilinear_alg(__global const float * A,
__global const float * B,
__global const float * C,
__global const int N,
__global const int M,
__global const float * kron,
__global float * R)
__private int index = get_global_id(0);
__private int cM = ceil(M / 4.0f);
__private int N2 = N*N;
__private int N4 = N2*N2;
__private int mat_offset = index * N2 * M;
__private float s1, s2, err = 0;
__private float4 vzero = (float4) (0.0f, 0.0f, 0.0f, 0.0f);
__local float4 va[54], vb[54], vc[54];
for (int ij = 0, k = 0; ij < N2; ++ij)
int r = 0;
for (; r < M / 4; r += 4, ++k)
int idx0 = mat_offset + N2 * r + ij;
int idx1 = mat_offset + N2 * (r + 1) + ij;
int idx2 = mat_offset + N2 * (r + 2) + ij;
int idx3 = mat_offset + N2 * (r + 3) + ij;
va[k] = (float4) (A[idx0], A[idx1], A[idx2], A[idx3]);
vb[k] = (float4) (B[idx0], B[idx1], B[idx2], B[idx3]);
vc[k] = (float4) (C[idx0], C[idx1], C[idx2], C[idx3]);
if (M % 4)
float buffa[4] = {0}, buffb[4] = {0}, buffc[4] = {0};
for (; r < M; ++r)
int idx = mat_offset + N2 * r + ij;
buffa[r % 4] = A[idx];
buffb[r % 4] = B[idx];
buffc[r % 4] = C[idx];
va[k] = vload4(0, buffa);
vb[k] = vload4(0, buffb);
vc[k++] = vload4(0, buffc);
for (int ij = 0; ij < N2; ++ij)
for (int kl = 0; kl < N2; ++kl)
for (int mn = 0; mn < N2; ++mn)
s1 = kron[ij * N4 + kl * N2 + mn];
s2 = 0;
for (int r = 0; r < cM; ++r)
s2 += dot(va[cM * ij + r], mad(vb[cM * kl + r], vc[cM * mn + r], vzero));
//the most expensive line
err += (s2 - s1) * (s2 - s1);
R[index] = err;
By applying these changes a 4x speed increase was observed compared to the naive implementation. Furthermore, it was revealed that the most expensive line of all is the error update, i.e.
err += (s2 - s1) * (s2 - s1);
Any suggestions?
Typically you'd want to break some of those loops up... a lot:
- the outer loops are split over multiple workgroups, each of which runs on its own compute unit (there are around 16 compute units per GPU, not many)
- the next few loops are split over the different threads within each workgroup
If you try to run all the calculations at the same time, they will all try to load their data at the same time, and this will simply thrash horribly. GPUs have very limited fast memory. Sure, the global memory sounds large enough at several gigabytes, but global GPU memory is slow. You want to get the data into local memory, which is per compute unit and on the order of 32-64KB, not much more than that.
You'd typically want to divide your task into very small tasks and do the following for each workgroup:
- load a chunk of memory from global memory into local memory (the whole workgroup of threads can participate in the copy, using coalesced access)
- do work on this memory, like doing some sums, and so on
- write the results back to global memory
- then either iterate a bit, or simply exit and leave other workgroups to handle the other bits of the work
On the CPU, the mathematical operations tend to be a major bottleneck, but on the GPU the cores are generally mostly spinning uselessly while they wait for data to gradually arrive from global memory. Whatever you can do to optimize this process, prevent conflicting demands, and so on, will make the kernel significantly faster.
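The load / work / write-back cycle described above can be illustrated with a CPU-side analogy. This is not OpenCL: the small `local` buffer merely stands in for __local memory, each loop iteration plays the role of one workgroup, and the sum of squares is an arbitrary placeholder for the real work. The name `tiledSumOfSquares` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// CPU-side analogy of the per-workgroup pattern (sketch, not OpenCL):
// stage a small tile in a fast buffer, work on it, write one result back.
// `tileSize` plays the role of the local-memory budget.
std::vector<float> tiledSumOfSquares(const std::vector<float>& global,
                                     std::size_t tileSize) {
    assert(tileSize > 0);
    std::vector<float> results;          // one partial result per "workgroup"
    std::vector<float> local(tileSize);  // stands in for __local memory
    for (std::size_t base = 0; base < global.size(); base += tileSize) {
        const std::size_t count = std::min(tileSize, global.size() - base);
        // 1. Load a contiguous chunk from "global" into "local".
        std::copy(global.data() + base, global.data() + base + count, local.data());
        // 2. Do the work on the local copy only.
        float partial = 0.0f;
        for (std::size_t t = 0; t < count; ++t) {
            partial += local[t] * local[t];
        }
        // 3. Write the result back.
        results.push_back(partial);
    }
    return results;
}
```

In a real kernel, step 1 would be spread across the work-items of the group and followed by a barrier before step 2, so that every work-item sees the complete tile.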
