kernel OpenCL not saving results? - opencl

Im making my first steps in OpenCL, and I got a problem with I think results aren't saved correctly. I first made a simple program, that is working, giving the correct results etc. So main is functioning correctly, arrays are filled the way they are supposed to etc.
Brief explanation of the program:
I have a grid of 4x4 points (OP array), in a field that is 40x40. There is a plane flying above this field, and it's route is divided into 100 segments (Segment array). Coord is a typedef struct, with currently only a double x and double y value. I know there are simpler solutions, but I have to use this later on, so I do it this way. The kernel is supposed to calculate the distance from a point (an OP) to a segment. The result has to be saved in the array Distances.
Description of the kernel: it gets is global id, from wich it calculates which OP and which segment it has to calculate with. With that info it gets the x and y from those 2, and calculates the x and y distance between each other. With this information it does an square root from the distances and the flightaltitude.
The problem is, that the Distances table only shows zero's :P
kernel: http://pastebin.com/U9hTWvv2 There is an parameter that says __global const Coord* Afstanden, this has to be __global double* Afstanden
OpenCL stuff in main: http://pastebin.com/H3mPwuUH (because Im not 100% sure Im doing that is right)
I can give you the whole program, but it's mostly dutch :P
I tried to explain everything as good as possible, if something is not clear, I'll try to clear it up ;)
As I made numerous little faults in the first kernel, here is his successor: http://pastebin.com/rs6T1nKs Also the OpenCL part is a bit updated: http://pastebin.com/4NbUQ40L because it also had one or two faults in it.
It should work now, in my opinion... But it isn't, the problem is still standing

Well, it seems that the "Afstanden" variable is declared as a "const Coord*". Remove the constness - i.e, "Coord* only - and be sure that when you use clCreateBuffer "CL_MEM_WRITE_ONLY" or "CL_MEM_READ_WRITE" as an argument.
"The problem is, that the Distances table only shows zero's :P"
You are probably initializing Afstanden on the CPU side with zeros. Am I correct? This would explain this behavior.

Related

Could I ask for physical analogies or metaphors for recursion?

I am suddenly in a recursive language class (sml) and recursion is not yet physically sensible for me. I'm thinking about the way a floor of square tiles is sometimes a model or metaphor for integer multiplication, or Cuisenaire Rods are a model or analogue for addition and subtraction. Does anyone have any such models you could share?
Imagine you're a real life magician, and can make a copy of yourself. You create your double a step closer to the goal and give him (or her) the same orders as you were given.
Your double does the same to his copy. He's a magician too, you see.
When the final copy finds itself created at the goal, it has nowhere more to go, so it reports back to its creator. Which does the same.
Eventually, you get your answer back – without having moved an inch – and can now create the final result from it, easily. You get to pretend not knowing about all those doubles doing the actual hard work for you. "Hmm," you're saying to yourself, "what if I were one step closer to the goal and already knew the result? Wouldn't it be easy to find the final answer then ?" (*)
Of course, if you were a double, you'd have to report your findings to your creator.
More here.
(also, I think I saw this "doubles" creation chain event here, though I'm not entirely sure).
(*) and that is the essence of the recursion method of problem solving.
How do I know my procedure is right? If my simple little combination step produces a valid solution, under assumption it produced the correct solution for the smaller case, all I need is to make sure it works for the smallest case – the base case – and then by induction the validity is proven!
Another possibility is divide-and-conquer, where we split our problem in two halves, so will get to the base case much much faster. As long as the combination step is simple (and preserves validity of solution of course), it works. In our magician metaphor, I get to create two copies of myself, and combine their two answers into one when they are finished. Each of them creates two copies of themselves as well, so this creates a branching tree of magicians, instead of a simple line as before.
A good example is the Sierpinski triangle which is a figure that is built from three quarter-sized Sierpinski triangles simply, by stacking them up at their corners.
Each of the three component triangles is built according to the same recipe.
Although it doesn't have the base case, and so the recursion is unbounded (bottomless; infinite), any finite representation of S.T. will presumably draw just a dot in place of the S.T. which is too small (serving as the base case, stopping the recursion).
There's a nice picture of it in the linked Wikipedia article.
Recursively drawing an S.T. without the size limit will never draw anything on screen! For mathematicians recursion may be great, engineers though should be more cautious about it. :)
Switching to corecursion ⁄ iteration (see the linked answer for that), we would first draw the outlines, and the interiors after that; so even without the size limit the picture would appear pretty quickly. The program would then be busy without any noticeable effect, but that's better than the empty screen.
I came across this piece from Edsger W. Dijkstra; he tells how his child grabbed recursions:
A few years later a five-year old son would show me how smoothly the idea of recursion comes to the unspoilt mind. Walking with me in the middle of town he suddenly remarked to me, Daddy, not every boat has a lifeboat, has it? I said How come? Well, the lifeboat could have a smaller lifeboat, but then that would be without one.
I love this question and couldn't resist to add an answer...
Recursion is the russian doll of programming. The first example that come to my mind is closer to an example of mutual recursion :
Mutual recursion everyday example
Mutual recursion is a particular case of recursion (but sometimes it's easier to understand from a particular case than from a generic one) when we have two function A and B defined like A calls B and B calls A. You can experiment this very easily using a webcam (it also works with 2 mirrors):
display the webcam output on your screen with VLC, or any software that can do it.
Point your webcam to the screen.
The screen will progressively display an infinite "vortex" of screen.
What happens ?
The webcam (A) capture the screen (B)
The screen display the image captured by the webcam (the screen itself).
The webcam capture the screen with a screen displayed on it.
The screen display that image (now there are two screens displayed)
And so on.
You finally end up with such an image (yes, my webcam is total crap):
"Simple" recursion is more or less the same except that there is only one actor (function) that calls itself (A calls A)
"Simple" Recursion
That's more or less the same answer as #WillNess but with a little code and some interactivity (using the js snippets of SO)
Let's say you are a very motivated gold-miner looking for gold, with a very tiny mine, so tiny that you can only look for gold vertically. And so you dig, and you check for gold. If you find some, you don't have to dig anymore, just take the gold and go. But if you don't, that means you have to dig deeper. So there are only two things that can stop you:
Finding some gold nugget.
The Earth's boiling kernel of melted iron.
So if you want to write this programmatically -using recursion-, that could be something like this :
// This function only generates a probability of 1/10
function checkForGold() {
let rnd = Math.round(Math.random() * 10);
return rnd === 1;
}
function digUntilYouFind() {
if (checkForGold()) {
return 1; // he found something, no need to dig deeper
}
// gold not found, digging deeper
return digUntilYouFind();
}
let gold = digUntilYouFind();
console.log(`${gold} nugget found`);
Or with a little more interactivity :
// This function only generates a probability of 1/10
function checkForGold() {
console.log("checking...");
let rnd = Math.round(Math.random() * 10);
return rnd === 1;
}
function digUntilYouFind() {
if (checkForGold()) {
console.log("OMG, I found something !")
return 1;
}
try {
console.log("digging...");
return digUntilYouFind();
} finally {
console.log("climbing back...");
}
}
let gold = digUntilYouFind();
console.log(`${gold} nugget found`);
If we don't find some gold, the digUntilYouFind function calls itself. When the miner "climbs back" from his mine it's actually the deepest child call to the function returning the gold nugget through all its parents (the call stack) until the value can be assigned to the gold variable.
Here the probability is high enough to avoid the miner to dig to the earth kernel. The earth kernel is to the miner what the stack size is to a program. When the miner comes to the kernel he dies in terrible pain, when the program exceed the stack size (causes a stack overflow), it crashes.
There are optimization that can be made by the compiler/interpreter to allow infinite level of recursion like tail-call optimization.
Take fractals as being recursive: the same pattern get applied each time, yet each figure differs from another.
As natural phenomena with fractal features, Wikipedia presents:
Moutain ranges
Frost crystals
DNA
and, even, proteins.
This is odd, and not quite a physical example except insofar as dance-movement is physical. It occurred to me the other morning. I call it "Written in Latin, solved in Hebrew." Huh? Surely you are saying "Huh?"
By it I mean that encoding a recursion is usually done left-to-right, in the Latin alphabet style: "Def fac(n) = n*(fac(n-1))." The movement style is "outermost case to base case."
But (please check me on this) at least in this simple case, it seems the easiest way to evaluate it is right-to-left, in the Hebrew alphabet style: Start from the base case and move outward to the outermost case:
(fac(0) = 1)
(fac(1) = 1)*(fac(0) = 1)
(fac(2))*(fac(1) = 1)*(fac(0) = 1)
(fac(n)*(fac(n-1)*...*(fac(2))*(fac(1) = 1)*(fac(0) = 1)
(* Easier order to calculate <<<<<<<<<<< is leftwards,
base outwards to outermost case;
more difficult order to calculate >>>>>> is rightwards,
outermost case to base *)
Then you do not have to suspend items on the left while awaiting the results of calculations further right. "Dance Leftwards" instead of "Dance rightwards"?

Tile Based A* Pathfinding, but with a bomb

I've written a simple A* path finding algorithm to quickly find a way through a tile based dungeon in which the tiles contain the information of walls.
An example of a dungeon (only 1 path for simplicity):
However now I'd like to add a variable amount of "Bombs" to the algorithm which would allow the path-finding to ignore 1 wall. However now it doesn't find the best paths anymore,
for example with use of only 1 bomb the generated path looks like the first image here:
Edit: actually it would look like this: https://i.stack.imgur.com/kPoAA.png
While the correct path would be the second image
The problem is that "Closed Nodes" now interfere with possible paths. Any ideas of how to tackle this problem would be greatly appreciated!
Your "game state" will no longer only be defined by your location, but also by an integer representing the number of bombs you have left. If you're following the pseudocode of A* on wikipedia, this means you cannot simply implement the closedSet as a grid of booleans. It should probably be implemented as, for example, a hash map / hash set, where every entry holds the following data:
x coordinate
y coordinate
number of bombs left
By visiting a certain position in the search process, you'll no longer mark just that position as closed. You'll mark the combination of position + number of bombs left as closed. That way, if later on in the same search process you run into a position where you're at the same location, but have more bombs left, you will not ignore it as closed but will actually continue searching that possibility.
Note that, if the maximum possible number of bombs is relatively low, you could also implement the closedSet as an array of boolean grids, where you first index by number of bombs, then by x and y coordinates to find out if a specific position is closed or not.
Doesn't this mean that you just pretend to not have any walls at all?
Use A* to find the shortest path from start to end and then check how many walls you'd have to go through. If you have enough bombs, you can use the path. Otherwise, try the next longest path and so on.
By the way: you might want to check http://gamedev.stackexchange.com for questions like this one.
You need to tweak the cost function to cost something for a bomb, then run the algorithm normally with an infinite cost for a second bomb. To get the bomb approximately halfway, play about with cost function, it should probably cost about the heuristic A-B distance times the cost for an empty tile. If you have two bombs, half the cost and of course then use of three bombs costs infinite.
But don't expect very good results. A* isn't designed for that kind of optimisation.

Is it ok to create big array of AVX/SSE values

I am parallelizing a certain dynamic programming problem using AVX2/SSE instructions.
In the main iteration of my calculation, I calculate column in matrix where each cell is a structure of AVX2 registers (_m256i). I use values from the previous matrix column as input values for calculating the current column. Columns can be big, so what I do is I have an array of structures (on stack), where each structure has two _m256i elements.
Structure:
struct Cell {
_m256i first;
_m256i second;
};
An then I have array like this: Cell prevColumn [N]. N will tipically be few hundreds.
I know that _m256i basically represents an avx2 register, so I am wondering how should I think about this array, how does it behave, since N is much larger than 16 (which is number of avx registers)? Is it a good practice to create such an array, or is there some better approach that i should use when storing a lot of _m256i values that are going to be reused real soon?
Also, is there any aligning I should be doing with this structures? I read a lot about aligning, but I am still not sure how and when to do it exactly.
It's better to structure your code to do everything it can with a value before moving on. Small buffers that fit in L1 cache aren't going to be too bad for performance, but don't do that unless you need to.
I think it's more typical to write your code with buffers of int [] type, rather than __m256i type, but I'm not sure. Either way works, and should get the compile to generate efficient code. But the int [] way means less code has to be different for the SSE, AVX2, and AVX512 version. And it might make it easier to examine things with a debugger, to have your data in an array with a type that will get the data formatted nicely.
As I understand it, the load/store intrinsics are partly there as a cast between _m256i and int [], since AVX doesn't fault on unaligned, just slows down on cacheline boundaries. Assigning to / from an array of _m256i should work fine, and generate load/store instructions where needed, otherwise generate vector instructions with memory source operands. (for more compact code and fewer fused-domain uops.)

CUDA kernel's vectors' length based on threadIdx

This is part of the pseudo code I am implementing in CUDA as part of an image reconstruction algorithm:
for each xbin(0->detectorXDim/2-1):
for each ybin(0->detectorYDim-1):
rayInit=(xbin*xBinSize+0.5,ybin*xBinSize+0.5,-detectordistance)
rayEnd=beamFocusCoord
slopeVector=rayEnd-rayInit
//knowing that r=rayInit+t*slopeVector;
//x=rayInit[0]+t*slopeVector[0]
//y=rayInit[1]+t*slopeVector[1]
//z=rayInit[2]+t*slopeVector[2]
//to find ray xx intersections:
for each xinteger(xbin+1->detectorXDim/2):
solve t for x=xinteger*xBinSize;
find corresponding y and z
add to intersections array
//find ray yy intersections(analogous to xx intersections)
//find ray zz intersections(analogous to xx intersections)
So far, this is what I have come up with:
__global__ void sysmat(int xfocus,int yfocus, int zfocus, int xbin,int xbinsize,int ybin,int ybinsize, int zbin, int projecoes){
int tx=threadIdx.x, ty=threadIdx.y,tz=threadIdx.z, bx=blockIdx.x, by=blockIdx.y,i,x,y,z;
int idx=ty+by*blocksize;
int idy=tx+bx*blocksize;
int slopeVectorx=xfocus-idx*xbinsize+0.5;
int slopeVectory=yfocus-idy*ybinsize+0.5;
int slopeVectorz=zfocus-zdetector;
__syncthreads();
//points where the ray intersects x axis
int xint=idx+1;
int yint=idy+1;
int*intersectionsx[(detectorXDim/2-xint)+(detectorYDim-yint)+(zfocus)];
int*intersectionsy[(detectorXDim/2-xint)+(detectorYDim-yint)+(zfocus)];
int*intersectionsz[(detectorXDim/2-xint)+(detectorYDim-yint)+(zfocus)];
for(xint=xint; xint<detectorXDim/2;xint++){
x=xint*xbinsize;
t=(x-idx)/slopeVectorx;
y=idy+t*slopeVectory;
z=z+t*slopeVectorz;
intersectionsx[xint-1]=x;
intersectionsy[xint-1]=y;
intersectionsz[xint-1]=z;
__syncthreads();
}
...
}
This is just a piece of the code. I know that there might be some errors(you can point them if they are blatantly wrong) but what I am more concerned is this:
Each thread(which corresponds to a detector bin) needs three arrays so it can save the points where the ray(which passes through this thread/bin) intersects multiples of the x,y and z axis. Each array's length depend on the place of the thread/bin(it's index) in the detector and on the beamFocusCoord(which are fixed). In order to do this I wrote this piece of code, which I am certain can not be done(confirmed it with a small test kernel and it returns the error: "expression must have constant value"):
int*intersectionsx[(detectorXDim/2-xint)+(detectorXDim-yint)+(zfocus)];
int*intersectionsy[(detectorXDim/2-xint)+(detectorXDim-yint)+(zfocus)];
int*intersectionsz[(detectorXDim/2-xint)+(detectorXDim-yint)+(zfocus)];
So in the end, I want to know if there is an alternative to this piece of code, where a vector's length depends on the index of the thread allocating that vector.
Thank you in advance ;)
EDIT: Given that each thread will have to save an array with the coordinates of the intersections between the ray(that goes from the beam source to the detector) and the xx,yy and zz axis, and that the spacial dimensions are around(I dont have the exact numbers at the moment, but they are very close to the real value) 1400x3600x60, is this problem feasible with CUDA?
For example, the thread (0,0) will have 1400 intersections in the x axis, 3600 in the y axis and 60 in the z axis, meaning that I will have to create an array of size (1400+3600+60)*sizeof(float) which is around 20kb per thread.
So given that each thread surpasses the 16kb local memory, that is out of the question. The other alternative was to allocate those arrays but, with some more math, we get (1400+3600+60)*4*numberofthreads(i.e. 1400*3600), which also surpasses the ammount of global memory available :(
So I am running out of ideas to deal with this problem and any help is appreciated.
No.
Every piece of memory in CUDA must be known at kernel-launch time. You can't allocate/deallocate/change anything while the kernel is running. This is true for global memory, shared memory and registers.
The common workaround is the allocate the maximum size of memory needed beforehand. This can be as simple as allocating the maximum size needed for one thread thread-multiple times or as complex as adding up all those thread-needed sizes for a total maximum and calculating appropriate thread-offsets into that array. That's a tradeoff between memory allocation and offset-computation time.
Go for the simple solution if you can and for the complex if you have to, due to memory limitations.
Why are you not using textures? Using a 2D or 3D texture would make this problem much easier. The GPU is designed to do very fast floating point interpolation, and CUDA includes excellent support for it. The literature has examples of projection reconstruction on the GPU, e.g. Accelerating simultaneous algebraic reconstruction technique with motion compensation using CUDA-enabled GPU, and textures are an integral part of their algorithms. Your own manual coordinates calculations can only be slower and more error prone than what the GPU provides, unless you need something weird like sinc interpolation.
1400x3600x60 is a little big for a single 3D texture, but you could break your problem up into 2D slices, 3D sub-volumes, or hierarchical multi-resolution reconstruction. These have all been used by other researchers. Just search PubMed.

Strange CUDA behavior in vector multiplication program

I'm having some trouble with a very basic CUDA program. I have a program that multiplies two vectors on the Host and on the Device and then compares them. This works without a problem. What's wrong is that I'm trying to test different number of threads and blocks for learning purposes. I have the following kernel:
__global__ void multiplyVectorsCUDA(float *a,float *b, float *c, int N){
int idx = threadIdx.x;
if (idx<N)
c[idx] = a[idx]*b[idx];
}
which I call like:
multiplyVectorsCUDA <<<nBlocks, nThreads>>> (vector_a_d,vector_b_d,vector_c_d,N);
For the moment I've fixed nBLocks to 1 so I only vary the vector size N and the number of threads nThreads. From what I understand, there will be a thread for each multiplication so N and nThreads should be equal.
The problem is the following
I first call the kernel with N=16 and nThreads<16 which doesn't work. (This is ok)
Then I call it with N=16 and nThreads=16 which works fine. (Again
works as expected)
But when I call it with N=16 and nThreads<16 it still works!
I don't understand why the last step doesn't fail like the first one. It only fails again if I restart my PC.
Has anyone run into something like this before or can explain this behavior?
Wait, so are you calling all three in a row? I don't know the rest of your code, but are you sure you're clearing out the graphics memory you alloced between each run? If not, that could explain why it doesn't work the first time but does the third time when you're passing the same values, and why it only works again after rebooting (rebooting clears all the memory alloced).
Don't know if its ok to answer my own question but I realized I had a bug in my code when comparing the host and device vectors (that part of the code wasn't posted). Sorry for the inconvenience. Could someone please close this post since it won't let me delete it?

Resources