Processing 3 improving intensive math calculation

I wrote a very simple sketch to simulate the interference of two planar waves.
The problem seems to be a little too intensive for the CPU (moreover, Processing uses only one core), and I get only 1 or 2 fps.
Any idea how to improve this sketch?
float x0;
float y0;
float x1;
float y1;
float x2;
float y2;
int t = 0;
void setup() {
//noLoop();
frameRate(30);
size(400, 400, P2D);
x0 = width/2;
y0 = height/2;
x1 = width/4;
y1 = height/2;
x2 = width * 3/4;
y2 = height / 2;
}
void draw() {
background(0);
for (int x = 0; x <= width; x++) {
for (int y = 0; y <= height; y++) {
float d1 = dist(x1, y1, x, y);
float d2 = dist(x2, y2, x, y);
float factorA = 20;
float factorB = 80;
float wave1 = (1 + (sin(TWO_PI * d1/factorA + t)))/2 * exp(-d1/factorB);
float wave2 = (1 + (sin(TWO_PI * d2/factorA + t)))/2 * exp(-d2/factorB);
stroke( (wave1 + wave2) *255);
point(x, y);
}
}
t--; //Wave propagation
//saveFrame("wave-##.png");
}

As Kevin suggested, using point() isn't the most efficient method, since each call goes through beginShape(), vertex() and endShape(). You might be better off writing to pixels[] directly.
Additionally, the nested loops can be written as a single loop, and dist(), which uses a square root behind the scenes, can be avoided (you can use the squared distance with larger factors).
Here's a version using these:
float x1;
float y1;
float x2;
float y2;
int t = 0;
//using larger factors so the squared distance can be used below instead of dist()/sqrt()
float factorA = 20*200;
float factorB = 80*200;
void setup() {
//noLoop();
frameRate(30);
size(400, 400);
x1 = width/4;
y1 = height/2;
x2 = width * 3/4;
y2 = height / 2;
//use pixels, not points()
loadPixels();
}
void draw() {
for (int i = 0; i < pixels.length; i++) {
int x = i % width;
int y = i / width; // row index: divide by the row length (width), not the height
float dx1 = x1-x;
float dy1 = y1-y;
float dx2 = x2-x;
float dy2 = y2-y;
//squared distance
float d1 = dx1*dx1+dy1*dy1;//dist(x1, y1, x, y);
float d2 = dx2*dx2+dy2*dy2;//dist(x2, y2, x, y);
float wave1 = (1 + (sin(TWO_PI * d1/factorA + t))) * 0.5 * exp(-d1/factorB);
float wave2 = (1 + (sin(TWO_PI * d2/factorA + t))) * 0.5 * exp(-d2/factorB);
pixels[i] = color((wave1 + wave2) *255);
}
updatePixels();
text((int)frameRate+"fps",10,15);
// endShape();
t--; //Wave propagation
//saveFrame("wave-##.png");
}
This can be sped up further by using lookup tables for the more time-consuming functions such as sin() and exp().
Here is a rough preview (the numbers need to be tweaked) that runs even in JavaScript:
var x1;
var y1;
var x2;
var y2;
var t = 0;
var factorA = 20*200;
var factorB = 80*200;
var numPixels;
var scaledWidth;
function setup() {
createCanvas(400, 400);
fill(255);
frameRate(30);
x1 = width /4;
y1 = height /2;
x2 = width * 3/4;
y2 = height / 2;
loadPixels();
numPixels = (width * pixelDensity()) * (height * pixelDensity()); // pixels[] covers width*d by height*d physical pixels
scaledWidth = width * pixelDensity();
}
function draw() {
for (var i = 0, j = 0; i < numPixels; i++, j += 4) {
var x = i % scaledWidth;
var y = floor(i / scaledWidth);
var dx1 = x1 - x;
var dy1 = y1 - y;
var dx2 = x2 - x;
var dy2 = y2 - y;
var d1 = (dx1 * dx1) + (dy1 * dy1);//dist(x1, y1, x, y);
var d2 = (dx2 * dx2) + (dy2 * dy2);//dist(x2, y2, x, y);
var wave1 = (1 + (sin(TWO_PI * d1 / factorA + t))) * 0.5 * exp(-d1 / factorB);
var wave2 = (1 + (sin(TWO_PI * d2 / factorA + t))) * 0.5 * exp(-d2 / factorB);
var gray = (wave1 + wave2) * 255;
pixels[j] = pixels[j+1] = pixels[j+2] = gray;
pixels[j+3] = 255;
}
updatePixels();
text(frameRate().toFixed(2)+"fps",10,15);
t--; //Wave propagation
}
<script src="https://cdnjs.cloudflare.com/ajax/libs/p5.js/1.0.0/p5.min.js"></script>
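The lookup-table idea can be sketched in plain Java, which also pastes into a Processing sketch. This is only a rough illustration, not a tuned implementation: the class name WaveTables, the table size of 4096 and the helpers fastSin() and buildDecay() are all made up for this example. It assumes the squared-distance variant above, where the distance is an integer and can index the exp() table directly.

```java
// Hypothetical helper: lookup tables standing in for sin() and exp().
final class WaveTables {
    static final int SIN_SIZE = 4096; // power of two, so "& (SIN_SIZE - 1)" wraps the index
    static final float TWO_PI = (float) (2.0 * Math.PI);
    static final float[] SIN = new float[SIN_SIZE];

    static {
        // Fill one full period of the sine once, up front.
        for (int i = 0; i < SIN_SIZE; i++) {
            SIN[i] = (float) Math.sin(i * 2.0 * Math.PI / SIN_SIZE);
        }
    }

    // Approximates sin(angle) by rounding to the nearest table entry.
    // Negative angles wrap correctly because SIN_SIZE is a power of two.
    static float fastSin(float angle) {
        int index = Math.round(angle * (SIN_SIZE / TWO_PI)) & (SIN_SIZE - 1);
        return SIN[index];
    }

    // Precomputes exp(-d / factorB) for every integer squared distance d in [0, maxD].
    static float[] buildDecay(int maxD, float factorB) {
        float[] decay = new float[maxD + 1];
        for (int d = 0; d <= maxD; d++) {
            decay[d] = (float) Math.exp(-d / factorB);
        }
        return decay;
    }

    public static void main(String[] args) {
        // Largest squared distance inside a 400x400 sketch is 400*400 + 400*400.
        float[] decay = buildDecay(2 * 400 * 400, 80 * 200);
        int d = 100 * 100 + 50 * 50; // squared distance of some pixel from a source
        float t = -12;               // frame counter, as in the sketch
        float wave = (1 + fastSin(TWO_PI * d / (20 * 200) + t)) * 0.5f * decay[d];
        System.out.println(wave);
    }
}
```

The decay table only has to be rebuilt when factorB changes, so in a sketch it would be built once in setup(). Whether the sine table beats the built-in sin() depends on the machine, so it is worth measuring both.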
Because you're using math to synthesise the image, it may make more sense to write this as a GLSL shader. Be sure to check out the PShader tutorial for more info.
Update:
Here's a GLSL version; the code is less hacky and a lot more readable:
float t = 0;
float factorA = 0.20;
float factorB = 0.80;
PShader waves;
void setup() {
size(400, 400, P2D);
noStroke();
waves = loadShader("waves.glsl");
waves.set("resolution", float(width), float(height));
waves.set("factorA",factorA);
waves.set("factorB",factorB);
waves.set("pt1",-0.5,0.0);
waves.set("pt2",0.75,0.0);
}
void draw() {
t++;
waves.set("t",t);
shader(waves);
rect(0, 0, width, height);
}
void mouseDragged(){
float x = map(mouseX,0,width,-1.0,1.0);
float y = map(mouseY,0,height,1.0,-1.0);
println(x,y);
if(keyPressed) waves.set("pt2",x,y);
else waves.set("pt1",x,y);
}
void keyPressed(){
float amount = 0.05;
if(keyCode == UP) factorA += amount;
if(keyCode == DOWN) factorA -= amount;
if(keyCode == LEFT) factorB -= amount;
if(keyCode == RIGHT) factorB += amount;
waves.set("factorA",factorA);
waves.set("factorB",factorB);
println(factorA,factorB);
}
And the waves.glsl:
#define PROCESSING_COLOR_SHADER
uniform vec2 pt1;
uniform vec2 pt2;
uniform float t;
uniform float factorA;
uniform float factorB;
const float TWO_PI = 6.283185307179586;
uniform vec2 resolution;
uniform float time;
void main(void) {
vec2 p = -1.0 + 2.0 * gl_FragCoord.xy / resolution.xy;
float d1 = distance(pt1,p);
float d2 = distance(pt2,p);
float wave1 = (1.0 + (sin(TWO_PI * d1/factorA + t))) * 0.5 * exp(-d1/factorB);
float wave2 = (1.0 + (sin(TWO_PI * d2/factorA + t))) * 0.5 * exp(-d2/factorB);
float gray = wave1 + wave2;
gl_FragColor=vec4(gray,gray,gray,1.0);
}
Drag the mouse to move the first point; hold any key while dragging to move the second point.
Additionally, use the UP/DOWN and LEFT/RIGHT keys to change factorA and factorB. The results look interesting.
Also, you can grab a bit of code from this answer to save frames using threads (I recommend saving uncompressed).

Option 1: Pre-render your sketch.
This seems to be a static repeating pattern, so you can pre-render it by running the animation ahead of time and saving each frame to an image. I see that you already had a call to saveFrame() in there. Once you have the images saved, you can then load them into a new sketch and play them one frame at a time. It shouldn't require very many images, since it seems to repeat itself pretty quickly. Think of an animated gif that loops forever.
Option 2: Decrease the resolution of your sketch.
Do you really need pixel-perfect 400x400 resolution? Can you maybe draw to an image that's 100x100 and scale up?
Or you could decrease the resolution of your for loops by incrementing by more than 1:
for (int x = 0; x <= width; x+=2) {
for (int y = 0; y <= height; y+=2) {
You could play with how much you increase and then use the strokeWeight() or rect() function to draw larger pixels.
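As a rough illustration of the "draw small, scale up" idea, here is a plain-Java sketch of a nearest-neighbour upscale. The class LowResUpscale and the method upscale() are hypothetical names for this example. In a real Processing sketch you would more likely render into a small PImage or PGraphics and let image() stretch it, so treat this only as a picture of what the scaling does.

```java
// Hypothetical helper: compute few pixels, then blow them up.
final class LowResUpscale {
    // Nearest-neighbour upscale: each source pixel becomes a factor-by-factor block.
    static int[] upscale(int[] src, int srcW, int srcH, int factor) {
        int dstW = srcW * factor;
        int dstH = srcH * factor;
        int[] dst = new int[dstW * dstH];
        for (int y = 0; y < dstH; y++) {
            int srcRow = (y / factor) * srcW; // start of the matching source row
            for (int x = 0; x < dstW; x++) {
                dst[y * dstW + x] = src[srcRow + x / factor];
            }
        }
        return dst;
    }

    public static void main(String[] args) {
        // Only 100*100 = 10,000 expensive evaluations instead of 160,000.
        int[] small = new int[100 * 100];
        for (int i = 0; i < small.length; i++) {
            small[i] = i % 256; // placeholder for the real wave computation
        }
        int[] big = upscale(small, 100, 100, 4);
        System.out.println(big.length); // 160000
    }
}
```

The expensive sin()/exp() work then runs on one sixteenth of the pixels, at the cost of a blockier image.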
Option 3: Decrease the time resolution of your sketch.
Instead of moving by 1 pixel every 1 frame, what if you move by 5 pixels every 5 frames? Speed your animation up, but only move it every X frames, that way the overall speed appears to be the same. You can use the modulo operator along with the frameCount variable to only do something every X frames. Note that you'd still want to keep the overall framerate of your sketch to 30 or 60, but you'd only change the animation every X frames.
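The modulo trick might look something like the following plain-Java simulation. FrameSkip and advance() are made-up names, and the loop variable frameCount only imitates Processing's built-in counter. The point is that stepping by `every` once every `every` frames ends up at the same t as stepping by 1 on each frame.

```java
// Hypothetical simulation of "update only every X frames".
final class FrameSkip {
    // Returns t after `frames` frames when the animation is advanced by
    // `every` steps once every `every` frames.
    static int advance(int frames, int every) {
        int t = 0;
        for (int frameCount = 1; frameCount <= frames; frameCount++) {
            if (frameCount % every == 0) {
                t -= every; // the expensive pixel recomputation would happen only here
            }
            // on all other frames the previously rendered image is simply shown again
        }
        return t;
    }

    public static void main(String[] args) {
        System.out.println(advance(30, 1)); // -30
        System.out.println(advance(30, 5)); // -30
    }
}
```

With every = 5 the heavy loop runs on 6 of 30 frames instead of all 30, while the wave still travels the same distance per second.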
Option 4: Simplify your animation.
Do you really need to calculate every single pixel? If all you want to show is a series of circles that increase in size, there are much easier ways to do that. Calling the ellipse() function is much faster than calling the point() function a bunch of times. You can use other functions to create the blurry effect without calling point() half a million times every second (which is how often you're trying to call it).
Option 5: Refactor your code.
If all else fails, then you're going to have to refactor your code. Most of your program's time is being spent in the point() function; you can prove this by drawing an ellipse at mouseX, mouseY at the end of the draw() function and comparing the performance with and without the call to point() inside your nested for loops.
Computers aren't magic, so calling the point() function half a million times every second isn't free. You're going to have to decrease that number somehow, either by taking one (or more than one) of the above options, or by refactoring your code in some other way.
How you do that really depends on your actual goals, which you haven't stated. If you're just trying to render this animation, then pre-rendering it will work fine. If you need to have user interaction with it, then maybe something like decreasing the resolution will work. You're going to have to sacrifice something, and it's really up to you what that is.

Related

How to apply an effect to an image using the mouse coordinates?

I coded a program in Processing where the pixels on the screen are scrambled, but only around the cursor. The code works by replacing each pixel with a random pixel between index 0 and the pixel the loop is currently on. To find that pixel, I used the code (y*width+x)-1. This code, however, takes pixels from the entire screen. I want the code to instead take the pixels from a 40-pixel square around the mouse coordinates. How can I do this?
import processing.video.*;
Capture video;
void setup() {
size(640, 480);
video = new Capture(this, 640, 480);
video.start();
}
void draw() {
loadPixels();
if (video.available()){
video.read();
video.loadPixels();
for (int y = 0; y < height; y++) {
for (int x = 0; x < width; x++) {
pixels[y*width+x] = video.pixels[y*video.width+(width-x-1)];
// the code should only be applied 20 pixels around the mouse
if (dist(mouseX, mouseY, x, y) < 20){
int d = int(random(0, y*width+x-1));
pixels[y*width+x] = video.pixels[d];
}
}
}
}
updatePixels();
}

You don't need to iterate through all the pixels to only change a few.
Luckily your sketch is the same size as the webcam feed, so you're on the right track using the x + (y * width) arithmetic to convert from 2D pixel coordinates to a 1D pixels[] index. Remember that you're currently sampling from a 1D range (a random index between 0 and the current one). Even if you update the start/end index, that range will still span several full image rows, which means it includes pixels to the left and right of the effect selection. I recommend picking random x, y coordinates in 2D, then converting them to a 1D index (as opposed to picking a single random index from the 1D array).
Here's what I mean:
import processing.video.*;
Capture video;
void setup() {
size(640, 480);
video = new Capture(this, 640, 480);
video.start();
}
void draw() {
loadPixels();
if (video.available()) {
video.read();
video.loadPixels();
//for (int y = 0; y < height; y++) {
// for (int x = 0; x < width; x++) {
// pixels[y*width+x] = video.pixels[y*video.width+(width-x-1)];
// // the code should only be applied 20 pixels around the mouse
// if (dist(mouseX, mouseY, x, y) < 20) {
// int d = int(random(0, y*width+x-1));
// pixels[y*width+x] = video.pixels[d];
// }
// }
//}
// mouse x, y shorthand
int mx = mouseX;
int my = mouseY;
// random pixels effect size
int size = 40;
// half of size
int hsize = size / 2;
// 2D pixel coordinates of the effect's bounding box
int minX = mx - hsize;
int maxX = mx + hsize;
int minY = my - hsize;
int maxY = my + hsize;
// apply the effect only where the bounding can be applied
// e.g. avoid a border (of hsize) around edges of the image
if (mx >= hsize && mx < width - hsize &&
my >= hsize && my < height - hsize) {
for(int y = minY; y < maxY; y++){
for(int x = minX; x < maxX; x++){
// pick random x,y coordinates to sample a pixel from
int rx = (int)random(minX, maxX);
int ry = (int)random(minY, maxY);
// convert the 2D random coordinates to a 1D pixel[] index
int ri = rx + (ry * width);
// replace current pixel with randomly sampled pixel (within effect bbox)
pixels[x + (y * width)] = video.pixels[ri];
}
}
}
}
updatePixels();
}
(Note that the above isn't tested, but hopefully the point gets across)
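The sketch above skips the effect whenever the mouse is within hsize of an edge. A possible variation is to clamp the bounding box to the image instead. The following plain-Java helper is only a sketch of that idea: BoxSampler and randomIndexInBox() are hypothetical names, and it assumes the mouse position lies inside the image and that half is at least 1.

```java
import java.util.Random;

// Hypothetical helper: sample a random pixel index from a clamped box.
final class BoxSampler {
    // Picks a random pixel inside the square of half-size `half` around (mx, my),
    // clamped to the image, and returns its index in a row-major pixels[] array.
    static int randomIndexInBox(Random rng, int mx, int my, int half, int w, int h) {
        int minX = Math.max(0, mx - half);
        int maxX = Math.min(w, mx + half); // exclusive
        int minY = Math.max(0, my - half);
        int maxY = Math.min(h, my + half); // exclusive
        // Pick the coordinates in 2D first ...
        int rx = minX + rng.nextInt(maxX - minX);
        int ry = minY + rng.nextInt(maxY - minY);
        // ... then convert them to a 1D index.
        return rx + ry * w;
    }

    public static void main(String[] args) {
        Random rng = new Random();
        // Mouse near the bottom-left corner of a 640x480 frame.
        int i = randomIndexInBox(rng, 5, 470, 20, 640, 480);
        System.out.println((i % 640) + ", " + (i / 640));
    }
}
```

Because the coordinates are chosen in 2D before conversion, the sampled pixel can never come from outside the box, which is exactly what a single random 1D index cannot guarantee.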

Different results GPU & CPU when more than one 8 work items per group

I'm new to OpenCL. As my first project I tried to write code that checks for intersections between many polylines and a single polygon.
I'm running the code on both the CPU and the GPU, and I get different results.
At first I passed NULL as the local work size when calling clEnqueueNDRangeKernel:
clEnqueueNDRangeKernel(command_queue, kIntersect, 1, NULL, &global, NULL, 2, &evtCalcBounds, &evtKernel);
After trying many things, I saw that if I pass 1 as the local size it works well and returns the same results for the CPU and the GPU:
size_t local = 1;
clEnqueueNDRangeKernel(command_queue, kIntersect, 1, NULL, &global, &local, 2, &evtCalcBounds, &evtKernel);
I played a bit more and found that the CPU returns wrong results when I run the kernel with a local size of 8 or more (for some reason).
I'm not using any local memory, just globals and privates.
I didn't add the code at first because I thought it was irrelevant to the problem (note that with a local size of 1 it works well), and it is long. If needed, I will try to simplify it.
The code flow goes like this:
I have the polyline coordinates stored in one big buffer and the single polygon in another. In addition, I provide another buffer holding a single int with the current result count. All buffers are __global arguments.
In the kernel I simply check for intersections between all the segments of polyline[get_global_id(0)] and the segments of the polygon. If one is found,
I use atomic_inc on the result count. There are no reads and writes to the same buffer, and no barriers or memory fences; atomic_inc is the only thread-safety mechanism I'm using.
-- UPDATE --
Added my code:
I know I could probably make better use of OpenCL's built-in functions for some of the vector calculations, but for now I'm simply converting code from my old single-threaded CPU program to OpenCL, so this is not my concern right now.
bool isPointInPolygon(float x, float y, __global float* polygon) {
bool blnInside = false;
uint length = convert_uint(polygon[4]);
int s = 5;
uint j = length - 1;
for (uint i = 0; i < length; j = i++) {
uint realIdx = s + i * 2;
uint realInvIdx = s + j * 2;
if (((polygon[realIdx + 1] > y) != (polygon[realInvIdx + 1] > y)) &&
(x < (polygon[realInvIdx] - polygon[realIdx]) * (y - polygon[realIdx + 1]) / (polygon[realInvIdx + 1] - polygon[realIdx + 1]) + polygon[realIdx]))
blnInside = !blnInside;
}
return blnInside;
}
bool isRectanglesIntersected(float p_dblMinX1, float p_dblMinY1,
float p_dblMaxX1, float p_dblMaxY1,
float p_dblMinX2, float p_dblMinY2,
float p_dblMaxX2, float p_dblMaxY2) {
bool blnResult = true;
if (p_dblMinX1 > p_dblMaxX2 ||
p_dblMaxX1 < p_dblMinX2 ||
p_dblMinY1 > p_dblMaxY2 ||
p_dblMaxY1 < p_dblMinY2) {
blnResult = false;
}
return blnResult;
}
bool isLinesIntersects(
double Ax, double Ay,
double Bx, double By,
double Cx, double Cy,
double Dx, double Dy) {
double distAB, theCos, theSin, newX, ABpos;
// Fail if either line is undefined.
if (Ax == Bx && Ay == By || Cx == Dx && Cy == Dy)
return false;
// (1) Translate the system so that point A is on the origin.
Bx -= Ax; By -= Ay;
Cx -= Ax; Cy -= Ay;
Dx -= Ax; Dy -= Ay;
// Discover the length of segment A-B.
distAB = sqrt(Bx*Bx + By*By);
// (2) Rotate the system so that point B is on the positive X axis.
theCos = Bx / distAB;
theSin = By / distAB;
newX = Cx*theCos + Cy*theSin;
Cy = Cy*theCos - Cx*theSin; Cx = newX;
newX = Dx*theCos + Dy*theSin;
Dy = Dy*theCos - Dx*theSin; Dx = newX;
// Fail if the lines are parallel.
return (Cy != Dy);
}
bool isPolygonInersectsPolyline(__global float* polygon, __global float* polylines, uint startIdx) {
uint polylineLength = convert_uint(polylines[startIdx]);
uint start = startIdx + 1;
float x1 = polylines[start];
float y1 = polylines[start + 1];
float x2;
float y2;
int polygonLength = convert_uint(polygon[4]);
int polygonLength2 = polygonLength * 2;
int startPolygonIdx = 5;
for (int currPolyineIdx = 0; currPolyineIdx < polylineLength - 1; currPolyineIdx++)
{
x2 = polylines[start + (currPolyineIdx*2) + 2];
y2 = polylines[start + (currPolyineIdx*2) + 3];
float polyX1 = polygon[0];
float polyY1 = polygon[1];
for (int currPolygonIdx = 0; currPolygonIdx < polygonLength; ++currPolygonIdx)
{
float polyX2 = polygon[startPolygonIdx + (currPolygonIdx * 2 + 2) % polygonLength2];
float polyY2 = polygon[startPolygonIdx + (currPolygonIdx * 2 + 3) % polygonLength2];
if (isLinesIntersects(x1, y1, x2, y2, polyX1, polyY1, polyX2, polyY2)) {
return true;
}
polyX1 = polyX2;
polyY1 = polyY2;
}
x1 = x2;
y1 = y2;
}
// No intersection found till now so we check containing
return isPointInPolygon(x1, y1, polygon);
}
__kernel void calcIntersections(__global float* polylines, // My flat points array - [pntCount, x,y,x,y,...., pntCount, x,y,... ]
__global float* pBounds, // The rectangle bounds of each polyline - set of 4 values [top, left, bottom, right....]
__global uint* pStarts, // The start index of each polyline in the polylines array
__global float* polygon, // The polygon i want to intersect with - first 4 items are the rectangle bounds [top, left, bottom, right, pntCount, x,y,x,y,x,y....]
__global float* output, // Result array for saving the intersections polylines indices
__global uint* resCount) // The result count
{
int i = get_global_id(0);
uint start = convert_uint(pStarts[i]);
if (isRectanglesIntersected(pBounds[i * 4], pBounds[i * 4 + 1], pBounds[i * 4 + 2], pBounds[i * 4 + 3],
polygon[0], polygon[1], polygon[2], polygon[3])) {
if (isPolygonInersectsPolyline(polygon, polylines, start)){
int oldVal = atomic_inc(resCount);
output[oldVal] = i;
}
}
}
Can anyone explain this to me?

OpenCL function calls

I'm working on an openCL kernel that loads up some points, decides which is the highest, and returns it. All good there, but I want to add a calculation before the highest evaluation. This compares the point to a pair of lines. I have it written and working to a degree, as follows:
size_t i = group_id * group_stride + local_id;
while (i < n){
//load up a pair of points using the index to locate them within a massive dataSet
int ia = LOAD_GLOBAL_I1(input, i);
float4 a = LOAD_GLOBAL_F4(dataSet, ia);
int ib = LOAD_GLOBAL_I1(input, i + group_size);
float4 b = LOAD_GLOBAL_F4(dataSet, ib);
//pre-assess the points relative to lines
if(pass == 0){
float px = a.x;
float py = a.y;
int checkAnswer;
//want to write this section as a function
float x1 = tri_input[0].x; float y1 = tri_input[0].y;
float x2 = tri_input[2].x; float y2 = tri_input[2].y;
float check = sign((x1-x2) * (py-y1) - (y2-y1) * (px-x1));
if(check != tri_input[3].x){ //point is outside line 1
checkAnswer = 1;
}
else{
x1 = tri_input[2].x; y1 = tri_input[2].y;
x2 = tri_input[1].x; y2 = tri_input[1].y;
check = sign((x1-x2)*(py-y1) - (y2-y1)*(px-x1));
if(check != tri_input[3].y){ //point is outside line 2
checkAnswer = 2;
}
else{
checkAnswer = 0; //point is within both lines
}}}
//later use the checkAnswer result to change the following
//find the highest of the pair
float4 result;
if(a.z>b.z) result = a;
else result = b;
//load up the previous highest result locally
float4 s = LOAD_LOCAL_F4(shared, local_id);
//if the previous highest beat this, stick, else twist
if(s.z>result.z){ STORE_LOCAL_F4(shared, local_id, s);}
else{ STORE_LOCAL_F4(shared, local_id, result);}
i += local_stride;
}
What I would like to do is call the line check twice as a function, i.e. the code becomes:
size_t i = group_id * group_stride + local_id;
while (i < n){
//load up a pair of points using the index to locate them within a massive dataSet
int ia = LOAD_GLOBAL_I1(input, i);
float4 a = LOAD_GLOBAL_F4(dataSet, ia);
int ib = LOAD_GLOBAL_I1(input, i + group_size);
float4 b = LOAD_GLOBAL_F4(dataSet, ib);
//pre-assess the points relative to lines
if(pass == 0){
float px = a.x;
float py = a.y;
int checkA = pointCheck( px, py, tri_input);
px = b.x;
py = b.y;
int checkB = pointCheck( px, py, tri_input);
}
//later use the checkAnswer result to change the following
//find the highest of the pair
float4 result;
if(a.z>b.z) result = a;
else result = b;
//load up the previous highest result locally
float4 s = LOAD_LOCAL_F4(shared, local_id);
//if the previous highest beat this, stick, else twist
if(s.z>result.z){ STORE_LOCAL_F4(shared, local_id, s);}
else{ STORE_LOCAL_F4(shared, local_id, result);}
i += local_stride;
}
In this instance the function is:
int pointCheck( float *px, float *py, float2 *testLines){
float x1 = testLines[0].x; float y1 = testLines[0].y;
float x2 = testLines[2].x; float y2 = testLines[2].y;
float check = sign((x1-x2) * (py-y1) - (y2-y1) * (px-x1));
if(check != testLines[3].x){ //point is outside line 1
return 1;
}
else{
x1 = testLines[2].x; y1 = testLines[2].y;
x2 = testLines[1].x; y2 = testLines[1].y;
check = sign((x1-x2)*(py-y1) - (y2-y1)*(px-x1));
if(check != testLines[3].y){ //point is outside line 2
return 2;
}
else{
return 0; //point is within both lines
}}}
Whilst the longhand version runs fine and returns a normal 'highest point' result, the function version returns an erroneous result (not detecting the highest point I have hidden in the data set). It produces a wrong result even though the function as yet has no overall effect.
What am I doing wrong?
S
[Update]:
This revised function works as far as the commented out line, then hangs on something:
int pointCheck(float4 *P, float2 *testLines){
float2 *l0 = &testLines[0];
float2 *l1 = &testLines[1];
float2 *l2 = &testLines[2];
float2 *l3 = &testLines[3];
float x1 = l0->x; float y1 = l0->y;
float x2 = l2->x; float y2 = l2->y;
float pX = P->x; float pY = P->y;
float c1 = l3->x; float c2 = l3->y;
//float check = sign((x1-x2) * (pY-y1) - (y2-y1) * (pX-x1)); //seems to be a problem with sign
// if(check != c1){ //point is outside line 1
// return 1;
// }
// else{
// x1 = l2->x; y1 = l2->y;
// x2 = l1->x; y2 = l1->y;
// check = sign((x1-x2) * (pY-y1) - (y2-y1) * (pX-x1));
// if(check != c2){ //point is outside line 2
// return 2;
// }
// else{
// return 0; //point is within both lines
// }}
}

One immediate issue is how you pass the parameters to the called function:
int checkA = pointCheck( px, py, tri_input);
whereas the function itself expects pointers for px and py. You should instead call the function as:
int checkA = pointCheck(&px, &py, tri_input);
It is surprising that OpenCL does not give build errors for this kernel.
In my experience, some OpenCL runtimes do not like multiple return statements in a single function. Try saving the return value in a local variable and using a single return statement at the end of the function. The reason is that many OpenCL compilers do not emit real function calls but inline all functions directly into the kernel. A good practice is therefore to mark all non-__kernel functions as inline and treat them as such (i.e. make it easier for the compiler to inline your function by avoiding multiple return statements).

OpenCL traversal kernel - further optimization

Currently, I have an OpenCL kernel for ray traversal, shown below. I'd be glad if someone had some pointers on optimizing this quite large kernel.
I'm running this code with a SAH BVH, and I'd like to get performance similar to what Timo Aila reports for his traversal kernels in his paper (Understanding the Efficiency of Ray Traversal on GPUs). Of course, his code uses SplitBVH (which I might consider using in place of the SAH BVH, but in my opinion it has really slow build times). But I'm asking about traversal, not the BVH (also, so far I've worked only with scenes where SplitBVH won't give much advantage over a SAH BVH).
First of all, here is what I have so far (a standard while-while traversal kernel).
__constant sampler_t sampler = CLK_FILTER_NEAREST;
// Inline definition of horizontal max
inline float max4(float a, float b, float c, float d)
{
return max(max(max(a, b), c), d);
}
// Inline definition of horizontal min
inline float min4(float a, float b, float c, float d)
{
return min(min(min(a, b), c), d);
}
// Traversal kernel
__kernel void traverse( __read_only image2d_t nodes,
__global const float4* triangles,
__global const float4* rays,
__global float4* result,
const int num,
const int w,
const int h)
{
// Ray index
int idx = get_global_id(0);
if(idx < num)
{
// Stack
int todo[32];
int todoOffset = 0;
// Current node
int nodeNum = 0;
float tmin = 0.0f;
float depth = 2e30f;
// Fetch ray origin, direction and compute invdirection
float4 origin = rays[2 * idx + 0];
float4 direction = rays[2 * idx + 1];
float4 invdir = native_recip(direction);
float4 temp = (float4)(0.0f, 0.0f, 0.0f, 1.0f);
// Traversal loop
while(true)
{
// Fetch node information
int2 nodeCoord = (int2)((nodeNum << 2) % w, (nodeNum << 2) / w);
int4 specs = read_imagei(nodes, sampler, nodeCoord + (int2)(3, 0));
// While node isn't leaf
while(specs.z == 0)
{
// Fetch child bounding boxes
float4 n0xy = read_imagef(nodes, sampler, nodeCoord);
float4 n1xy = read_imagef(nodes, sampler, nodeCoord + (int2)(1, 0));
float4 nz = read_imagef(nodes, sampler, nodeCoord + (int2)(2, 0));
// Test ray against child bounding boxes
float oodx = origin.x * invdir.x;
float oody = origin.y * invdir.y;
float oodz = origin.z * invdir.z;
float c0lox = n0xy.x * invdir.x - oodx;
float c0hix = n0xy.y * invdir.x - oodx;
float c0loy = n0xy.z * invdir.y - oody;
float c0hiy = n0xy.w * invdir.y - oody;
float c0loz = nz.x * invdir.z - oodz;
float c0hiz = nz.y * invdir.z - oodz;
float c1loz = nz.z * invdir.z - oodz;
float c1hiz = nz.w * invdir.z - oodz;
float c0min = max4(min(c0lox, c0hix), min(c0loy, c0hiy), min(c0loz, c0hiz), tmin);
float c0max = min4(max(c0lox, c0hix), max(c0loy, c0hiy), max(c0loz, c0hiz), depth);
float c1lox = n1xy.x * invdir.x - oodx;
float c1hix = n1xy.y * invdir.x - oodx;
float c1loy = n1xy.z * invdir.y - oody;
float c1hiy = n1xy.w * invdir.y - oody;
float c1min = max4(min(c1lox, c1hix), min(c1loy, c1hiy), min(c1loz, c1hiz), tmin);
float c1max = min4(max(c1lox, c1hix), max(c1loy, c1hiy), max(c1loz, c1hiz), depth);
bool traverseChild0 = (c0max >= c0min);
bool traverseChild1 = (c1max >= c1min);
nodeNum = specs.x;
int nodeAbove = specs.y;
// We hit just one out of 2 childs
if(traverseChild0 != traverseChild1)
{
if(traverseChild1)
{
nodeNum = nodeAbove;
}
}
// We hit either both or none
else
{
// If we hit none, pop node from stack (or exit traversal, if stack is empty)
if (!traverseChild0)
{
if(todoOffset == 0)
{
break;
}
nodeNum = todo[--todoOffset];
}
// If we hit both
else
{
// Sort them (so nearest goes 1st, further 2nd)
if(c1min < c0min)
{
unsigned int tmp = nodeNum;
nodeNum = nodeAbove;
nodeAbove = tmp;
}
// Push further on stack
todo[todoOffset++] = nodeAbove;
}
}
// Fetch next node information
nodeCoord = (int2)((nodeNum << 2) % w, (nodeNum << 2) / w);
specs = read_imagei(nodes, sampler, nodeCoord + (int2)(3, 0));
}
// If node is leaf & has some primitives
if(specs.z > 0)
{
// Loop through primitives & perform intersection with them (Woop triangles)
for(int i = specs.x; i < specs.y; i++)
{
// Fetch first point from global memory
float4 v0 = triangles[i * 4 + 0];
float o_z = v0.w - origin.x * v0.x - origin.y * v0.y - origin.z * v0.z;
float i_z = 1.0f / (direction.x * v0.x + direction.y * v0.y + direction.z * v0.z);
float t = o_z * i_z;
if(t > 0.0f && t < depth)
{
// Fetch second point from global memory
float4 v1 = triangles[i * 4 + 1];
float o_x = v1.w + origin.x * v1.x + origin.y * v1.y + origin.z * v1.z;
float d_x = direction.x * v1.x + direction.y * v1.y + direction.z * v1.z;
float u = o_x + t * d_x;
if(u >= 0.0f && u <= 1.0f)
{
// Fetch third point from global memory
float4 v2 = triangles[i * 4 + 2];
float o_y = v2.w + origin.x * v2.x + origin.y * v2.y + origin.z * v2.z;
float d_y = direction.x * v2.x + direction.y * v2.y + direction.z * v2.z;
float v = o_y + t * d_y;
if(v >= 0.0f && u + v <= 1.0f)
{
// We got successful hit, store the information
depth = t;
temp.x = u;
temp.y = v;
temp.z = t;
temp.w = as_float(i);
}
}
}
}
}
// Pop node from stack (if empty, finish traversal)
if(todoOffset == 0)
{
break;
}
nodeNum = todo[--todoOffset];
}
// Store the ray traversal result in global memory
result[idx] = temp;
}
}
The first question of the day is: how could one write his persistent while-while and speculative while-while kernels in OpenCL?
Regarding persistent while-while: do I understand correctly that I just launch the kernel with a global work size equal to the local work size, and that both numbers should equal the warp/wavefront size of the GPU?
As I understand it, with CUDA the persistent-thread implementation looks like this:
do
{
volatile int& jobIndexBase = nextJobArray[threadIndex.y];
if(threadIndex.x == 0)
{
jobIndexBase = atomicAdd(&warpCounter, WARP_SIZE);
}
index = jobIndexBase + threadIndex.x;
if(index >= totalJobs)
return;
/* Perform work for task numbered 'index' */
}
while(true);
What could the equivalent look like in OpenCL? I know I'll have to use some barriers in there, and I also know that one should go after the point where I atomically add WARP_SIZE to warpCounter.
Regarding speculative traversal: I don't really have any idea how this should be implemented in OpenCL, so any hints are welcome. I also don't know where to put the barriers (putting them around the simulated __any results in a driver crash).
If you made it here, thanks for reading; any hints, answers, etc. are welcome!

One optimization you can make is to use vector variables and the fused multiply-add function (fma) to speed up the setup math. As for the rest of the kernel, it is slow because it is branchy. If you can make assumptions about the input data, you might be able to reduce the execution time by reducing the number of branches. I have not checked the float4 swizzles (the .xxyy and .x/.y/.z/.w after the float4 variables), so double-check them.
float4 n0xy = read_imagef(nodes, sampler, nodeCoord);
float4 n1xy = read_imagef(nodes, sampler, nodeCoord + (int2)(1, 0));
float4 nz = read_imagef(nodes, sampler, nodeCoord + (int2)(2, 0));
// -(origin * invdir), swizzled below so each lane gets the matching axis
float4 oodf4 = -origin * invdir;
// child 0 x/y slabs: (lox, hix, loy, hiy)
float4 c0xyf4 = fma(n0xy, invdir.xxyy, oodf4.xxyy);
// z slabs of both children: (c0loz, c0hiz, c1loz, c1hiz)
float4 c0zc1z = fma(nz, invdir.zzzz, oodf4.zzzz);
float c0min = max4(min(c0xyf4.x, c0xyf4.y), min(c0xyf4.z, c0xyf4.w), min(c0zc1z.x, c0zc1z.y), tmin);
float c0max = min4(max(c0xyf4.x, c0xyf4.y), max(c0xyf4.z, c0xyf4.w), max(c0zc1z.x, c0zc1z.y), depth);
// child 1 x/y slabs: (lox, hix, loy, hiy)
float4 c1xy = fma(n1xy, invdir.xxyy, oodf4.xxyy);
float c1min = max4(min(c1xy.x, c1xy.y), min(c1xy.z, c1xy.w), min(c0zc1z.z, c0zc1z.w), tmin);
float c1max = min4(max(c1xy.x, c1xy.y), max(c1xy.z, c1xy.w), max(c0zc1z.z, c0zc1z.w), depth);

How do I optimize displaying a large number of quads in OpenGL?

I am trying to display a mathematical surface f(x,y) defined on a XY regular mesh using OpenGL and C++ in an effective manner:
struct XYRegularSurface {
double x0, y0;
double dx, dy;
int nx, ny;
XYRegularSurface(int nx_, int ny_) : nx(nx_), ny(ny_) {
z = new float[nx*ny];
}
~XYRegularSurface() {
delete [] z;
}
float& operator()(int ix, int iy) {
return z[ix*ny + iy];
}
float x(int ix, int iy) {
return x0 + ix*dx;
}
float y(int ix, int iy) {
return y0 + iy*dy;
}
float zmin();
float zmax();
float* z;
};
Here is my OpenGL paint code so far:
void color(QColor & col) {
float r = col.red()/255.0f;
float g = col.green()/255.0f;
float b = col.blue()/255.0f;
glColor3f(r,g,b);
}
void paintGL_XYRegularSurface(XYRegularSurface &surface, float zmin, float zmax) {
float x, y, z;
QColor col;
glBegin(GL_QUADS);
for(int ix = 0; ix < surface.nx - 1; ix++) {
for(int iy = 0; iy < surface.ny - 1; iy++) {
x = surface.x(ix,iy);
y = surface.y(ix,iy);
z = surface(ix,iy);
col = rainbow(zmin, zmax, z);color(col);
glVertex3f(x, y, z);
x = surface.x(ix + 1, iy);
y = surface.y(ix + 1, iy);
z = surface(ix + 1,iy);
col = rainbow(zmin, zmax, z);color(col);
glVertex3f(x, y, z);
x = surface.x(ix + 1, iy + 1);
y = surface.y(ix + 1, iy + 1);
z = surface(ix + 1,iy + 1);
col = rainbow(zmin, zmax, z);color(col);
glVertex3f(x, y, z);
x = surface.x(ix, iy + 1);
y = surface.y(ix, iy + 1);
z = surface(ix,iy + 1);
col = rainbow(zmin, zmax, z);color(col);
glVertex3f(x, y, z);
}
}
glEnd();
}
The problem is that this is slow, nx=ny=1000 and fps ~= 1.
How do I optimize this to be faster?
EDIT: following your suggestion (thanks!) regarding VBO
I added:
float* XYRegularSurface::xyz() {
float* data = new float[3*nx*ny];
long i = 0;
for(int ix = 0; ix < nx; ix++) {
for(int iy = 0; iy < ny; iy++) {
data[i++] = x(ix,iy);
data[i++] = y(ix,iy);
data[i++] = z[ix*ny + iy]; // index z by grid position, not by the running data index
}
}
return data;
}
I think I understand how to create a VBO, initialize it with xyz() and send it to the GPU in one go, but how do I use the VBO when drawing? I understand that this can be done either in the vertex shader or with glDrawElements; I assume the latter is easier? If so: I do not see any QUAD mode in the documentation for glDrawElements!?
Edit2:
So I can loop through all nx*ny quads and draw each with:
GLuint indices[4]; // GL_UNSIGNED_INT is an enum value, GLuint is the type
// ... set indices
glDrawElements(GL_QUADS, 4, GL_UNSIGNED_INT, indices); // count is the number of indices
?

1/. Use display lists to cache GL commands, avoiding recalculation of the vertices and the expensive per-vertex call overhead. If the data is updated, you need to look at client-side vertex arrays (not to be confused with VAOs). Now ignore this option...
2/. Use vertex buffer objects, available as of GL 1.5.
Since you need VBOs for the core profile anyway (i.e., modern GL), you may as well get to grips with them first.

Well, you've asked a rather open ended question. I'd suggest using modern (3.0+) OpenGL for everything. The point of just about any new OpenGL feature is to provide a faster way to do things. Like everyone else is suggesting, use array (vertex) buffer objects and vertex array objects. Use an element array (index) buffer object too. Most GPUs have a 'post-transform cache', which stores the last few transformed vertices, but this can only be used when you call the glDraw*Elements family of functions. I also suggest you store a flat mesh in your VBO, where y=0 for each vertex. Sample the y from a heightmap texture in your vertex shader. If you do this, whenever the surface changes you will only need to update the heightmap texture, which is easier than updating the VBO. Use one of the floating point or integer texture formats for a heightmap, so you aren't restricted to having your values be between 0 and 1.

If so: I do not see any QUAD mode in the documentation for glDrawElements!?
If you want quads make sure you're looking at the GL 2.1-era docs, not the new stuff.
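If you would rather not depend on GL_QUADS, one common route is to split each quad into two triangles and draw the whole grid with a single glDrawElements(GL_TRIANGLES, ...) call. The index arithmetic is sketched here in plain Java; the same loops carry over to C++ directly. GridIndices and triangleIndices() are hypothetical names, and the sketch assumes the vertices are stored in the order produced by xyz() in the question, i.e. vertex number ix*ny + iy.

```java
// Hypothetical helper: triangle indices for a regular nx-by-ny vertex grid.
final class GridIndices {
    // Assumes vertex (ix, iy) is stored at position ix * ny + iy in the vertex buffer.
    static int[] triangleIndices(int nx, int ny) {
        int[] indices = new int[(nx - 1) * (ny - 1) * 6];
        int k = 0;
        for (int ix = 0; ix < nx - 1; ix++) {
            for (int iy = 0; iy < ny - 1; iy++) {
                int v00 = ix * ny + iy;           // (ix,     iy)
                int v10 = (ix + 1) * ny + iy;     // (ix + 1, iy)
                int v11 = (ix + 1) * ny + iy + 1; // (ix + 1, iy + 1)
                int v01 = ix * ny + iy + 1;       // (ix,     iy + 1)
                // First triangle, same winding as the quad in the question.
                indices[k++] = v00;
                indices[k++] = v10;
                indices[k++] = v11;
                // Second triangle completes the quad.
                indices[k++] = v00;
                indices[k++] = v11;
                indices[k++] = v01;
            }
        }
        return indices;
    }

    public static void main(String[] args) {
        int[] indices = triangleIndices(1000, 1000);
        System.out.println(indices.length); // 5988006
    }
}
```

The index array is built once, uploaded into an element array buffer, and then the whole surface is drawn in one call with the index count and GL_UNSIGNED_INT, rather than one call per quad.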
