Manual loop unrolling with known maximum size

Please take a look at this code in an OpenCL kernel:
uint point_color = 4278190080;
float point_percent = 1.0f;
float near_pixel_size = (...);
float far_pixel_size = (...);
float delta_pixel_size = far_pixel_size - near_pixel_size;
float3 near = (...);
float3 far = (...);
float3 direction = normalize(far - near);
point_position = (...) + 10;
for (size_t p = 0; p < point_count; p++, point_position += 4)
{
    float3 point = (float3)(point_list[point_position], point_list[point_position + 1], point_list[point_position + 2]);
    float projection = dot(point - near, direction);
    float3 projected = near + direction * projection;
    float rejection_length = distance(point, projected);
    float percent = projection / segment_length;
    float pixel_size = near_pixel_size + percent * delta_pixel_size;
    bool is_candidate = (pixel_size > rejection_length && point_percent > percent);
    point_color = (is_candidate ? (uint)point_list[point_position + 3] | 4278190080 : point_color);
    point_percent = (is_candidate ? percent : point_percent);
}
This code attempts to find the point in a list that is nearest to the line segment between far and near, assigning its color to point_color and its "percentual distance" to point_percent. (Incidentally, the code seems to be OK.)
The number of elements specified by point_count is variable, so I cannot assume too much about it, save for one thing: point_count will always be less than or equal to 8. That's a fixed fact in my code and data.
I would like to unroll this loop manually, and I'm afraid I will need to use lots of
value = (point_count < constant ? new_value : value)
for all lines in it. In your experience, will such a strategy increase performance in my kernel?
And yes, I know, I should be performing some benchmarking by myself; I just wanted to ask someone with lots of experience in OpenCL before actually attempting this on my own.

Most OpenCL drivers (that I'm familiar with, at least) support the use of #pragma unroll to unroll loops at compile time. Simply use it like so:
#pragma unroll
for (int i = 0; i < 4; i++) {
    /* ... */
}
It's effectively the same as unrolling it manually, with none of the effort. In your case, this would probably look more like:
if (pointCount == 1) {
    /* ... */
} else if (pointCount == 2) {
    #pragma unroll
    for (int i = 0; i < 2; i++) { /* ... */ }
} else if (pointCount == 3) {
    #pragma unroll
    for (int i = 0; i < 3; i++) { /* ... */ }
}
I can't say for certain whether there will be an improvement, but benchmarking is the one way to find out. If pointCount is the same for every work-item in a local work group, for example, the branches are uniform and it might improve performance; if it varies freely between work-items, the divergent branches might actually make things worse.
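An alternative to branching on every possible count is to always run the loop to the fixed maximum and guard each iteration, which is the "value = (point_count < constant ? new_value : value)" idea from the question. The following is only a minimal sketch in plain C, not the asker's kernel: nearest_percent and MAX_POINTS are hypothetical names, and the per-point test is reduced to a single comparison.

```c
#include <stddef.h>

#define MAX_POINTS 8  /* the asker's stated upper bound on point_count */

/* Hypothetical stand-in for the kernel's per-point test: keep the smallest
 * percent among the first `count` entries that is below `start`. */
static float nearest_percent(const float *percents, size_t count, float start)
{
    float best = start;
    /* Constant trip count, so a compiler is free to unroll this fully. */
    for (size_t i = 0; i < MAX_POINTS; i++) {
        /* The guard replaces the variable loop bound; because && short-circuits,
         * slots at or past `count` are never read. */
        int is_candidate = (i < count) && (percents[i] < best);
        best = is_candidate ? percents[i] : best;
    }
    return best;
}
```

Every work-item then executes the same eight iterations, so there is no divergence on the count, at the cost of doing wasted work when the count is small. Whether that trade is a win is something only a benchmark on the target device can tell.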

Drawing the "Seed of Life" without redrawing anything

I have a mildly interesting problem which I can't quite figure out (although in fairness, I am pretty drunk)
The "Seed of Life" is a pattern created by drawing circles of equal radius, each centred on an intersection of the previous circles.
Language doesn't really matter, the theory is more important here. Anything which can draw a circle will do it. For example, HTML5 + JS canvas can do it. It's a lovely example of how recursion can help solve problems.
The problem is that a naive approach will end up redrawing many, many circles. With 7 layers, you'll end up with over 300,000 circle draws.
A simple approach is to maintain a list of previous circle centre points, and only draw circles which are not in that list.
My question is whether there's a "better" way to approach this? Something which doesn't require checking that list.
A fun problem to ponder.

I think I have this solved thanks to a friend. I'll post here what I'm doing now in case someone ever is curious.
In short, starting from the center and working outwards, calculate the vertices of a hexagon for each layer, and subdivide each edge of that hexagon into i equal segments, where i is the layer number.
I drew it in C# using SkiaSharp, but nothing in the code is specific to the language; there's no reason this couldn't be written in any other. Here are the significant bits:
const float seedAngle = (float)(Math.PI / 3.0);
static void SeedOfLifeDemo(int x, int y) {
    //setting up Skia stuff, this will be different depending what language you're using.
    var info = new SKImageInfo(x, y);
    using var bitmap = FlatImage(info, SKColors.White);
    SKCanvas canvas = new SKCanvas(bitmap);
    float radius = Math.Min(x, y) / 15;
    SKPoint center = new SKPoint(x / 2f, y / 2f);
    SKPaint strokePaint = new SKPaint {
        Color = SKColors.Black,
        Style = SKPaintStyle.Stroke,
        StrokeWidth = 1,
        IsAntialias = true,
    };
    int layers = 4;
    //Draw the very central circle. This is just a little easier than adding that edge case to SubdividedHexagonAboutPoint
    canvas.DrawCircle(center, radius, strokePaint);
    for (int i = 1; i <= layers; i++) {
        foreach (SKPoint p in SubdividedHexagonAboutPoint(center, radius * i, i)) {
            canvas.DrawCircle(p, radius, strokePaint);
        }
    }
    SaveImage(bitmap, "SeedOfLifeFastDemo.Jpg");//More Skia specific stuff
}
//The magic!
static List<SKPoint> SubdividedHexagonAboutPoint(SKPoint centre, float radius, int subdivisions) {
    List<SKPoint> points = new List<SKPoint>(6 * subdivisions);
    SKPoint? prevPoint = null;
    for (int i = 0; i < 7; i++) {//Step around the circle. The 7th step is to close the last edge
        float x = (float)(Math.Sin(seedAngle * i) * radius + centre.X);
        float y = (float)(Math.Cos(seedAngle * i) * radius + centre.Y);
        SKPoint point = new SKPoint(x, y);
        if (prevPoint != null) {
            points.Add(point);//include the "primary" 6 points
            if (subdivisions > 0) {
                float xDist = (point.X - prevPoint.Value.X) / subdivisions;
                float yDist = (point.Y - prevPoint.Value.Y) / subdivisions;
                for (int sub = 1; sub < subdivisions; sub++) {
                    SKPoint subPoint = new SKPoint(point.X - xDist * sub, point.Y - yDist * sub);
                    points.Add(subPoint);//include the edge subdivisions
                }
            }
        }
        prevPoint = point;
    }
    return points;
}
This is quite an interesting exercise really, and another example of where recursion can really bite you when used badly.
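To put a number on the saving: in the code above, each of the 6 hexagon edges in layer i yields i centres (one vertex plus i - 1 subdivisions), so a layer adds 6 * i circles. A small sketch in C of that arithmetic, where circle_count is a hypothetical helper and not part of the original program:

```c
/* Circles drawn for `layers` rings around the central circle, assuming
 * layer i contributes 6 * i distinct centres (6 edges, i points per edge). */
static long circle_count(int layers)
{
    long total = 1;  /* the central circle */
    for (int i = 1; i <= layers; i++) {
        total += 6L * i;
    }
    return total;
}
```

Under that assumption, 4 layers need 61 circles and 7 layers need 169, compared with the hundreds of thousands of draws quoted for the naive recursive approach.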

OpenCL Optimization

I'm new to OpenCL.
I wrote an OpenCL kernel to compute grayscale. How can I optimize this code, if that is possible? Why does the computation time fluctuate so much? Sometimes I get a speedup, other times not. Am I doing something wrong?
Kernel code:
kernel void grayscale(__global unsigned char *input)
{
    size_t i = get_global_id(0);
    float grayscaleValue = (input[i*3] * 0.299F) + (input[i*3+1] * 0.587F) + (input[i*3+2] * 0.114F);
    input[i*3] = grayscaleValue;
    input[i*3+1] = grayscaleValue;
    input[i*3+2] = grayscaleValue;
}
CPU code:
void GrayScaleCPU(struct PPMFile *ppmStruct)
{
    for (int i = 0; i < ppmStruct->imageSize; i+=3)
    {
        float greyscaleValue = (ppmStruct->data[i] * 0.299F) + (ppmStruct->data[i+1] * 0.587F) + (ppmStruct->data[i+2] * 0.114F);
        ppmStruct->out[i] = greyscaleValue;
        ppmStruct->out[i+1] = greyscaleValue;
        ppmStruct->out[i+2] = greyscaleValue;
    }
}
int main(void)
{
    struct timespec tS1, tS2;
    tS1.tv_sec = 0;
    tS1.tv_nsec = 0;
    tS2.tv_sec = 0;
    tS2.tv_nsec = 0;
    ...
    clock_settime(CLOCK_REALTIME, &tS1);
    GrayScaleCPU(ppmf);
    clock_gettime(CLOCK_REALTIME, &tS1);
    printf ("Timming took %.12lu seconds to run.\n", tS1.tv_nsec);
    ...
    clock_settime(CLOCK_REALTIME, &tS2);
    GrayScaleOpenCL(ppmf2);
    clock_gettime(CLOCK_REALTIME, &tS2);
    printf ("Timming took %.12lu seconds to run.\n", tS2.tv_nsec);
    float time2 = tS2.tv_nsec;
    float time1 = tS1.tv_nsec;
    float speedup = time2/time1;
    printf ("Speed UP OpenCL/CPU %.20f.\n", speedup);
    return 0;
}

Try buffering your global memory reads into private (per-work-item) variables:
unsigned char l_input0 = input[i*3];
unsigned char l_input1 = input[i*3 + 1];
unsigned char l_input2 = input[i*3 + 2];
//compute grayscaleValue using l_input0,1,2
input[i*3] = grayscaleValue;
input[i*3 + 1] = grayscaleValue;
input[i*3 + 2] = grayscaleValue;
Also, if your data isn't spaced properly when you call your kernel, you may end up executing on each unsigned char, instead of every 3rd unsigned char as in your for loop example.
You can then go further using local memory and work groups and do your calculations in chunks, though that is more challenging, as local work sizes are very device specific and the global work size needs to be a multiple of the local work size. I've found local work sizes of 16, 32, and 64 work on most devices.
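One common way to satisfy the work-size constraint is to round the global work size up to the next multiple of the chosen local size on the host. A minimal sketch in C, where round_up_global_size is a hypothetical helper name:

```c
#include <stddef.h>

/* Smallest multiple of `local_size` that is >= `item_count`.
 * Assumes local_size > 0. */
static size_t round_up_global_size(size_t item_count, size_t local_size)
{
    size_t remainder = item_count % local_size;
    return remainder == 0 ? item_count : item_count + (local_size - remainder);
}
```

For the grayscale kernel, item_count would be the number of pixels (one work-item per pixel, not per byte). The padded work-items have no pixel to process, so the kernel would need an extra argument, say a hypothetical pixel_count, and an early return when get_global_id(0) is at or beyond it.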
Finally, when benchmarking OpenCL, make sure you are measuring kernel performance and not kernel enqueue time. The easiest way to do this is to start a timer, enqueue your kernel, call clFinish on the queue, then stop the timer. Most OpenCL devices also have timing and profiling built in, which are handled by the queue.
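Note also that the timing code in the question calls clock_settime before the work and then prints only the tv_nsec field of a single clock_gettime reading. That value is the nanosecond part of the wall-clock time, not a duration, which would explain results that jump around. Below is a minimal sketch of interval timing in C, assuming a POSIX system; elapsed_seconds and time_call are hypothetical helper names.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime and CLOCK_MONOTONIC */
#endif
#include <time.h>

/* Seconds elapsed between two timestamps, combining both fields. */
static double elapsed_seconds(struct timespec start, struct timespec end)
{
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Times one call to `work`. For the OpenCL path, `work` would enqueue the
 * kernel and then call clFinish so that the kernel has actually completed. */
static double time_call(void (*work)(void *), void *arg)
{
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);  /* read the clock, never set it */
    work(arg);
    clock_gettime(CLOCK_MONOTONIC, &end);
    return elapsed_seconds(start, end);
}
```

With both durations measured this way, a speedup would be the CPU time divided by the OpenCL time. Running each version several times and discarding the first run helps separate one-off setup cost from steady-state cost.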

Get ints (of various sizes) from boolean array

OK, say I have a boolean array called bits, and an int called cursor.
I know I can access individual bits by using bits[cursor], and that I can use bit logic to get larger datatypes from bits, for example:
short result = (bits[cursor] << 3) |
(bits[cursor+1] << 2) |
(bits[cursor+2] << 1) |
bits[cursor+3];
This is going to result in lines and lines of code when reading larger types like int32 and int64 though.
Is it possible to do a cast of some kind and achieve the same result? I'm not concerned about safety at all in this context (these functions will be wrapped into a class that handles that)
Say I wanted to get an uint64_t out of bits, starting at an arbitrary address specified by cursor, when cursor isn't necessarily a multiple of 64; is this possible by a cast? I thought this
uint64_t result = (uint64_t *)(bits + cursor)[0];
would work, but it doesn't compile.
Sorry, I know this is a dumb question; I'm quite inexperienced with pointer math. I'm not looking just for a short solution, I'm also looking for a breakdown of the syntax if anyone would be kind enough.
Thanks!

You could try something like this and cast the result to your target data size.
uint64_t bitsToUint64(bool *bits, unsigned int bitCount)
{
    uint64_t result = 0;
    uint64_t tempBits = 0;
    if(bitCount > 0 && bitCount <= 64)
    {
        for(unsigned int i = 0, j = bitCount - 1; i < bitCount; i++, j--)
        {
            tempBits = (bits[i])?1:0;
            result |= (tempBits << j);
        }
    }
    return result;
}
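On the syntax in the question: the subscript operator binds more tightly than a cast, so (uint64_t *)(bits + cursor)[0] applies the cast to the bool read by [0] rather than to the pointer. Even with extra parentheses, a cast could not give the intended result, because each bool occupies at least one whole byte, so reinterpreting the array would read eight bools as eight bytes, not 64 bits. A loop is therefore needed. The sketch below is a variant of the function above that takes the cursor directly; read_bits is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical variant of the answer's function: packs `bit_count` bools,
 * most significant bit first, starting at bits[cursor]. */
static uint64_t read_bits(const bool *bits, unsigned int cursor, unsigned int bit_count)
{
    uint64_t result = 0;
    if (bit_count > 64) {
        return 0;  /* assumption: treat oversized requests as empty, like the answer does */
    }
    for (unsigned int i = 0; i < bit_count; i++) {
        /* Shift the accumulated value left and append the next bit. */
        result = (result << 1) | (bits[cursor + i] ? 1u : 0u);
    }
    return result;
}
```

The four-bit example from the question would then be written as short result = (short)read_bits(bits, cursor, 4);.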

boosting parallel reduction OpenCL

I have an algorithm that performs a two-stage parallel reduction on the GPU to find the smallest element in a string. I know that there is a hint on how to make it work faster, but I don't know what it is. Any ideas on how I can tune this kernel to speed my program up? It is not necessary to actually change the algorithm; maybe there are other tricks. All ideas are welcome.
Thank you!
__kernel
void reduce(__global float* buffer,
            __local float* scratch,
            __const int length,
            __global float* result) {
    int global_index = get_global_id(0);
    float accumulator = INFINITY;
    while (global_index < length) {
        float element = buffer[global_index];
        accumulator = (accumulator < element) ? accumulator : element;
        global_index += get_global_size(0);
    }
    int local_index = get_local_id(0);
    scratch[local_index] = accumulator;
    barrier(CLK_LOCAL_MEM_FENCE);
    for(int offset = get_local_size(0) / 2;
        offset > 0;
        offset = offset / 2) {
        if (local_index < offset) {
            float other = scratch[local_index + offset];
            float mine = scratch[local_index];
            scratch[local_index] = (mine < other) ? mine : other;
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (local_index == 0) {
        result[get_group_id(0)] = scratch[0];
    }
}

accumulator = (accumulator < element) ? accumulator : element;
Use the fmin function. It is exactly what you need, and it may result in faster code (a call to a built-in instruction, if available, instead of costly branching).
global_index += get_global_size(0);
What is your typical get_global_size(0)?
Though your access pattern is not very bad (it is coalesced, in 128-byte chunks for a 32-thread warp), it is better to access memory sequentially whenever possible. For instance, sequential access may aid memory prefetching (note that OpenCL code can be executed on any device, including a CPU).
Consider the following scheme: each thread processes the range
[ get_global_id(0)*delta , (get_global_id(0)+1)*delta )
This results in fully sequential access within each thread.
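The range scheme can be sketched on the host in plain C, with id standing in for get_global_id(0). This is only an illustration under assumptions: chunk_min is a hypothetical name, and delta is taken to be the length divided by the number of work-items, rounded up.

```c
#include <math.h>    /* INFINITY */
#include <stddef.h>

/* Minimum over the contiguous range [id * delta, (id + 1) * delta),
 * clamped to `length`; `id` plays the role of get_global_id(0). */
static float chunk_min(const float *buffer, size_t length, size_t id, size_t delta)
{
    size_t begin = id * delta;
    size_t end = begin + delta;
    if (end > length) {
        end = length;  /* the last chunk may be short */
    }
    float accumulator = INFINITY;
    for (size_t i = begin; i < end; i++) {
        /* In the OpenCL kernel this would be fmin(accumulator, buffer[i]). */
        accumulator = (buffer[i] < accumulator) ? buffer[i] : accumulator;
    }
    return accumulator;
}
```

Each work-item's partial minimum would then go into scratch exactly as in the original kernel. Whether contiguous chunks beat the strided loop depends on the device, so it is worth measuring both.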

Flex/Actionscript White to Transparent

I am trying to write something in my Flex 3 application with ActionScript that will take an image and, when a user clicks a button, strip out all the white(ish) pixels and convert them to transparent. I say white(ish) because I have tried exactly white, but I get a lot of artifacts around the edges. I have gotten somewhat close using the following code:
targetBitmapData.threshold(sourceBitmapData, sourceBitmapData.rect, new Point(0,0), ">=", 0xFFf7f0f2, 0x00FFFFFF, 0xFFFFFFFF, true);
However, it also makes reds and yellows disappear. Why is it doing this? I'm not exactly sure how to make this work. Is there another function that is better suited for my needs?

A friend and I were trying to do this a while back for a project, and found that writing an inline method to do this in ActionScript was incredibly slow. You have to scan each pixel and do a computation against it, but doing it with Pixel Bender proved to be lightning fast (if you can use Flash 10; otherwise you're stuck with slow AS).
The Pixel Bender code looks like:
input image4 src;
output float4 dst;
// How close of a match you want
parameter float threshold
<
    minValue: 0.0;
    maxValue: 1.0;
    defaultValue: 0.4;
>;
// Color you are matching against.
parameter float3 color
<
    defaultValue: float3(1.0, 1.0, 1.0);
>;
void evaluatePixel()
{
    float4 current = sampleNearest(src, outCoord());
    dst = float4((distance(current.rgb, color) < threshold) ? 0.0 : current);
}
If you need to do it in AS you can use something like:
function threshold(source:BitmapData, dest:BitmapData, color:uint, threshold:Number):void {
    dest.lock();
    var x:uint, y:uint;
    for (y = 0; y < source.height; y++) {
        for (x = 0; x < source.width; x++) {
            // getPixel32/setPixel32 carry the alpha channel; getPixel/setPixel only touch RGB
            var c1:uint = source.getPixel32(x, y);
            var c2:uint = color;
            var rx:uint = Math.abs(((c1 & 0xff0000) >> 16) - ((c2 & 0xff0000) >> 16));
            var gx:uint = Math.abs(((c1 & 0xff00) >> 8) - ((c2 & 0xff00) >> 8));
            var bx:uint = Math.abs((c1 & 0xff) - (c2 & 0xff));
            var dist:Number = Math.sqrt(rx*rx + gx*gx + bx*bx);
            if (dist <= threshold)
                dest.setPixel32(x, y, 0x00ffffff);
            else
                dest.setPixel32(x, y, c1);
        }
    }
    dest.unlock();
}

You can actually do it without Pixel Bender, and in real time, thanks to the built-in threshold function:
// Creates a new transparent BitmapData (in case the source is opaque)
var dest:BitmapData = new BitmapData(source.width,source.height,true,0x00000000);
// Copies the source pixels onto it
dest.draw(source);
// Replaces all the pixels greater than 0xf1f1f1 by transparent pixels
dest.threshold(source, source.rect, new Point(), ">", 0xfff1f1f1,0x00000000);
// And here you go ...
addChild(new Bitmap(dest));

It looks like the above code would make a range of colors transparent.
Pseudo-code:
for each pixel in targetBitmapData
    if pixel's color is >= #FFF7F0F2
        change color to #00FFFFFF
Something like this will never be perfect, because you will lose any light colors.
I would find an online color picker that you can use to see exactly which colors will be altered.
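The disappearing reds and yellows are consistent with the threshold test comparing the whole masked 32-bit pixel as a single unsigned number, so a high red channel alone is enough to pass. The sketch below, in C rather than ActionScript, contrasts that with a per-channel test. Both function names are hypothetical, and the single-integer comparison is an assumption about how the threshold call behaves, made because it matches the symptoms described.

```c
#include <stdbool.h>
#include <stdint.h>

/* Whole-pixel test: one ">=" comparison on the packed ARGB value. */
static bool passes_packed_threshold(uint32_t argb, uint32_t threshold)
{
    return argb >= threshold;
}

/* Per-channel test: every colour channel must reach its own minimum. */
static bool is_whitish(uint32_t argb, uint8_t min_r, uint8_t min_g, uint8_t min_b)
{
    uint8_t r = (argb >> 16) & 0xFF;
    uint8_t g = (argb >> 8) & 0xFF;
    uint8_t b = argb & 0xFF;
    return r >= min_r && g >= min_g && b >= min_b;
}
```

Pure red (0xFFFF0000) and pure yellow (0xFFFFFF00) are both numerically greater than 0xFFF7F0F2, so the packed test accepts them, while the per-channel test rejects them because their blue (and for red, green) channel is 0. The distance-based approaches in the other answers are a per-channel test of this kind.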

This line of the Pixel Bender code in the first answer:
dst = float4((distance(current.rgb, color) < threshold) ? 0.0 : current);
should be:
dst = (distance(current.rgb, color) < threshold) ? float4(0.0) : current;
or
if (distance(current.rgb, color) < threshold)
    dst = float4(0.0);
else
    dst = float4(current);
