OpenCL Programm freezes everytime private variables used - opencl

I got a kernel method, simplified looks like this.
__kernel void calculate (__global float *a, __global float *b, __global int *res) {
int workItem = get_global_id(0); // Syntax may not right, but you get the idea
int found = 0;
for (int i = 0; i<100000000; i++) {
float c = a[i]*3;
float d = b[i]*2;
if (c<d) {
found++;
}
}
res[workItem]= found;
}
So, nothing much but simple calculation and a very big loop, the problem is the programm freezes all the time when i run this code. I have to force reset the computer everytime this happens.
But if i make some changes, like this
if (true) {
found++;
}
Or
if (1<2) {
found++
}
Then the programm works like a charm, and very fast! So i wonder if is there any thing wrong with variables c and d ? I tried to use things like
__private float c= ..;
__private float d= ..;
It didnt work either.
I read the return code of every step while creating programm and kernel, so its not the problem.
What did I do wrong here ?

You have several issues here.
You are running a giant loop, presumably with a single work item, which is exactly what you want to avoid in OpenCL, as the entire concept of parallelism is lost. OpenCL will not magically make your code fast if you don't use it correctly.
Even if you are using multiple work items, you are not making use of the value you retrieve from get_global_id() except for output. But, as they're all using the same input, you'll get the same output for every single work item!
Work items and their associated global IDs are intended to allow you to partition your processing into discrete units, rather than one big monolithic loop. I suggest you look at some tutorials like this one to understand the concept better. Don't start writing your own code until you understand his.
As for why your program freezes your PC, I can only speculate without seeing your host code. Perhaps you are getting buffer overrun?

Related

Using vloadn (opencl) to load unallocated memory

I am using vloadn to load data and as a parameter I pass the range I want to read and it works, but I am wondering what's the behavior of vload4. If this might cause some unexpected issue or I am perfectly safe to do this. An example might be something like this:
__kernel void myKernel(__global float* data_ptr, int size)
{
float4 vec = vload4(0, data_ptr);
float sum = 0.f;
// data_ptr points to an array of 2 floats in global mem
if (size == 2) {
sum += vec.s1;
sum += vec.s0;
}
else if (size == 1) {
sum += vec.s0;
}
}
data_ptr is an array of 2 floats in global memory, but even though I am accessing only those 2 floats, I am loading 4 floats using vload4. The reason I am asking is that I want to use a single vloadn and decide afterwards how much of it I actually want to use and not to use vloadn based on size (e.g. for size==4 use vload4, for size==8 vload8 etc.
If it's still within data_ptr it will be fine; you don't have to use all the data you read. However, if you read past either end of the buffer that data_ptr points to you can have problems (memory read exception, for example, or some other device-dependent error). Note: Check the address alignment requirements for vload to see if you're allowed to read at any address or only multiple of with size.

OpenCL Image reading not working on dynamically allocated array

I'm writing an OpenCL program that applies a convolution matrix on an image. Everything works fine if I store all pixel on an array image[height*width][4] (line 65,commented) (sorry, I speak Spanish, and I code mostly in Spanish). But, since the images I'm working with are really large, I need to allocate the memory dynamically. I execute the code, and I get a Segmentation fault error.
After some poor man's debugging, I found out the problem arises after executing the kernel and reading the output image back into the host, storing the data into the dynamically allocated array. I just can't access the data of the array without getting the error.
I think the problem is the way the clEnqueueReadImage function (line 316) writes the image data into the image array. This array was allocated dynamically, so it has no predefined "structure".
But I need a solution, and I can't find it, nor on my own or on Internet.
The C program and the OpenCL kernel are here:
https://gist.github.com/MigAGH/6dd0fddfa09f5aabe7eb0c2934e58cbe
Don't use pointers to pointers (unsigned char**). Use a regular pointer instead:
unsigned char* image = (unsigned char*)malloc(sizeof(unsigned char)*ancho*alto*4);
Then in the for loop:
for(i=0; i<ancho*alto; i++){
unsigned char* pixel = (unsigned char*)malloc(sizeof(unsigned char)*4);
fread (pixel, 4, 1, bmp_entrada);
image[i*4] = pixel[0];
image[i*4+1] = pixel[1];
image[i*4+2] = pixel[2];
image[i*4+3] = pixel[3];
free(pixel);
}

Exit early on found in OpenCL

I'm trying to write an OpenCL implementation of memchr to help me learn how OpenCL works. What I'm planning to do is to assign each work item a chunk of memory to search. Then, inside each work item, it loops through the chunk searching for the character.
Especially if the buffer is large, I don't want the other threads to keep searching after an occurrence has already been found (assume there is only one occurrence of the character in any given buffer).
What I'm stuck on is how does a work item indicate, both to the host and other threads, when it has found the character?
Thanks,
One way you could do this is to use a global flag variable. You atomically set it to 1 when you find the value and other threads will check on that value when they are doing work.
For example:
__kernel test(__global int* buffer, __global volatile int* flag)
{
int tid = get_global_id(0);
int sx = get_global_size(0);
int i = tid;
while(buffer[i] != 8) //Whatever value we're trying to find.
{
int stop = atomic_add(&flag, 0); //Read the atomic value
if(stop)
break;
i = i + sx;
}
atomic_xchg(&flag, 1); //Set the atomic value
}
This might add more overhead than by just running the whole kernel (unless you are doing a lot of work on every iteration). In addition, this method won't work if each thread is just checking a single value in the array. Each thread must have multiple iterations of work.
Finally, I've seen instances where writing to an atomic variable doesn't immediately commit, so you need to check to see if this code will deadlock on your system because the write isn't committing.

Global variable touched by a passed-in parameter becomes unusable

folks!
I pass a struct full of data to my kernel, and I run into the following difficulty using it (very stripped down):
[edit: mac osx / xcode 3.2 on mac book pro; this compile is obviously for cpu]
typedef struct
{
float xoom;
int sizex;
} varholder;
float zX, xd;
__kernel void Harlan( __global varholder * vh )
{
int X = get_global_id(0), Y = get_global_id(1);
zX = ( ( X - vh->sizex/2 ) / vh->xoom + vh->sizex/2 ); // (a)
xd = zX; // (b) BOOM!!
}
after executing line (a), the line marked (b), a simple assignment, gives "LLVM compiler failed to compile a function".
if, however, we do not execute line (a), then line (b) is fine.
So, through my fiddling around a LOT with this, it seems as if it is the assignment statement (a), which uses a passed-in parameter, that messes up the future access of the variable zX. However, of course I need to be able to use the results of calculations further down the line.
I have zX and xd declared at the file level because my helper functions need them.
Any thoughts?
Thanks!
David
p.s. I'm now registered so will be able to upvote and accept answers, which I am sadly unable to do for the last person who helped me (used same username to register, but can't seem to vote on the old post; sorry!).
No, say it ain't so!
I am sincerely hoping that this is not a "correct" answer to my own question. I found on another forum (though not the same question asked!) the following, and I am afraid that it refers to what I'm trying to do:
(quote)
You're doing something the standard prohibits. Section 6.5 says:
'All program scope variables must be declared in the __constant address space.'
In other words, program scope variables cannot be mutable.
(end quote)
... well, tcha!!!! What an astoundingly inconvenient restriction! I'm sure there's reasoning behind it.
[edit: Not At All inconvenient! it was in fact astonishingly easy to work around, given a fresh start the next morning. (And no alcohol.)]
You guys & dolls all knew this, right, and didn't have the heart to tell me?...

How to reverse a QList?

I see qCopy, and qCopybackward but neither seems to let me make a copy in reverse order. qCopybackward only copies it in reverse order, but keeps the darn elements in the same order! All I want to do is return a copy of the list in reverse order. There has to be a function for that, right?
If you don't like the QTL, just use the STL. They might not have a Qt-ish API, but the STL API is rock-stable :) That said, qCopyBackward is just std::copy_backward, so at least they're consistent.
Answering your question:
template <typename T>
QList<T> reversed( const QList<T> & in ) {
QList<T> result;
result.reserve( in.size() ); // reserve is new in Qt 4.7
std::reverse_copy( in.begin(), in.end(), std::back_inserter( result ) );
return result;
}
EDIT 2015-07-21: Obviously (or maybe not), if you want a one-liner (and people seem to prefer that, looking at the relative upvotes of different answers after five years) and you have a non-const list the above collapses to
std::reverse(list.begin(), list.end());
But I guess the index fiddling stuff is better for job security :)
Reverse your QList with a single line:
for(int k = 0; k < (list.size()/2); k++) list.swap(k,list.size()-(1+k));
[Rewrite from original]
It's not clear if OP wants to know "How [do I] reverse a QList?" or actually wants a reversed copy. User mmutz gave the correct answer for a reversed copy, but if you just want to reverse the QList in place, there's this:
#include <algorithm>
And then
std::reverse(list.begin(), list.end());
Or in C++11:
std::reverse(std::begin(list), std::end(list));
The beauty of the C++ standard library (and templates in general) is that the algorithms and containers are separate. At first it may seem annoying that the standard containers (and to a lesser extent the Qt containers) don't have convenience functions like list.reverse(), but consider the alternatives: Which is more elegant: Provide reverse() methods for all containers, or define a standard interface for all containers that allow bidirectional iteration and provide one reverse() implementation that works for all containers that support bidirectional iteration?
To illustrate why this is an elegant approach, consider the answers to some similar questions:
"How do you reverse a std::vector<int>?":
std::reverse(std::begin(vec), std::end(vec));
"How do you reverse a std::deque<int>?":
std::reverse(std::begin(deq), std::end(deq));
What about portions of the container?
"How do you reverse the first seven elements of a QList?": Even if the QList authors had given us a convenience .reverse() method, they probably wouldn't have given us this functionality, but here it is:
if (list.size() >= 7) {
std::reverse(std::begin(list), std::next(std::begin(list), 7));
}
But it gets better: Because the iterator interface is the same as C pointer syntax, and because C++11 added the free std::begin() and std::end functions, you can do these:
"How do you reverse an array float x[10]?":
std::reverse(std::begin(x), std::end(x));
or pre C++11:
std::reverse(x, x + sizeof(x) / sizeof(x[0]));
(That is the ugliness that std::end() hides for us.)
Let's go on:
"How do you reverse a buffer float* x of size n?":
std::reverse(x, x + n);
"How do you reverse a null-terminated string char* s?":
std::reverse(s, s + strlen(s));
"How do you reverse a not-necessarily-null-terminated string char* s in a buffer of size n?":
std::reverse(s, std::find(s, s + n, '\0'));
Note that std::reverse uses swap() so even this will perform pretty much as well as it possibly could:
QList<QList<int> > bigListOfBigLists;
....
std::reverse(std::begin(bigListOfBigLists), std::end(bigListOfBigLists));
Also note that these should all perform as well as a hand-written loop since, when possible, the compiler will turn these into pointer arithmetic. Also, you can't cleanly write a reusable, generic, high-performance reverse function like this C.
#Marc Jentsch's answer is good. And if you want to get an additional 30% performance boost you can change his one-liner to:
for(int k=0, s=list.size(), max=(s/2); k<max; k++) list.swap(k,s-(1+k));
One a ThinkPad W520 with a QList of 10 million QTimers I got these numbers:
reversing list stack overflow took 194 ms
reversing list stack overflow with max and size took 136 ms
The boost is a result of
the expression (list.size()/2) being calculated only once when initializing the loop and not after every step
the expression list.size() in swap() is called only once when initializing the loop and not after every step
You can use the Java style iterator. Complete example here (http://doc.qt.digia.com/3.2/collection.html). Look for the word "reverse".
QList<int> list; // initial list
list << 1;
list << 2;
list << 3;
QList<int> rlist; // reverse list+
QListIterator<int> it(list);
while (it.hasPrevious()) {
rlist << it.previous();
}
Reversing a QList is going to be O(n) however you do it, since QList isn't guaranteed to have its data stored contiguously in memory (unlike QVector). You might consider just traversing the list in backwards order where you need to, or use something like a QStack which lets you retrieve the elements in the opposite order they were added.
For standard library lists it would look like this
std::list<X> result;
std::copy(list.rbegin(), list.rend(), std::back_inserter(result));
Unfortunately, Qt doesn't have rbegin and rend functions that return reverse iterators (the ones that go from the end of the container to its begnning). You may write them, or you can just write copying function on your own -- reversing a list is a nice excersize. Or you can note that QList is actually an array, what makes writing such a function trivial. Or you can convert the list to std::list, and use rbegin and rend. Choose whatever you like.
As of Qt 5.14 (circa 2020), QList provides a constructor that takes iterators, so you can just construct a reversed copy of the list with the reverse iterators of the source:
QList<int> backwards(forwards.rbegin(), forwards.rend());
Or if you want to be able to inline it, more generically (replace QList<I> with just I if you want to be super duper generic):
template <typename I> QList<I> reversed (const QList<I> &forwards) {
return QList<I>(forwards.rbegin(), forwards.rend());
}
Which lets you do fun one-liners with temporaries like:
QString badDay = "reporter covers whale exploding";
QString worseDay = reversed(badDay.split(' ')).join(' ');

Resources