OpenCL result changes depending on result of printf? What? - vector

OpenCL kernel crunches some numbers. This particular kernel then searches an array of 8 bit char4 vectors for a matching string of numbers. For example, array holds 3 67 8 2 56 1 3 7 8 2 0 2 - the kernel loops over that (actual string is 1024 digits long) and searches for 1 3 7 8 2 and "returns" data letting the host program know it found a match.
As a combined learning exercise and programming experiment, I wanted to see if I could loop over an array and search for a range of values, where the array holds not just char values but char4 vectors, WITHOUT using a single if statement in the kernel. Two reasons:
1: After half an hour of getting compile errors I realized that you cannot do:
if(charvector[3] == searchvector[0])
Because some may match and some may not. And 2:
I'm new to OpenCL and I've read a lot about how branches can hurt a kernel's speed, and if I understand the internals of kernels correctly, some math may actually be faster than if statements. Is that the case?
Anyway... first, the kernel in question:
__kernel void search(__global uchar4 *rollsrc, __global uchar *srch, char srchlen)
{
size_t gx = get_global_id(0);
size_t wx = get_local_id(0);
__private uint base = 0;
__local uchar4 queue[8092];
__private uint chunk = 8092 / get_local_size(0);
__private uint ctr, start, overlap = srchlen-1;
__private int4 srchpos = 0, srchtest = 0;
uchar4 searchfor;
event_t e;
start = max((int)((get_group_id(0)*32768) - overlap), 0);
barrier(CLK_LOCAL_MEM_FENCE);
e = async_work_group_copy(queue, rollsrc+start, 8092, 0);
wait_group_events(1, &e);
for(ctr = 0; ctr < chunk+overlap; ctr++) {
base = min((uint)((get_group_id(0) * chunk) + ctr), (uint)((N*32768)-1));
searchfor.x = srch[max(srchpos.x, 0)];
searchfor.y = srch[max(srchpos.y, 0)];
searchfor.z = srch[max(srchpos.z, 0)];
searchfor.w = srch[max(srchpos.w, 0)];
srchpos += max((convert_int4(abs_diff(queue[base], searchfor))*-100), -100) | 1;
srchpos = max(srchpos, 0);
srchtest = clamp(srchpos-(srchlen-1), 0, 1) << 31;
srch[0] |= (any(srchtest) * 255);
// if(get_group_id(0) == 0 && get_local_id(0) == 0)
// printf("%u: %v4u %v4u\n", ctr, srchpos, srchtest);
}
barrier(CLK_LOCAL_MEM_FENCE);
}
There's extra unneeded code in there; this was copied from a previous kernel, and I haven't cleaned up the extra junk yet. That being said, in short and in English, here is how the math-based if statement works:
Since I need to search for a range, and I'm searching a vector, I first set a char4 vector (searchfor) to have elements xyzw individually set to the number I am searching for. It's done individually because each of xyz and w hold a different stream, and the search counter - how many matches in a row we've had - will be different for each of the members of the vector. I'm sure there's a better way to do it than what I did. Suggestions?
So then, an int4 vector, srchpos, which holds the current position in the search array for each of the 4 vector positions, gets this added to it:
max((convert_int4(abs_diff(queue[base], searchfor))*-100), -100) | 1;
What this does: Take the ABS difference between the current location in the target queue (queue) and the searchfor vector set in the previous 4 lines. A vector is returned where each member will have either a positive number (not a match) or zero (a match - no difference).
It's converted to int4 (as uchar cannot be negative), then multiplied by -100, then run through max(x,-100). Now each element of the vector is either -100 or 0. We OR it with 1 and now it's -99 or 1.
End result: srchpos either increments by 1 (a match), or is reduced by 99, resetting any previous partial match increments. (Searches can be up to 96 characters long - there exists a chance to match 91, then miss, so it has to be able to wipe that all out). It is then max'ed with 0 so any negative result is clamped to zero. Again - open to suggestions to make that more efficient. I realized as I was writing this I could probably use addition with saturation to remove some of the max statements.
The last part takes the current srchpos, which now equals the number of consecutive matches, subtracts 1 less than the length of the search string, then clamps it to 0-1, thus ending up with either a 1 (a full match) or 0. We bit shift this << 31. The result is 0 or 0x80000000. Put this into srchtest.
Lastly, we bitwise OR the first character of the search string with the result of any(srchtest) * 255 - it's one of the few ways (I'm aware of) to test something across a vector and return a single integer from it. (any() returns 1 if any member of the vector has its MSB set - which we set in the line above)
End result? srch[0] is unchanged, or, in the case of a match, it's set to 0xff. When the kernel returns, the host can read back srch from the buffer. If the first character is 0xff, we found a match.
It probably has too many steps and can be cleaned up. It also may be less efficient than just doing 4 if checks per loop. Not sure.
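To make the arithmetic above easier to check, here is a minimal scalar sketch in plain C that emulates one lane of the vector on the host. It is not the kernel itself: the helper names imax, imin, step and find_run are made up for illustration, the ternaries stand in for OpenCL's max() and clamp(), and the reset of the counter after a full match is an addition of this sketch (it keeps needle[pos] in range).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-ins for OpenCL's max()/min(); hypothetical helper names. */
static int imax(int a, int b) { return a > b ? a : b; }
static int imin(int a, int b) { return a < b ? a : b; }

/* Advance the run counter for one lane by one input byte.
   A match adds 1; a mismatch subtracts 99 and is then clamped to 0. */
static int step(int pos, unsigned char hay, unsigned char want)
{
    int diff  = abs((int)hay - (int)want);     /* 0 on a match, > 0 otherwise */
    int delta = imax(diff * -100, -100) | 1;   /* 0|1 = 1, or -100|1 = -99 */
    return imax(pos + delta, 0);
}

/* Returns 1 if needle[0..len-1] occurs as a consecutive run in hay. */
static int find_run(const unsigned char *hay, size_t n,
                    const unsigned char *needle, int len)
{
    assert(len > 0 && len <= 96);              /* 96 is the limit stated in the post */
    int pos = 0, found = 0;
    for (size_t i = 0; i < n; i++) {
        pos = step(pos, hay[i], needle[pos]);
        int full = imin(imax(pos - (len - 1), 0), 1);  /* clamp to 0..1 */
        found |= full;
        pos *= 1 - full;   /* sketch-only: restart after a full match */
    }
    return found;
}
```

With the example data from the top of the post, find_run over 3 67 8 2 56 1 3 7 8 2 0 2 with the needle 1 3 7 8 2 returns 1. Like the kernel, this restarts from zero on a mismatch without re-testing the current byte against the first search character.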
But, after this massive post, the thing that has me pulling my hair out:
When I UNCOMMENT the two lines at the end that prints debug information, the script works. This is the end of the output on my terminal window as I run it:
36: 0,0,0,0 0,0,0,0
37: 0,0,0,0 0,0,0,0
38: 0,0,0,0 0,0,0,0
39: 0,0,0,0 0,0,0,0
Search = 613.384 ms
Positive
Done read loop: -1 27 41
Positive means the string was found. The -1 27 41 is the first 3 characters of the search string, the first being set to -1 (signed char on the host side).
Here's what happens when I comment out the printf debugging info:
Search = 0.150 ms
Negative
Done read loop: 55 27 41
IT DOES NOT FIND IT. What?! How is that possible? Of course, I notice that the script execution time jumps from .15ms to 600+ms because of the printf, so I think, maybe it's somehow returning and reading the data BEFORE the script ends, and the extra delay from the printf gives it a pause. So I add a barrier(CLK_LOCAL_MEM_FENCE); to the end, thinking that will make sure all threads are done before returning. Nope. No effect. I then add in a 2 second sleep on the host side, after running the kernel, after running clFinish, and before running clEnqueueReadBuffer.
NOPE! Still Negative. But I put the printf back in - and it works. How is that possible? Why? Does anyone have any idea? This is the first time I've had a programming bug that baffled me to the point of pulling hair out, because it makes absolutely zero sense. The work items are not clashing, they each read their own block, and even have an overlap in case the search string is split across two work item blocks.
Please - save my hair - how can a printf of irrelevant data cause this to work and removing it causes it to not?
Oh - one last fun thing: If I remove the parameters from the printf - just have it print text like "grr please work" - the kernel returns a negative, AND, nothing prints out. The printf is ignored.
What the heck is going on? Thanks for reading, I know this was absurdly long.

For anyone referencing this question in the future, the issue was caused by my arrays being read out of bounds. When that happens, all heck breaks loose and all results are unpredictable.
Once I fixed the work and group size and made sure I was not exceeding the memory bounds, it worked as expected.
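The answer does not say which access went out of bounds, so the following is only a sketch of one way to check the launch configuration on the host, written in plain C. It assumes the index expression from the kernel above (group id times chunk, plus the loop counter) and ignores the min() cap, because N is not shown in the post. The function names are hypothetical.

```c
#include <assert.h>

enum { QUEUE_LEN = 8092 };   /* size of the __local queue in the kernel */

/* Largest index the loop would use into queue[] for a given work-group,
   assuming base = group_id * chunk + ctr with ctr < chunk + overlap. */
static unsigned max_queue_index(unsigned group_id, unsigned local_size,
                                unsigned srchlen)
{
    assert(local_size > 0 && srchlen > 0);
    unsigned chunk   = QUEUE_LEN / local_size;
    unsigned overlap = srchlen - 1;
    return group_id * chunk + (chunk + overlap - 1);
}

/* Returns 1 if every access stays inside queue[], 0 otherwise. */
static int queue_access_in_bounds(unsigned group_id, unsigned local_size,
                                  unsigned srchlen)
{
    return max_queue_index(group_id, local_size, srchlen) < QUEUE_LEN;
}
```

For example, with a local size of 64 and a search length of 5, group 0 stays in range under this model, while group 64 would index past the end of the 8092-element queue.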

Related

Unable to understand the recursion tree for the problem, printing all permutations of a given string

This is with respect to the problem, printing all permutations of a given string ABC. As far as I have understood, the program first prints ABC, then the string gets reset to ABC (which is the same as the first permutation of the string in this case) through backtracking, and then when i=1 and j=2, letters B and C are swapped, following which i gets incremented to 2 and is passed to the permute function where the second permutation ACB is printed on the screen. Following this, backtracking again takes place and we get back ABC.
I'm unable to understand how the next permutation BAC gets printed. For this the value of i would have to be reset to 0 (previously i=2 when we print out ACB) and the value of j should equal 1. (previously j=2 when the swapping of B and C takes place). How does i=0 and j=1, when there is no line of code in this program that does the same?
P.S.: The code works fine, but I'm unable to understand why it works the way it does. If anyone could trace the code step by step for each permutation it prints on the screen, that would be really helpful!
Here is the code:
void permute(string s, int i=0){
if(i==s.length()-1){
cout<<s<<endl;
return;
}
for(int j=i; j<s.length(); j++){
swap(s[i], s[j]);
permute(s, i+1);
swap(s[i], s[j]);
}
}
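No line resets i and j because nothing needs to: every call to permute has its own copies of i and j. When the call with i=1 returns, control goes back to the call with i=0, whose for loop is still in progress, and that loop simply moves on to j=1 and swaps A with B. A small C sketch of the same algorithm makes the order visible. It swaps in place and appends each result to a caller-supplied buffer instead of printing; the out parameter and the swap_chars helper are assumptions of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static void swap_chars(char *a, char *b) { char t = *a; *a = *b; *b = t; }

/* Appends every permutation of s (positions before i held fixed) to out,
   each followed by a space. out must be large enough for all results. */
static void permute(char *s, size_t i, char *out)
{
    size_t n = strlen(s);
    assert(n > 0);
    if (i == n - 1) {             /* only one position left: record s */
        strcat(out, s);
        strcat(out, " ");
        return;
    }
    for (size_t j = i; j < n; j++) {
        swap_chars(&s[i], &s[j]);
        permute(s, i + 1, out);   /* the callee has its own i and j */
        swap_chars(&s[i], &s[j]); /* backtrack: restore s for the next j */
    }
}
```

Called with the string ABC, i = 0 and an empty buffer, this produces ABC ACB BAC BCA CBA CAB in that order. BAC appears third because the i=0 frame resumes its loop at j=1 once the i=1 frame has returned.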

Using vloadn (opencl) to load unallocated memory

I am using vloadn to load data, passing the range I want to read as a parameter, and it works, but I am wondering what the behavior of vload4 is here: might this cause some unexpected issue, or am I perfectly safe to do this? An example might be something like this:
__kernel void myKernel(__global float* data_ptr, int size)
{
float4 vec = vload4(0, data_ptr);
float sum = 0.f;
// data_ptr points to an array of 2 floats in global mem
if (size == 2) {
sum += vec.s1;
sum += vec.s0;
}
else if (size == 1) {
sum += vec.s0;
}
}
data_ptr is an array of 2 floats in global memory, but even though I am accessing only those 2 floats, I am loading 4 floats using vload4. The reason I am asking is that I want to use a single vloadn and decide afterwards how much of it I actually want to use, rather than choosing the vloadn variant based on size (e.g. for size==4 use vload4, for size==8 vload8, etc.).
If it's still within data_ptr it will be fine; you don't have to use all the data you read. However, if you read past either end of the buffer that data_ptr points to, you can have problems (a memory read exception, for example, or some other device-dependent error). Note: check the address alignment requirements for vload to see whether you're allowed to read at any address or only at multiples of the vector width.
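One way to keep a single four-wide load path without reading past the end of the buffer is to copy only the valid elements and pad the rest. The sketch below shows the idea in plain C on the host rather than with the OpenCL vload4 call; load4_padded is a made-up helper name. Another option, under the same assumption that the padding value does not affect the result, is to allocate the buffer rounded up to a multiple of 4 floats.

```c
#include <assert.h>
#include <stddef.h>

/* Copies up to 4 floats from src, which holds `size` valid elements,
   into dst and fills the remaining slots with 0.0f. It never reads
   src[k] for k >= size. */
static void load4_padded(const float *src, size_t size, float dst[4])
{
    assert(src != NULL || size == 0);
    for (size_t k = 0; k < 4; k++)
        dst[k] = (k < size) ? src[k] : 0.0f;
}
```

With a two-element source, dst ends up holding the two real values followed by two zeros, so summing all four slots gives the same result as summing only the valid ones.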

N-Queens example program strange output

I try the code from the squeen.icl example. When I try it with BoardSize :== 11, there is no problem. But when I change it to 12, the output is [. Why? How to fix that?
module squeen
import StdEnv
BoardSize :== 12
Queens::Int [Int] [[Int]] -> [[Int]]
Queens row board boards
| row>BoardSize = [board : boards]
| otherwise = TryCols BoardSize row board boards
TryCols::Int Int [Int] [[Int]] -> [[Int]]
TryCols 0 row board boards = boards
TryCols col row board boards
| Save col 1 board = TryCols (col-1) row board queens
| otherwise = TryCols (col-1) row board boards
where queens = Queens (row+1) [col : board] boards
Save::!Int !Int [Int] -> Bool
Save c1 rdiff [] = True
Save c1 rdiff [c2:cols]
| cdiff==0 || cdiff==rdiff || cdiff==0-rdiff = False
| otherwise = Save c1 (rdiff+1) cols
where cdiff = c1 - c2
Start::(Int,[Int])
Start = (length solutions, hd solutions)
where solutions = Queens 1 [] []
This is because you're running out of space on the heap. By default, the heap of Clean programs is set to 2M. You can change this, of course. When using clm from the command line, you can add -h 4M to its command line or to the command line of the clean program itself. If you're using the Clean IDE, you can change the heap size through Project Options, Application.
The reason that ( is still printed (which is what I get, rather than [), is the following. A Clean program will output as much of its output as possible, rather than waiting until the whole output is known. This means, for example, that a simple line such as Start = [0..] will spam your terminal, not wait until the whole infinite list is in memory and then print it. In the case of squeen.icl, Clean sees that the result of Start will be a tuple, and therefore prints the opening parenthesis directly. However, when trying to compute the elements of the tuple (length solutions and hd solutions), the heap fills up, making the program terminate.
I don't know what it looks like when you get a full heap on Windows, but on Linux(/Mac), it looks like this:
$ clm squeen -o squeen && ./squeen -h 2M
Linking squeen
Heap full.
Execution: 0.13 Garbage collection: 0.03 Total: 0.16
($
Note that the tuple's opening parenthesis is on the last line. So, when using a terminal it is quite easy to spot this error.
Interestingly, since length exploits tail recursion, the first element of the tuple can be computed, even with a small heap (you can try this by replacing the second element with []). Also the second element of the tuple can be computed on a small heap (replace the first element with 0).
The point is that the length is computed before the head, since it has to be printed first. With a normal length call, parts of the list are garbage collected as it goes (after iterating over the first 100 elements, they can be discarded, allowing for smaller heap usage), but the pending hd call makes sure that the first element of the list is not discarded. If the first element is not discarded, then neither can the second be, nor the third, etc. Hence, the whole list is kept in memory, even though this is not actually necessary. Flipping the length and hd calls solves the issue:
Start :: ([Int], Int)
Start = (hd solutions, length solutions)
where solutions = Queens 1 [] []
Now, after hd has been called, there is no reason to keep the whole list in memory, so length can discard elements it has iterated over, and the heap doesn't fill up.

Exit early on found in OpenCL

I'm trying to write an OpenCL implementation of memchr to help me learn how OpenCL works. What I'm planning to do is to assign each work item a chunk of memory to search. Then, inside each work item, it loops through the chunk searching for the character.
Especially if the buffer is large, I don't want the other threads to keep searching after an occurrence has already been found (assume there is only one occurrence of the character in any given buffer).
What I'm stuck on is how does a work item indicate, both to the host and other threads, when it has found the character?
Thanks,
One way you could do this is to use a global flag variable. You atomically set it to 1 when you find the value and other threads will check on that value when they are doing work.
For example:
__kernel void test(__global int* buffer, __global volatile int* flag)
{
int tid = get_global_id(0);
int sx = get_global_size(0);
int i = tid;
while(buffer[i] != 8) //Whatever value we're trying to find.
{
int stop = atomic_add(flag, 0); //Read the atomic value (flag is already a pointer)
if(stop)
break;
i = i + sx;
}
atomic_xchg(flag, 1); //Set the atomic value
}
This might add more overhead than by just running the whole kernel (unless you are doing a lot of work on every iteration). In addition, this method won't work if each thread is just checking a single value in the array. Each thread must have multiple iterations of work.
Finally, I've seen instances where writing to an atomic variable doesn't immediately commit, so you need to check to see if this code will deadlock on your system because the write isn't committing.
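The same flag pattern can be tried on the host with C11 atomics. The sketch below is a plain C analogue, not OpenCL: scan_slice is a hypothetical name, each call plays the role of one work item scanning a strided slice, and unlike the kernel above it also stops at the end of the buffer and only sets the flag when it actually finds the target.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Scans buf[tid], buf[tid + stride], ... for target.
   Stops early if another caller has already set *flag.
   Returns how many elements this caller examined. */
static size_t scan_slice(const int *buf, size_t n, size_t tid, size_t stride,
                         int target, atomic_int *flag)
{
    assert(stride > 0);
    size_t examined = 0;
    for (size_t i = tid; i < n; i += stride) {
        if (atomic_load(flag))        /* someone else found it already */
            break;
        examined++;
        if (buf[i] == target) {
            atomic_store(flag, 1);    /* tell everyone to stop */
            break;
        }
    }
    return examined;
}
```

Run sequentially with two slices over a buffer whose only 8 sits at index 3, the slice starting at index 1 examines two elements and sets the flag, and a slice started afterwards returns immediately without examining anything. The host can read the same flag after the kernel finishes to learn whether a match was found.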

Pointer Trouble

I was trying some basic pointer manipulation and have an issue I would like clarified. Here is the code snippet I am referring to:
int arr[3] = {0};
*(arr+0) = 12;
*(arr+1) = 24;
*(arr+2) = 74;
*(arr+3) = 55;
cout<<*(arr+3)<<"\t"<<(long)(arr+3)<<endl;
//cout<<"Address of array arr : "<<arr<<endl;
cout<<(long)(arr+0)<<"\t"<<(long)(arr+1)<<"\t"<<(long)(arr+2)<<endl;
for(int i=0;i<4;i++)
cout<<*(arr+i)<<"\t"<<i<<"\t"<<(long)(arr+i)<<endl;
//*(arr+3) = 55;
cout<<*(arr+3)<<endl<<endl;
My problem is:
When I try to access arr+3 outside the for loop, I get the desired value 55 printed. But when I access it inside the for loop, I get a different value (3 in this case). After the for loop, it prints the value as 4. Could someone explain to me what is happening? Thanks in advance.
You have created an array of size 3 and you are trying to access the 4th element. The outcome is therefore undefined.
Since the array is allocated on the stack, the first time you write the 4th element you are actually writing beyond the space that was allocated for the array. In Debug mode this may appear to work, but in Release your program will probably crash.
The second time you are reading the value at the 4th place you are reading the value 4. This makes sense, as the compiler has allocated the stack space after the array for variable i, which after the loop has finished executing will have the value 4.
As the array has been defined with 3 elements, the data is stored sequentially as 12, 24, 74. When you assign 55 to a 4th element, it is written to memory just past the array, which the array does not own. The first read happens to print it correctly, but that memory can be reused for something else (here, the loop variable i), so later reads print an unrelated value.
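A simple way to avoid this class of bug is to derive the element count from the array itself and refuse writes that do not fit. The following is a minimal sketch in C; the ARRAY_LEN macro and the fill_checked function are illustrative names, not part of the original snippet.

```c
#include <assert.h>
#include <stddef.h>

/* Element count of a true array (not a pointer). */
#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* Writes vals[0..n-1] into arr only if they fit in cap elements.
   Returns 1 on success and 0 if the write would go out of bounds. */
static int fill_checked(int *arr, size_t cap, const int *vals, size_t n)
{
    assert(arr != NULL && vals != NULL);
    if (n > cap)
        return 0;
    for (size_t i = 0; i < n; i++)
        *(arr + i) = vals[i];
    return 1;
}
```

Passing a 3-element array together with the four values 12, 24, 74, 55 is rejected, while the same call with an array declared as int arr[4] succeeds and *(arr+3) reliably reads back 55.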

Resources