I'm writing an OpenCL program that applies a convolution matrix to an image. Everything works fine if I store all the pixels in an array image[height*width][4] (line 65, commented out) (sorry, I speak Spanish, and I code mostly in Spanish). But since the images I'm working with are really large, I need to allocate the memory dynamically. When I execute the code, I get a segmentation fault.
After some poor man's debugging, I found out the problem arises after executing the kernel and reading the output image back to the host, storing the data in the dynamically allocated array. I just can't access the data of the array without getting the error.
I think the problem is the way the clEnqueueReadImage function (line 316) writes the image data into the image array. This array was allocated dynamically, so it has no predefined "structure".
But I need a solution, and I can't find one, neither on my own nor on the Internet.
The C program and the OpenCL kernel are here:
https://gist.github.com/MigAGH/6dd0fddfa09f5aabe7eb0c2934e58cbe
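Don't use pointers to pointers (unsigned char**). clEnqueueReadImage writes the image into one contiguous block of host memory, which an array of row pointers does not give you. Use a regular pointer instead: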
unsigned char* image = (unsigned char*)malloc(sizeof(unsigned char)*ancho*alto*4);
Then in the for loop:
for(i=0; i<ancho*alto; i++){
    // read one 4-byte pixel straight into its slot in the flat array
    // (no temporary buffer or per-pixel malloc/free needed)
    fread(image + i*4, 4, 1, bmp_entrada);
}
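To make the flat layout concrete, here is a minimal sketch of allocating one contiguous RGBA buffer and addressing a pixel inside it. The helper names alloc_rgba and pixel_at are made up for illustration and are not part of the gist; the only assumption is 4 bytes per pixel in row-major order.

```c
#include <stdlib.h>
#include <stdint.h>

/* Allocate one contiguous, zero-initialized RGBA buffer (4 bytes per pixel).
   Returns NULL if width * height * 4 would overflow or the allocation fails.
   Hypothetical helper, not taken from the original program. */
unsigned char *alloc_rgba(size_t width, size_t height)
{
    if (width != 0 && height > SIZE_MAX / 4 / width)
        return NULL;
    return calloc(width * height, 4);
}

/* Pointer to the 4 channel bytes of pixel (x, y) in a flat row-major buffer.
   Hypothetical helper, not taken from the original program. */
unsigned char *pixel_at(unsigned char *image, size_t width, size_t x, size_t y)
{
    return image + (y * width + x) * 4;
}
```

A buffer like this is a single block, so the same pointer can be handed to any function that expects contiguous host memory for the whole image, and pixel_at(image, ancho, x, y)[c] plays the role that image[y*ancho + x][c] played with the static array.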
OpenCL kernel crunches some numbers. This particular kernel then searches an array of 8 bit char4 vectors for a matching string of numbers. For example, array holds 3 67 8 2 56 1 3 7 8 2 0 2 - the kernel loops over that (actual string is 1024 digits long) and searches for 1 3 7 8 2 and "returns" data letting the host program know it found a match.
As a combined learning exercise and programming experiment, I wanted to see if I could loop over an array and search for a range of values, where the array holds not just char values but char4 vectors, WITHOUT using a single if statement in the kernel. Two reasons:
1: After half an hour of getting compile errors I realized that you cannot do:
if(charvector[3] == searchvector[0])
Because some may match and some may not. And 2:
I'm new to OpenCL and I've read a lot about how branches can hurt a kernel's speed, and if I understand the internals of kernels correctly, some math may actually be faster than if statements. Is that the case?
Anyway... first, the kernel in question:
void search(__global uchar4 *rollsrc, __global uchar *srch, char srchlen)
{
    size_t gx = get_global_id(0);
    size_t wx = get_local_id(0);
    __private uint base = 0;
    __local uchar4 queue[8092];
    __private uint chunk = 8092 / get_local_size(0);
    __private uint ctr, start, overlap = srchlen-1;
    __private int4 srchpos = 0, srchtest = 0;
    uchar4 searchfor;
    event_t e;
    start = max((int)((get_group_id(0)*32768) - overlap), 0);
    barrier(CLK_LOCAL_MEM_FENCE);
    e = async_work_group_copy(queue, rollsrc+start, 8092, 0);
    wait_group_events(1, &e);
    for(ctr = 0; ctr < chunk+overlap; ctr++) {
        base = min((uint)((get_group_id(0) * chunk) + ctr), (uint)((N*32768)-1));
        searchfor.x = srch[max(srchpos.x, 0)];
        searchfor.y = srch[max(srchpos.y, 0)];
        searchfor.z = srch[max(srchpos.z, 0)];
        searchfor.w = srch[max(srchpos.w, 0)];
        srchpos += max((convert_int4(abs_diff(queue[base], searchfor))*-100), -100) | 1;
        srchpos = max(srchpos, 0);
        srchtest = clamp(srchpos-(srchlen-1), 0, 1) << 31;
        srch[0] |= (any(srchtest) * 255);
        // if(get_group_id(0) == 0 && get_local_id(0) == 0)
        //     printf("%u: %v4u %v4u\n", ctr, srchpos, srchtest);
    }
    barrier(CLK_LOCAL_MEM_FENCE);
}
There's extra unneeded code in there; this was copied from a previous kernel, and I haven't cleaned up the extra junk yet. That being said, in short and in English, here is how the math-based if statement works:
Since I need to search for a range, and I'm searching a vector, I first set a char4 vector (searchfor) to have elements xyzw individually set to the number I am searching for. It's done individually because each of xyz and w hold a different stream, and the search counter - how many matches in a row we've had - will be different for each of the members of the vector. I'm sure there's a better way to do it than what I did. Suggestions?
So then, an int4 vector, srchpos, which holds the current position in the search array for each of the 4 vector positions, gets this added to it:
max((convert_int4(abs_diff(queue[base], searchfor))*-100), -100) | 1;
What this does: Take the ABS difference between the current location in the target queue (queue) and the searchfor vector set in the previous 4 lines. A vector is returned where each member will have either a positive number (not a match) or zero (a match - no difference).
It's converted to int4 (as uchar cannot be negative), then multiplied by -100, then run through max(x,-100). Now each member of the vector is either -100 or 0. We OR it with 1 and now it's -99 or 1.
End result: srchpos either increments by 1 (a match), or is reduced by 99, resetting any previous partial match increments. (Searches can be up to 96 characters long, so there is a chance to match 91 and then miss, and the penalty has to be able to wipe all of that out.) It is then max'ed with 0 so any negative result is clamped to zero. Again - open to suggestions to make that more efficient. I realized as I was writing this I could probably use addition with saturation to remove some of the max statements.
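For anyone who wants to check the arithmetic on the host, here is a scalar sketch in plain C of what one lane of that update does. The function name next_match_count is made up for illustration. It assumes two's complement integers (so that -100 | 1 is -99), and it uses ordinary conditionals to stand in for the OpenCL abs_diff and max built-ins, so it models the values only, not the branch-free property.

```c
/* Scalar model of one lane of the counter update (hypothetical helper).
   count  : consecutive matches so far (one lane of srchpos)
   sample : byte from the data stream (one lane of queue[base])
   wanted : byte expected next in the search string (one lane of searchfor) */
int next_match_count(int count, unsigned char sample, unsigned char wanted)
{
    /* abs_diff: 0 on a match, positive otherwise */
    int diff = sample > wanted ? sample - wanted : wanted - sample;
    int step = diff * -100;        /* 0 on a match, <= -100 otherwise */
    if (step < -100)
        step = -100;               /* max(step, -100) */
    step |= 1;                     /* 0 becomes 1, -100 becomes -99 (two's complement) */
    count += step;
    return count > 0 ? count : 0;  /* max(count, 0) */
}
```

With this model, a run of 91 matches followed by a miss gives 91 - 99 = -8, which the final clamp turns into 0.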
The last part takes the current srchpos, which now equals the number of consecutive matches, subtracts 1 less than the length of the search string, then clamps it to 0-1, thus ending up with either 1 (a full match) or 0. We bit shift this << 31. The result is 0 or 0x80000000. This goes into srchtest.
Lastly, we bitwise OR the first character of the search string with the result of any(srchtest) * 255 - it's one of the few ways (I'm aware of) to test something across a vector and return a single integer from it. (any() returns 1 if any member of the vector has its MSB set - which we set in the line above)
End result? srch[0] is unchanged, or, in the case of a match, it's set to 0xff. When the kernel returns, the host can read back srch from the buffer. If the first character is 0xff, we found a match.
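The same two steps can be modelled on the host in plain C. This is only a sketch of the arithmetic as the kernel line is written; the names full_match_bit and flag_first_byte are invented for illustration, and the four-element array stands in for the int4 vector.

```c
#include <stdint.h>

/* Model of: clamp(srchpos - (srchlen - 1), 0, 1) << 31 for one lane.
   Returns 0x80000000 once count has reached len, otherwise 0. */
uint32_t full_match_bit(int count, int len)
{
    int t = count - (len - 1);
    if (t < 0) t = 0;              /* lower bound of clamp */
    if (t > 1) t = 1;              /* upper bound of clamp */
    return (uint32_t)t << 31;      /* unsigned shift avoids signed overflow in C */
}

/* Model of: srch[0] |= any(srchtest) * 255
   any() is 1 if the most significant bit is set in at least one lane. */
unsigned char flag_first_byte(unsigned char first, const uint32_t lanes[4])
{
    int any = 0;
    for (int i = 0; i < 4; i++)
        any |= (int)(lanes[i] >> 31);
    return (unsigned char)(first | (any * 255));
}
```

Because the flag is OR'ed into the byte, a first character of 55 becomes 0xff on a match and stays 55 otherwise, which is what the host output further down shows.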
It probably has too many steps and can be cleaned up. It also may be less efficient than just doing 4 if checks per loop. Not sure.
But, after this massive post, the thing that has me pulling my hair out:
When I UNCOMMENT the two lines at the end that prints debug information, the script works. This is the end of the output on my terminal window as I run it:
36: 0,0,0,0 0,0,0,0
37: 0,0,0,0 0,0,0,0
38: 0,0,0,0 0,0,0,0
39: 0,0,0,0 0,0,0,0
Search = 613.384 ms
Positive
Done read loop: -1 27 41
Positive means the string was found. The -1 27 41 is the first 3 characters of the search string, the first being set to -1 (signed char on the host side).
Here's what happens when I comment out the printf debugging info:
Search = 0.150 ms
Negative
Done read loop: 55 27 41
IT DOES NOT FIND IT. What?! How is that possible? Of course, I notice that the script execution time jumps from .15ms to 600+ms because of the printf, so I think, maybe it's somehow returning and reading the data BEFORE the script ends, and the extra delay from the printf gives it a pause. So I add a barrier(CLK_LOCAL_MEM_FENCE); to the end, thinking that will make sure all threads are done before returning. Nope. No effect. I then add in a 2 second sleep on the host side, after running the kernel, after running clFinish, and before running clEnqueueReadBuffer.
NOPE! Still Negative. But I put the printf back in - and it works. How is that possible? Why? Does anyone have any idea? This is the first time I've had a programming bug that baffled me to the point of pulling hair out, because it makes absolutely zero sense. The work items are not clashing, they each read their own block, and even have an overlap in case the search string is split across two work item blocks.
Please - save my hair - how can a printf of irrelevant data cause this to work and removing it causes it to not?
Oh - one last fun thing: If I remove the parameters from the printf - just have it print text like "grr please work" - the kernel returns a negative, AND, nothing prints out. The printf is ignored.
What the heck is going on? Thanks for reading, I know this was absurdly long.
For anyone referencing this question in the future, the issue was caused by my arrays being read out of bounds. When that happens, all heck breaks loose and all results are unpredictable.
Once I fixed the work and group size and made sure I was not exceeding the memory bounds, it worked as expected.
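The exact fix is not shown here, but the general kind of check is easy to sketch in C: before reading a block, clamp the requested length to what is actually left in the buffer. The function name safe_read_count and the example sizes are assumptions for illustration only.

```c
#include <stddef.h>

/* Number of elements that can safely be read starting at index `start`
   when the buffer holds `total` elements and the caller wants `want`.
   Hypothetical helper; not taken from the original kernel or host code. */
size_t safe_read_count(size_t total, size_t start, size_t want)
{
    if (start >= total)
        return 0;                        /* start is already past the end */
    size_t avail = total - start;
    return want < avail ? want : avail;  /* never more than what is left */
}
```

For example, with a 10000-element buffer, a request for 8092 elements starting at index 8092 would be cut down to 1908, and a request starting at index 12000 would be cut down to 0.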
I was trying to write test-bench code that uses an associative array, and I saw that in one case accessing its values didn't work in combinational logic, but worked fine once moved inside a sequential block.
Here "value" was always getting assigned "x", but once I moved the access inside the @(posedge clk) block, it was assigned the right value (1 once "dummy" got assigned).
Can someone explain why this is so?
Example code:
logic dummy[logic[3:0]];
logic value;
always @ (posedge clk)
begin
    if (reset == 1'b1) begin
        count <= 0;
    end else if ( enable == 1'b1) begin
        count <= count + 1;
    end
    if(enable) begin
        if(!dummy.exists(count))
        begin
            dummy[count] = 1;
            $display (" Setting for count = %d ", count);
        end
    end
end
always_comb begin
    if(dummy.exists(count)) begin
        value = dummy[count];
        $display("Value = %d",value);
    end else begin // [Update : 1]
        value = 0;
    end
end
[UPDATE : 1 - code updated to have else block]
The question is a bit misleading: what actually seems to fail is the if(dummy.exists(count)) check when used inside comb logic, while it passes inside seq logic (and since "value" is then never assigned in this module, it goes to "x" in my simulation, so I edited the code to add an else block). This result was on the VCS simulator.
EDA-playground link : http://www.edaplayground.com/x/6eq
- There it seems to work as normally expected, i.e. if(dummy.exists(count)) passes irrespective of being inside always_comb or always @(posedge)
Result in VCS :
[when used as comb logic - value never gets printed]
Value = 0
Applying reset Value = 0
Came out of Reset
Setting for count = 0
Setting for count = 1
Setting for count = 2
Setting for count = 3
Setting for count = 4
Terminating simulation
Simulation Result : PASSED
And value gets printed as "1" when the if(dummy.exists(count)) check and the assignment are moved inside the seq block.
Your first always block contains both blocking and non-blocking assignments, which VCS may be allowing because the plain always keyword could also be used to specify combinational logic in Verilog (via always @(*)). This shouldn't account for the error, but it is bad style.
Also the first line of your program is strange, what are you trying to specify? Value is a bit, but dummy is not, so if you try doing dummy[count] = 1'b1, you'll also pop out an error (turn linting on with +lint=all). If you're trying to make dummy an array of 4 bit values, your syntax is off, and then value has the wrong size as well.
Try switching the first always to an explicit always_ff, this should give you a warning/error in VCS. Also, you can always look at the waveform, compile with +define+VPD and use gtkwave (freeware). This should let you see exactly what's happening.
Please check your VCS compilation messages and see if there is any warning related to the new SystemVerilog always_comb construct. Some simulators might have issues with the construct, or might not support it when "dynamic types" end up in the inferred sensitivity list. I tried with Incisive (ncverilog) and it is also OK there.
I'm trying to write an OpenCL implementation of memchr to help me learn how OpenCL works. What I'm planning to do is to assign each work item a chunk of memory to search. Then, inside each work item, it loops through the chunk searching for the character.
Especially if the buffer is large, I don't want the other threads to keep searching after an occurrence has already been found (assume there is only one occurrence of the character in any given buffer).
What I'm stuck on is how does a work item indicate, both to the host and other threads, when it has found the character?
Thanks,
One way you could do this is to use a global flag variable. You atomically set it to 1 when you find the value and other threads will check on that value when they are doing work.
For example:
__kernel void test(__global int* buffer, int n, __global volatile int* flag)
{
    int tid = get_global_id(0);
    int sx = get_global_size(0);
    int i = tid;
    while(i < n) //n = number of elements in buffer, so the loop stays in bounds
    {
        if(buffer[i] == 8) //Whatever value we're trying to find.
        {
            atomic_xchg(flag, 1); //Set the atomic value (flag is already a pointer)
            break;
        }
        int stop = atomic_add(flag, 0); //Read the atomic value
        if(stop)
            break;
        i = i + sx;
    }
}
This might add more overhead than by just running the whole kernel (unless you are doing a lot of work on every iteration). In addition, this method won't work if each thread is just checking a single value in the array. Each thread must have multiple iterations of work.
Finally, I've seen instances where writing to an atomic variable doesn't immediately commit, so you need to check to see if this code will deadlock on your system because the write isn't committing.
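To try the early-exit logic without a GPU, here is a host-side sketch in C11 of what a single work item does. It uses stdatomic.h in place of the OpenCL atomics, and the function name strided_search and its parameters are assumptions for illustration. It is a model of the idea, not a replacement for the kernel.

```c
#include <stdatomic.h>
#include <stddef.h>

/* One work item's share of a strided search (hypothetical helper).
   tid    : index of this work item, also its first element
   stride : total number of work items, must be greater than 0
   Returns the index where target was found, or -1 if this work item ran
   out of data or stopped because another one already set the flag. */
long strided_search(const int *buf, size_t n, size_t tid, size_t stride,
                    int target, atomic_int *found)
{
    for (size_t i = tid; i < n; i += stride) {
        if (atomic_load(found))
            return -1;               /* someone else already found it */
        if (buf[i] == target) {
            atomic_store(found, 1);  /* tell everyone else to stop */
            return (long)i;
        }
    }
    return -1;
}
```

The host only needs to read the flag (and, if the position matters, a second shared variable holding the index) once all work items have finished.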
I haven't used memcpy much, but here's my code that doesn't work.
memcpy((PVOID)(enginebase+0x74C9D),(void *)0xEB,2);
(enginebase+0x74C9D) is the address of the bytes that I want to patch.
(void *)0xEB is the op code for the kind of jmp that I want.
The only problem is that this crashes the instant that line tries to run. I don't know what I'm doing wrong. Any insight?
The argument (void*)0xEB tells memcpy to copy memory from address 0xEB; presumably you want something more like
unsigned char x = 0xEB;
memcpy((void*)(enginebase+0x74c9d), (void*)&x, 1);
in order to properly copy the value 0xEB to the destination. Note that I changed the length from 2 to 1: x is a single byte, so copying 2 bytes would read past it. If you really need to patch two bytes, put both in an array and copy from that. I'm also under the assumption that you can't just do
((unsigned char*)enginebase)[0x74c9d] = 0xEB;
for some reason? (I don't have any experience overwriting program memory intentionally)
memcpy() expects two pointers, one for the destination buffer and one for the source buffer. Your second argument is not a pointer but rather the data itself (the opcode of the jmp, as you described it). If I understand correctly what you are trying to do, you should set up an array with the opcode as its contents, and give memcpy() the pointer to that array.
The program crashes because memcpy() treats 0xEB as an address and tries to read from it, which is outside your program's assigned memory.
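As a minimal sketch of the array approach, the following C helper copies patch bytes from a real array over a target region. The name patch_bytes is made up, and an ordinary byte array stands in for the code at enginebase. On a real executable image the target page is usually not writable, which this sketch does not address.

```c
#include <string.h>
#include <stddef.h>

/* Copy n patch bytes from src over the bytes at base + offset.
   src points at an actual array in memory, so memcpy gets a valid
   source pointer instead of a raw opcode value.
   Hypothetical helper for illustration only. */
void patch_bytes(unsigned char *base, size_t offset,
                 const unsigned char *src, size_t n)
{
    memcpy(base + offset, src, n);
}
```

A call would then look like patch_bytes(region, 0x74C9D, jmp_short, sizeof jmp_short), with jmp_short declared as an unsigned char array holding 0xEB (plus any operand byte you also want to overwrite).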