OpenCL oil painting - opencl

I want to implement an oil painting filter in OpenCL, but the output image is always black and I cannot figure out why.
Here's the kernel code:
__kernel void oil_painting(__global const char* R,__global const char* G,__global const char* B,
__global char* r,__global char* g,__global char* b)
{
int i=get_global_id(0);
int j=get_global_id(1);
int i1,j1,k;
int avgR[256],avgG[256],avgB[256],intensity_count[256];
int max_pixels=0,max_intensity=0,current_intensity;
for (i1=0;i1<4;i1++) {
for (j1=0;j1<4;j1++) {
current_intensity=(((R[(i+i1)*512+j+j1]+
G[(i+i1)*512+j+j1]+
B[(i+i1)*512+j+j1])/3)*70)/255;
intensity_count[current_intensity]++;
if (intensity_count[current_intensity]>max_pixels) {
max_pixels=intensity_count[current_intensity];
max_intensity=current_intensity;
}
avgR[current_intensity]+=R[(i+i1)*512+j+j1];
avgG[current_intensity]+=G[(i+i1)*512+j+j1];
avgB[current_intensity]+=B[(i+i1)*512+j+j1];
}
}
r[i*512+j]=min(255,max(0,avgR[max_intensity]/max_pixels));
g[i*512+j]=min(255,max(0,avgG[max_intensity]/max_pixels));
b[i*512+j]=min(255,max(0,avgB[max_intensity]/max_pixels));
}

Code snippets like the following are going to get you into a lot of trouble:
current_intensity=(((R[(i+i1)*512+j+j1]+
G[(i+i1)*512+j+j1]+
B[(i+i1)*512+j+j1])/3)*70)/255;
Consider what happens for a pixel of <127,127,127>:
127 + 127 + 127 = 125 (truncated because `char` is only 8 bits...)
125 / 3 = 41
41 * 70 = 54 (truncated because `char` is only 8 bits...)
54 / 255 = 0 (this will always equal 0!)
So intensity_count will only ever have its 0-th index incremented, and nothing else.
Casting everything to int might fix this problem.
current_intensity=((((int)R[(i+i1)*512+j+j1]+
(int)G[(i+i1)*512+j+j1]+
(int)B[(i+i1)*512+j+j1])/3)*70)/255;
New output:
127 + 127 + 127 = 381
381 / 3 = 127
127 * 70 = 8890
8890 / 255 = 34
But you've now got a new problem: what if the values are any higher than 127? Suppose we change this to use <200, 200, 200> instead?
-56 + -56 + -56 = -168 (`char` only has a range in [-128, 127]! You're overflowing!)
-168 / 3 = -56
-56 * 70 = -3920
-3920 / 255 = -15
And now you've crashed your program, because you're going to attempt to access either index -15, which is illegal, or, if the index is reinterpreted as an unsigned 64-bit value, index 2^64 - 15, which is still illegal. Either way, you're going to get bad results.
The simplest solution is to change your kernel arguments to global uchar * instead of global char *, and then make sure that all arithmetic is cast up to int or long so that overflow doesn't take place.
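As a rough illustration of that advice, here is a minimal sketch in plain host-side C (not OpenCL) of the bucket computation with signed and with unsigned channel types. The function names are hypothetical and exist only for this example.

```c
/* Bucket computation with signed channels, as in the original kernel.
   Channel values above 127 arrive here as negative numbers. */
int bucket_signed(signed char r, signed char g, signed char b)
{
    return ((((int)r + (int)g + (int)b) / 3) * 70) / 255;
}

/* Same computation with unsigned channels and int arithmetic:
   the result always stays in the range 0..70. */
int bucket_unsigned(unsigned char r, unsigned char g, unsigned char b)
{
    return ((((int)r + (int)g + (int)b) / 3) * 70) / 255;
}
```

With unsigned inputs, a pixel of <200, 200, 200> lands in a small non-negative bucket, so it can safely index intensity_count; the signed variant returns a negative index for the same pixel stored as -56.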

Related

Analog to Digital input scaling equation works in codeblocks but not on Microcontroller

I'm so lost on how to fix this; it should be so simple. I'm using a PIC16F1526 and trying to scale the analog-to-digital reading from 0-255 to roughly 50-100. I am using this equation:
result = ((user_input + 200) * 200) / 800;
In Code::Blocks and on my calculator it works perfectly for all numbers from 0-255, whether I use 8-bit or 16-bit variables in Code::Blocks.
I've already verified that the A/D input is working correctly by sending the data to the UART. Even if I enter static numbers in place of the sample I get weird results.
When the ADC reads 255, or I enter 255, the equation gives me 31 in decimal instead of 100 like it's supposed to. The only thing I can think of is that something is getting messed up in the way an 8-bit PIC does its math, since it's an 8-bit micro.
It sounds like you are getting the correct results in Code::Blocks because of integer promotion, and the incorrect results on the hardware because of variable overflow.
uint8_t can contain 0 to 255
int8_t can contain -128 to 127
uint16_t can contain 0 to 65535
...
Assuming you have uint16_t, the micro's math will go as follows:
((255 + 200) * 200) / 800
(455 * 200) / 800 : 455 * 200 Overflows the 16 bit variable!
( 25464 ) / 800: Note that 91000 & 0xFFFF == 25464
31
You can work around this issue by simplifying your equation:
(user_input + 200) / 4 is equivalent to ((user_input + 200) * 200) / 800 and will not overflow at 16 bits, although your accuracy is not very high, as ImaginaryHuman072889 pointed out.
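The difference between the two forms can be checked on a desktop compiler by forcing the intermediate product into 16 bits. The following is only a sketch: the function names are made up for this example, and the explicit uint16_t casts stand in for a target whose arithmetic is 16 bits wide.

```c
#include <stdint.h>

/* Emulates 16-bit arithmetic: the product is truncated to uint16_t
   before the division, which is where the wrong result comes from. */
uint16_t scale_16bit(uint8_t user_input)
{
    uint16_t sum = (uint16_t)(user_input + 200u);
    uint16_t product = (uint16_t)(sum * 200u);   /* wraps modulo 65536 */
    return (uint16_t)(product / 800u);
}

/* Simplified form: 200/800 reduces to 1/4, so nothing overflows. */
uint16_t scale_simplified(uint8_t user_input)
{
    return (uint16_t)((user_input + 200u) / 4u);
}

/* Alternative: keep the original formula but widen the intermediate. */
uint16_t scale_widened(uint8_t user_input)
{
    uint32_t product = ((uint32_t)user_input + 200u) * 200u;
    return (uint16_t)(product / 800u);
}
```

For an input of 255, scale_16bit returns 31, while the simplified and widened versions both return 113.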
If I understand your question correctly, you want to linearly map the numbers 0-255 to the numbers 50-100.
Back to good old y = mx + b algebra.
When x = 0, y = 50. Therefore:
y = mx + b
50 = m*0 + b
b = 50
When x = 255, y = 100. Therefore:
y = mx + 50
100 = m*255 +50
m*255 = 50
m = 50/255 = 10/51
Therefore, the precise answer is:
y = (10/51)*x + 50
On a side note, I have no idea how you got the result of 31 when plugging 255 into your formula. See below.
(255+200)*200/800 = 113.75
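On a microcontroller without floating point, the line y = (10/51)*x + 50 can be evaluated in integer arithmetic. This is one possible sketch, assuming rounding to the nearest integer is acceptable; the function name is hypothetical.

```c
#include <stdint.h>

/* Maps 0..255 linearly onto 50..100, i.e. y = (10/51)*x + 50,
   written as 50*x/255 + 50 and rounded to the nearest integer.
   The largest intermediate is 255*50 + 127 = 12877, which fits in 16 bits. */
uint8_t map_adc_reading(uint8_t x)
{
    uint16_t scaled = (uint16_t)((uint16_t)x * 50u + 127u);
    return (uint8_t)(50u + scaled / 255u);
}
```

Multiplying before dividing keeps the precision that (user_input + 200) / 4 gives up, and the intermediate value stays well below the 16-bit limit.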

OpenCL Intel Iris Integrated Graphics exits with Abort Trap 6: Timeout Issue

I am attempting to write a program that executes Monte Carlo simulations using OpenCL. I have run into an issue involving exponentials. When the value of the variable steps becomes large, approximately 20000, the calculation of the exponent fails unexpectedly, and the program quits with "Abort Trap: 6". This seems to be a bizarre error, given that steps should not affect memory allocation. I have tried setting normal, alpha, and beta to 0, but this does not resolve the problem. However, commenting out the exponent and replacing it with the constant 1 seems to fix it. I have run my code on an AWS GPU instance and it does not run into any issues. Does anybody have any ideas as to why this might be a problem on an integrated graphics card?
SOLUTION
Execute the kernel multiple times over smaller ranges to keep each kernel's execution time under 5 seconds.
Code Snippet
#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif
static uint MWC64X(uint2 *state) {
enum { A = 4294883355U };
uint x = (*state).x, c = (*state).y;
uint res = x ^ c;
uint hi = mul_hi(x, A);
x = x * A + c;
c = hi + (x < c);
*state = (uint2)(x, c);
return res;
}
__kernel void discreteMonteCarloKernel(...) {
float cumulativeWalk = stockPrice;
float currentValue = stockPrice;
...
uint n = get_global_id(0);
uint2 seed2 = (uint2)(n, seed);
uint random1 = MWC64X(&seed2);
uint2 seed3 = (uint2)(random1, seed);
uint random2 = MWC64X(&seed3);
float alpha = (interestRate - 0.5 * sigma * sigma) * dt;
float beta = sigma * sqrt(dt);
float u1;
float u2;
float a;
float b;
float normal;
for (int j = 0; j < steps; j++) {
random1 = MWC64X(&seed2);
if (random1 == 0) {
random1 = MWC64X(&seed2);
}
random2 = MWC64X(&seed3);
u1 = (float)random1 / (float)0xffffffff;
u2 = (float)random2 / (float)0xffffffff;
a = sqrt(-2 * log(u1));
b = 2 * M_PI * u2;
normal = a * sin(b);
exponent = exp(alpha + beta * normal);
currentValue = currentValue * exponent;
cumulativeWalk += currentValue;
...
}
Problem Report
Exception Type: EXC_CRASH (SIGABRT)
Exception Codes: 0x0000000000000000, 0x0000000000000000
Exception Note: EXC_CORPSE_NOTIFY
Application Specific Information:
abort() called
Application Specific Signatures:
Graphics hardware encountered an error and was reset: 0x00000813
Thread 0 Crashed:: Dispatch queue: opencl_runtime
0 libsystem_kernel.dylib 0x00007fffb14bad42 __pthread_kill + 10
1 libsystem_pthread.dylib 0x00007fffb15a85bf pthread_kill + 90
2 libsystem_c.dylib 0x00007fffb1420420 abort + 129
3 libGPUSupportMercury.dylib 0x00007fffa98e6fbf gpusGenerateCrashLog + 158
4 com.apple.driver.AppleIntelHD5000GraphicsGLDriver 0x000000010915f13b gpusKillClientExt + 9
5 libGPUSupportMercury.dylib 0x00007fffa98e7983 gpusQueueSubmitDataBuffers + 168
6 com.apple.driver.AppleIntelHD5000GraphicsGLDriver 0x00000001091aa031 IntelCLCommandBuffer::getNew(GLDQueueRec*) + 31
7 com.apple.driver.AppleIntelHD5000GraphicsGLDriver 0x00000001091a9f99 intelSubmitCLCommands(GLDQueueRec*, unsigned int) + 65
8 com.apple.driver.AppleIntelHD5000GraphicsGLDriver 0x00000001091b00a1 CHAL_INTEL::ChalContext::ChalFlush() + 83
9 com.apple.driver.AppleIntelHD5000GraphicsGLDriver 0x00000001091aa2c3 gldFinishQueue + 43
10 com.apple.opencl 0x00007fff9ffeeb37 0x7fff9ffed000 + 6967
11 com.apple.opencl 0x00007fff9ffef000 0x7fff9ffed000 + 8192
12 com.apple.opencl 0x00007fffa000ccca 0x7fff9ffed000 + 130250
13 com.apple.opencl 0x00007fffa001029d 0x7fff9ffed000 + 144029
14 libdispatch.dylib 0x00007fffb13568fc _dispatch_client_callout + 8
15 libdispatch.dylib 0x00007fffb1357536 _dispatch_barrier_sync_f_invoke + 83
16 com.apple.opencl 0x00007fffa001011d 0x7fff9ffed000 + 143645
17 com.apple.opencl 0x00007fffa000bda6 0x7fff9ffed000 + 126374
18 com.apple.opencl 0x00007fffa00011df clEnqueueReadBuffer + 813
19 simplisticComparison 0x0000000107b953cf BinomialMultiplication::execute(int) + 1791
20 simplisticComparison 0x0000000107b9ec7f main + 767
21 libdyld.dylib 0x00007fffb138c235 start + 1
Thread 1:
0 libsystem_pthread.dylib 0x00007fffb15a50e4 start_wqthread + 0
1 ??? 0x000070000eed6b30 0 + 123145552751408
Thread 2:
0 libsystem_pthread.dylib 0x00007fffb15a50e4 start_wqthread + 0
Thread 3:
0 libsystem_pthread.dylib 0x00007fffb15a50e4 start_wqthread + 0
1 ??? 0x007865646e496d65 0 + 33888479226719589
Thread 0 crashed with X86 Thread State (64-bit):
rax: 0x0000000000000000 rbx: 0x0000000000000006 rcx: 0x00007fff58074078 rdx: 0x0000000000000000
rdi: 0x0000000000000307 rsi: 0x0000000000000006 rbp: 0x00007fff580740a0 rsp: 0x00007fff58074078
r8: 0x0000000000000000 r9: 0x00007fffb140ba50 r10: 0x0000000008000000 r11: 0x0000000000000206
r12: 0x00007f92de80a7e0 r13: 0x00007f92e0008c00 r14: 0x00007fffba29e3c0 r15: 0x00007f92de801a00
rip: 0x00007fffb14bad42 rfl: 0x0000000000000206 cr2: 0x00007fffba280128
Logical CPU: 0
Error Code: 0x02000148
Trap Number: 133
I have a guess. The driver can crash in two ways:
1. We reference a bad buffer address. This is probably not your case.
2. We time out (exceed the TDR, the driver's timeout detection and recovery limit). A kernel has a few seconds to complete.
My money is on #2. If the larger value (steps) makes the GPU run too long, the system will kill things.
I am not familiar with the guts of Apple's Intel driver, but typically there is a way to disable the TDR in extreme cases. E.g. see the Windows documentation on TDRs to get the gist. (Linux drivers have a way to disable this too.)
Normally we want to avoid running things that take super long and it might be a good idea to decompose the workload in some way so that you naturally don't hit this kill switch. E.g. perhaps chunk the "steps" into smaller chunks (pass in and save your state for parts you can't recompute).
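The chunking idea can be sketched on the host side in plain C. Everything here is an assumption made for illustration: WalkState, run_chunk and run_walk are hypothetical names, run_chunk stands in for one kernel launch, and a simple integer sum replaces the floating-point walk. The generator uses the same update as MWC64X in the question, written with a 64-bit product.

```c
#include <stdint.h>

/* State that has to survive between chunks (hypothetical layout). */
typedef struct {
    uint32_t x, c;   /* MWC64X generator state */
    uint64_t sum;    /* stand-in for cumulativeWalk */
} WalkState;

/* One MWC64X step: returns x ^ c, then advances x and c. */
static uint32_t mwc64x(WalkState *s)
{
    const uint32_t A = 4294883355u;
    uint32_t res = s->x ^ s->c;
    uint64_t t = (uint64_t)s->x * A + s->c;
    s->x = (uint32_t)t;
    s->c = (uint32_t)(t >> 32);
    return res;
}

/* Stand-in for one kernel launch that covers only `steps` iterations. */
void run_chunk(WalkState *s, int steps)
{
    for (int j = 0; j < steps; j++)
        s->sum += mwc64x(s);
}

/* Host loop: many short launches instead of one long one. */
uint64_t run_walk(uint32_t seed, int total_steps, int chunk_size)
{
    WalkState s = { seed, 1u, 0u };
    for (int done = 0; done < total_steps; done += chunk_size) {
        int left = total_steps - done;
        run_chunk(&s, left < chunk_size ? left : chunk_size);
    }
    return s.sum;
}
```

Because the generator state and the running total are carried from one chunk to the next, the chunked run gives the same result as a single long run. In a real OpenCL program that state would live in a buffer that each launch reads and writes back.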

Arduino: Formula to convert byte

I'm looking for a way to modify a binary byte value on Arduino.
Because of the hardware, it's necessary to split a two-digit number into two 4-bit values.
The code to set the output is wire.write(byte, 0xFF), which sets all outputs high.
0xFF = binary 1111 1111
The formula should convert a value like this:
e.g. the number 35 is binary 0010 0011
but for my use it should be output as 0011 0101, which would correspond to 53 in reality.
The first 4 bits are for a BCD-Input IC which displays the 5 from 35, the second 4 bits are for a BCD-Input IC which displays the 3 from 35.
Does anybody have an idea how to do this conversion in code, or with a mathematical formula?
Possible numbers are from 00 to 59.
Thank you for your help.
To convert a value n between 0 and 99 to BCD:
((n / 10) * 16) + (n % 10)
assuming n is an integer and thus / is doing integer division; also assumes this will be stored in an unsigned byte.
(If this is not producing the desired result, please either explain how it is incorrect for the example given, or provide a different example for which it is incorrect.)
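Written out as a small C function, the formula looks like this. It is a sketch only: the function name is an assumption, and the shift and OR are equivalent to the multiply by 16 and the add in the formula above.

```c
#include <stdint.h>

/* Packs a value in 0..99 into BCD: tens digit in the high nibble,
   units digit in the low nibble. */
uint8_t to_bcd(uint8_t n)
{
    return (uint8_t)(((n / 10) << 4) | (n % 10));
}
```

For 35 this yields 0x35 (binary 0011 0101, decimal 53), which matches the example in the question.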
#include <string>
int num = 35; // Any number from 0 to 59
int tens = num / 10;
int units = num - (tens * 10);
// Build a 4-character binary string for one decimal digit, most significant bit first
std::string toBinary4(int digit)
{
std::string bits;
for (int i = 3; i >= 0; i--)
{
bits.append(((digit >> i) & 1) ? "1" : "0");
}
return bits;
}
// Convert the tens and the units
std::string binarytens = toBinary4(tens);
std::string binaryunits = toBinary4(units);
// Now join the two together
std::string joined = binarytens.append(binaryunits);
I don't know if this will work, but still, you might be able to extrapolate based on the available information in my code.
The last thing you need to do is convert the string to binary.

Opencl size of local memory has impact on speed?

I am new to OpenCL and I am trying to compute the histogram of a grayscale image. I am performing this computation on an NVIDIA GT 330M GPU.
The code is:
__kernel void histogram(__global struct gray * input, __global int * global_hist, __local volatile int * histogram){
int local_offset = get_local_id(0) * 256;
int histogram_global_offset = get_global_id(0) * 256;
int offset = get_global_id(0) * 1920;
int value;
for(unsigned int i = 0; i < 256; i++){
histogram[local_offset + i] = 0;
}
barrier(CLK_LOCAL_MEM_FENCE);
for(unsigned int i = 0; i < 1920; i++){
value = input[offset + i].i;
histogram[local_offset + value]++;
}
barrier(CLK_LOCAL_MEM_FENCE);
for(unsigned int i = 0; i < 256; i++){
global_hist[histogram_global_offset + i] = histogram[local_offset + i];
}
}
This computation is performed on a 1920*1080 image.
I am launching the kernel with
queue.enqueueNDRangeKernel(kernel_histogram, cl::NullRange, cl::NDRange(1080), cl::NDRange(1));
When the local histogram size is set to 256 * sizeof(cl_int), this computation takes 11 675 microseconds (measured with NVIDIA Nsight performance analysis).
Because the local work-group size is set to one, I tried increasing it to 8. But when I increase the local histogram size to 256 * 8 * sizeof(cl_int) and still compute with a local work-group size of 1, I get 85 177 microseconds.
So when I launch it with 8 work-items per work-group, the speedup is not relative to 11 ms but to 85 ms. The final time with 8 work-items per work-group is 13 714 microseconds.
But when I deliberately introduce a computation bug by setting local_offset to zero, with a local histogram size of 256 * sizeof(cl_int) and 8 work-items per work-group, I get a much better time: 3 854 microseconds.
Does anybody have any ideas for speeding up this computation?
Thanks!
This answer assumes you want to eventually reduce your histogram all the way down to 256 int values. You call the kernel with as many work groups as you have compute units on your device, and group size should be (as always) a multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE on the device.
__kernel void histogram(__global struct gray * input, __global int * global_hist){
int group_id = get_group_id(0);
int num_groups = get_num_groups(0);
int local_id = get_local_id(0);
int local_size = get_local_size(0);
volatile __local int histogram[256];
int i;
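for(i=local_id; i<256; i+=local_size){
histogram[i] = 0;
}
// Make sure every bin is zeroed before any work item starts counting
barrier(CLK_LOCAL_MEM_FENCE);
int rowNum, colNum, value, global_hist_offset;
for(rowNum = group_id; rowNum < 1080; rowNum+=num_groups){
for(colNum = local_id; colNum < 1920; colNum += local_size){
value = input[rowNum*1920 + colNum].i;
// Atomically increment the bin for this pixel's value
atomic_inc(&histogram[value]);
}
}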
barrier(CLK_LOCAL_MEM_FENCE);
global_hist_offset = group_id * 256;
for(i=local_id; i<256; i+=local_size){
global_hist[global_hist_offset + i] = histogram[i];
}
}
Each work group works cooperatively on one row of the image at a time. Then the group moves on to another row, calculated using the num_groups value. This will work well no matter how many groups you have. For example, if you have 7 compute units, group 3 (the fourth group) will start on row 3 in the image, and then take every 7th row thereafter. Group 3 would compute 154 rows in total, and its final row would be row 1074. Some work groups may compute 1 more row -- groups 0 and 1 in this example.
The same interlacing is done within the work group when looking at the columns of the image. In the colNum loop, the Nth work item starts at column N, and skips ahead by local_size columns. The remainder for this loop shouldn't come into play as often, because CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE will likely be a factor of 1920. Try all work group sizes from (1..X) * CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE up to the maximum work group size for your device.
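The row interleaving can be checked with a few lines of host-side C that mirror the rowNum loop of the kernel. This is only a sketch, and both function names are hypothetical.

```c
/* Number of image rows handled by one work group when rows are dealt
   out round-robin: group_id, group_id + num_groups, and so on. */
int rows_for_group(int group_id, int num_groups, int height)
{
    int count = 0;
    for (int row = group_id; row < height; row += num_groups)
        count++;
    return count;
}

/* Last row index visited by the group, or -1 if it gets no rows. */
int last_row_for_group(int group_id, int num_groups, int height)
{
    int last = -1;
    for (int row = group_id; row < height; row += num_groups)
        last = row;
    return last;
}
```

Calling these with group_id 3, num_groups 7 and height 1080 is a quick way to check the example above, and the counts over all groups add up to the image height.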
One final point about this kernel: the results are not identical to your original kernel. Your global_hist array is 1080 * 256 integers. The one I have needs to be num_groups * 256 integers. This helps if you want a full reduction, because there is much less to add after the kernel executes.
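The remaining reduction over num_groups partial histograms is small enough to do on the host. A minimal sketch, assuming the results were read back into a plain int array laid out the way the kernel writes global_hist (the function name is hypothetical):

```c
/* Sums num_groups partial histograms of 256 bins each, stored back to
   back, into one final 256-bin histogram. */
void reduce_histograms(const int *partial, int num_groups, int *final_hist)
{
    for (int bin = 0; bin < 256; bin++) {
        int total = 0;
        for (int g = 0; g < num_groups; g++)
            total += partial[g * 256 + bin];
        final_hist[bin] = total;
    }
}
```

With a handful of work groups this loop touches only num_groups * 256 values, compared with 1080 * 256 for the original layout.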

Arduino and RC Transmitter

I am new to Arduino and to this forum and this is my first Arduino project besides the tutorials.
I am trying to control a servo using an RC transmitter/receiver and the Arduino. The reason I am using an Arduino instead of connecting the servo directly to the RC receiver is that the RC receiver can only generate a PWM pulse of 1000µs to 2000µs, while I need a PWM pulse of 600µs to 2400µs to get the full range of motion of my servo. What I have tried to do is read the value from pulseIn() and then map this value to 0 to 180 degrees, as written in the code below (which uses the Servo library).
However, with this code, the motor behaviour is weird. As I move the radio transmitter control stick through its range of motion, the motor rotates from 0 to 45 degrees, back from 45 to 0, 0 to 45, and back to 0 again instead of sweeping from 0 to 180 degrees. Could anyone please offer some help or advice?
Thank you very much
#include <Servo.h>
Servo myservo;
int ch1;
int ch2;
int ch3;
int degree;
void setup() {
pinMode(7, INPUT);
myservo.attach(9);
Serial.begin(9600);
}
void loop() {
ch3 = pulseIn(7, HIGH, 25000);
degree = ((ch3-1250)* 180)/700;
Serial.print("Channel 3:");
Serial.println(ch3);
myservo.write(degree);
delay(5); // waits 5ms for the servo to reach the position
}
You are overflowing the int data type. The signed value can only range from -32768 to +32767. See the int docs.
Your formula is all ints, and the compiler will not guess that you might need a larger intermediate value. The multiply by 180 is a red flag: (2000-1250)*180 = 135000 = boom.
To understand the math, break down a formula into the individual operations as shown in the test program below. That is essentially what the compiler is doing for you.
Run the program below and you will see the failure. Just after the out value reaches 45, the intermediate value overflows and the formula breaks down.
in: 1040 out: 39 t0: -210 t1: 27736 t2: 39
in: 1048 out: 41 t0: -202 t1: 29176 t2: 41
in: 1056 out: 43 t0: -194 t1: 30616 t2: 43
in: 1064 out: 45 t0: -186 t1: 32056 t2: 45
in: 1072 out: -45 t0: -178 t1: -32040 t2: -45
in: 1080 out: -43 t0: -170 t1: -30600 t2: -43
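Use the program below as a test fixture. Modify the data types so that the intermediate values fit (on a board with a 16-bit int, unsigned int still stops at 65535, so a 32-bit long is the safer choice) and you will be able to make the output behave as you need.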
int ch3;
int degree;
void setup() {
ch3 = 1000;
Serial.begin(9600);
}
void loop() {
int t0, t1, t2;
degree = ((ch3-1250)* 180)/700;
t0 = ch3 - 1250;
t1 = t0 * 180;
t2 = t1 / 700;
Serial.print("in: ");
Serial.print(ch3);
Serial.print(" out: ");
Serial.print(degree);
Serial.print(" t0: ");
Serial.print(t0);
Serial.print(" t1: ");
Serial.print(t1);
Serial.print(" t2: ");
Serial.println(t2);
ch3 += 8;
if(ch3 > 2400) {
ch3 = 1000;
}
delay(100);
}
As a note, you may have more Arduino/servo luck on https://robotics.stackexchange.com/.
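The same comparison can be run on a desktop compiler by emulating the 16-bit wraparound explicitly. This is a sketch under that assumption; degree_16bit and degree_32bit are hypothetical names, and the formula is the one from the question.

```c
#include <stdint.h>

/* Emulates the formula on a board where int is 16 bits wide:
   the product t0 * 180 wraps around modulo 65536. */
int16_t degree_16bit(int16_t ch3)
{
    int16_t t0 = (int16_t)(ch3 - 1250);
    uint16_t wrapped = (uint16_t)((uint16_t)t0 * 180u);
    /* Reinterpret the 16-bit pattern as a signed two's complement value. */
    int16_t t1 = (wrapped >= 32768u) ? (int16_t)((int32_t)wrapped - 65536)
                                     : (int16_t)wrapped;
    return (int16_t)(t1 / 700);
}

/* Same formula with a 32-bit intermediate: no wraparound. */
int32_t degree_32bit(int32_t ch3)
{
    return ((ch3 - 1250) * 180) / 700;
}
```

For an input of 1040 the emulated 16-bit version returns 39, as in the log above, while the 32-bit version returns -54. For 2000 the two return 5 and 192 respectively.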
What are you seeing on the serial output? Is ch3 cycling from 0 to 45 or from 0 to 180? Don't forget that map() is designed to do what you're doing by hand here.
My first suspicion is that you're occasionally getting 0 back from pulseIn either because you're timing out, or you're starting your reading in the middle of a pulse (which could lead to a shorter pulse than you expect).
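Both suggestions can be combined in a small conversion routine. The sketch below is plain C rather than an Arduino sketch: map_range uses the formula documented for Arduino's map(), pulse_to_degree is a hypothetical helper, and the 1000 to 2000 microsecond input range is taken from the question.

```c
#include <stdint.h>

/* Linear re-mapping with 32-bit math, same formula as Arduino's map(). */
int32_t map_range(int32_t x, int32_t in_min, int32_t in_max,
                  int32_t out_min, int32_t out_max)
{
    return (x - in_min) * (out_max - out_min) / (in_max - in_min) + out_min;
}

/* Converts a pulse width in microseconds to a servo angle in degrees.
   Returns -1 for a pulse of 0, which is what pulseIn() reports on timeout. */
int pulse_to_degree(int32_t pulse_us)
{
    if (pulse_us == 0)
        return -1;               /* no pulse seen: caller should skip it */
    if (pulse_us < 1000)
        pulse_us = 1000;         /* clamp truncated or noisy readings */
    if (pulse_us > 2000)
        pulse_us = 2000;
    return (int)map_range(pulse_us, 1000, 2000, 0, 180);
}
```

A caller would only forward the result to the servo when it is not -1, so a timed-out reading does not snap the servo back to 0 degrees.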
