OpenCL Image reading not working on dynamically allocated array - opencl

I'm writing an OpenCL program that applies a convolution matrix to an image. Everything works fine if I store all the pixels in a fixed-size array image[height*width][4] (line 65, commented out; sorry, I speak Spanish and mostly code in Spanish). But since the images I'm working with are really large, I need to allocate the memory dynamically. When I run the code, I get a segmentation fault.
After some poor man's debugging, I found that the problem arises after executing the kernel, when the output image is read back to the host and stored in the dynamically allocated array. I just can't access the data in the array without getting the error.
I think the problem is the way the clEnqueueReadImage function (line 316) writes the image data into the image array. This array was allocated dynamically, so it has no predefined "structure".
I need a solution, and I can't find one, either on my own or on the Internet.
The C program and the OpenCL kernel are here:
https://gist.github.com/MigAGH/6dd0fddfa09f5aabe7eb0c2934e58cbe

Don't use pointers to pointers (unsigned char**). Use a regular pointer instead:
unsigned char* image = (unsigned char*)malloc(sizeof(unsigned char)*ancho*alto*4);
Then in the for loop:
for(i=0; i<ancho*alto; i++){
    /* read one 4-byte pixel straight into the flat buffer; no temporary allocation needed */
    fread(&image[i*4], 4, 1, bmp_entrada);
}
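The reason the flat buffer matters: clEnqueueReadImage takes a single void* and writes the pixel data contiguously starting at that address. With an unsigned char** the destination is an array of row pointers, so the read overwrites the pointers themselves and any later access through them can crash. Here is a minimal sketch of the flat layout; alloc_rgba, read_pixels and pixel_at are hypothetical helper names, not taken from the linked gist.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Allocate one contiguous RGBA buffer: 4 bytes per pixel, row after row.
   Hypothetical helper; returns NULL if the allocation fails. */
unsigned char *alloc_rgba(size_t width, size_t height)
{
    assert(width > 0 && height > 0);
    return malloc(width * height * 4);
}

/* Read `count` 4-byte pixels from `fp` into `image` in a single call.
   Returns the number of whole pixels actually read. */
size_t read_pixels(FILE *fp, unsigned char *image, size_t count)
{
    return fread(image, 4, count, fp);
}

/* Pointer to the 4 channel bytes of pixel (x, y) in a buffer of the given width. */
unsigned char *pixel_at(unsigned char *image, size_t width, size_t x, size_t y)
{
    return image + (y * width + x) * 4;
}
```

The same pointer can then be passed as the host pointer when reading the image back, and individual pixels are reached with pixel_at(image, ancho, x, y) instead of image[i][c].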


Is there a minimum string length for F() to be useful?

Is there a lower limit on string length below which using the F() macro costs more RAM than it saves?
For (a contrived) example:
Serial.print(F("\n"));
Serial.print(F("Hi"));
Serial.print(F("there!"));
Serial.print(F("How do you doyou how?"));
Would any one of those be more efficient without the F()?
I imagine it uses some RAM to iterate over the string and copy it from PROGMEM to RAM. I guess the question is: how much? Also, is heap fragmentation a concern here?
I'm looking at this purely from SRAM-conserving perspective.
From a purely SRAM-conserving perspective all of your examples are identical in that no SRAM is used. At run-time some RAM is used, but only momentarily on the stack. Keep in mind that calling println() (w/o any parameters) uses some stack/RAM.
For a single character it will take up less space in flash if a char is passed into print or println. For example:
Serial.print('\n');
The char will be in flash (not static RAM).
Using
Serial.print(F("\n"));
will create a string in flash memory that is two bytes long (newline char + null terminator) and will additionally pass a pointer to that string to print which is probably two bytes long.
Additionally at runtime, using the F macro will result in two fetches ('\n' and the null terminator) from flash. While fetches from flash are fast, passing in a char results in zero fetches from flash, which is a tiny bit faster.
I don't think there is any minimum size of the string to be useful. If you look at how the outputting is implemented in Print.cpp:
size_t Print::print(const __FlashStringHelper *ifsh)
{
PGM_P p = reinterpret_cast<PGM_P>(ifsh);
size_t n = 0;
while (1) {
unsigned char c = pgm_read_byte(p++);
if (c == 0) break;
n += write(c);
}
return n;
}
You can see from there that only one byte of RAM is used at a time (plus a couple of variables), as it pulls the string from PROGMEM a byte at a time. These are all on the stack so there is no ongoing overhead.
I imagine it uses some RAM to iterate over the string and copy it from PROGMEM to RAM. I guess the question is: how much?
No, it doesn't as I showed above. It outputs a byte at a time. There is no copying (in bulk) of the string into RAM first.
Also, is heap fragmentation a concern here?
No, the code does not use the heap.

Reading unsigned short array using Qt in C++

I'm trying to read an array of unsigned shorts using the Qt API. Unfortunately, I'm not getting the desired results.
The following code
QFile in(fileName);
int len = in.size();
QDataStream d(&in);
quint16 *data = new quint16[len];
qDebug() << data[0];
qDebug() << data[1];
d >> data[0];
qDebug() << data[0];
qDebug() << data[1];
outputs
52685
52685
13109
52685
This implies that the data is only changed at the first array position. Also, I always thought that arrays were zero-initialized? Using a QByteArray doesn't seem to work here, which is why I'm trying to use an array of quint16 (= unsigned shorts). Using a loop may be an option, but I'm trying to avoid a costly loop wherever possible.
So, how do I fill said array (data) with the desired data from the file? Is it possible to carry the data using a QByteArray?
First of all, in.size() returns the size of the file in bytes, and since you are using unsigned shorts (which are 2 bytes each), the size of your data array should be len/2.
Also, QDataStream is provided for serialization purposes. This means that it is mainly useful for extracting single objects at a time. See the documentation for QDataStream for more information.
You can extract the whole array without loops or copying with this code:
QFile in(fileName);
in.open(QFile::ReadOnly);
QByteArray byteArray = in.readAll();
quint16 *data = (quint16*) byteArray.data();
If you only wish to read the data and never modify it, change the last line to the following; unlike data(), constData() never detaches (deep-copies) a shared QByteArray:
const quint16* data = (const quint16*) byteArray.constData();
Keep in mind however that with this (somewhat ugly) code the pointer will only be valid for the lifetime of the QByteArray object. This usually means that you can only use data until the end of the function.
If you wish your data to persist longer than that, you must allocate the array and read directly into it:
QFile in(fileName);
in.open(QFile::ReadOnly);
int len = in.size();
quint16 *data = new quint16[len/2];
in.read((char*)data, len);
This way, you can access the data until you delete[] data.
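The sizing rule used above (a file of len bytes holds len/2 16-bit values) can be sketched in plain C, with an in-memory byte buffer standing in for the file contents. This is only an illustration under that assumption: load_u16 is a hypothetical helper, not part of Qt, and the values come out in the host's byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Copy raw bytes into a freshly allocated array of 16-bit values.
   *count receives byte_len / 2; a trailing odd byte is ignored.
   The caller releases the result with free(). */
uint16_t *load_u16(const unsigned char *bytes, size_t byte_len, size_t *count)
{
    assert(bytes != NULL && count != NULL);
    *count = byte_len / 2;                        /* two bytes per value */
    uint16_t *data = malloc(*count * sizeof *data);
    if (data != NULL)
        memcpy(data, bytes, *count * sizeof *data);
    return data;
}
```

Copying with memcpy rather than casting the byte pointer also sidesteps any alignment concerns about the source buffer.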
Finally, to answer your subquestion, arrays are zero-initialized automatically only if they have static storage duration (globals and statics). Arrays allocated with plain new[] or malloc are left uninitialized for performance reasons; if you need zeros, ask for them explicitly, for example with new quint16[n]() or calloc.
If you look at the documentation for QDataStream, you'll see there are two other methods at your disposal worth considering, if you don't want to use a loop:
QDataStream::readBytes - This will allocate a char[] buffer for you. Or,
QDataStream::readRawData - This will read data into a buffer you provide.
The problem with these is they work with char (bytes), not quint16 as you desire.
I would recommend using a loop to read the quint16s. It will be the clearest code, and any respectable underlying stream implementation will read ahead from the hardware into a buffer, so your many successive >> calls won't be as expensive as they look.

Exit early on found in OpenCL

I'm trying to write an OpenCL implementation of memchr to help me learn how OpenCL works. What I'm planning to do is to assign each work item a chunk of memory to search. Then, inside each work item, it loops through the chunk searching for the character.
Especially if the buffer is large, I don't want the other threads to keep searching after an occurrence has already been found (assume there is only one occurrence of the character in any given buffer).
What I'm stuck on is how does a work item indicate, both to the host and other threads, when it has found the character?
Thanks,
One way you could do this is to use a global flag variable. You atomically set it to 1 when you find the value and other threads will check on that value when they are doing work.
For example:
__kernel void test(__global const int* buffer, int n, __global volatile int* flag)
{
    int tid = get_global_id(0);
    int sx = get_global_size(0);
    //n is the buffer length; it is an added parameter so the loop stays in bounds
    for (int i = tid; i < n; i += sx)
    {
        if (atomic_add(flag, 0)) //Read the atomic value; stop if another work item already found it
            break;
        if (buffer[i] == 8) //Whatever value we're trying to find.
        {
            atomic_xchg(flag, 1); //Set the atomic value
            break;
        }
    }
}
This might add more overhead than by just running the whole kernel (unless you are doing a lot of work on every iteration). In addition, this method won't work if each thread is just checking a single value in the array. Each thread must have multiple iterations of work.
Finally, I've seen instances where writing to an atomic variable doesn't immediately commit, so you need to check to see if this code will deadlock on your system because the write isn't committing.

16 byte memory alignment using SSE instructions

I am trying to get rid of unaligned loads and stores for SSE instructions in my application by replacing
_mm_loadu_ps()
by
_mm_load_ps()
and allocating memory with:
float *ptr = (float *) _mm_malloc(h*w*sizeof(float),16);
instead of:
float *ptr = (float *) malloc(h*w*sizeof(float));
However, when I print the pointer addresses using:
printf("%p\n", &ptr);
I get output:
0x2521d20
0x2521d28
0x2521d30
0x2521d38
0x2521d40
0x2521d48
...
These addresses are not 16-byte aligned, even though I used the _mm_malloc function.
And when I use the aligned load/store operations for the SSE instructions, I get a segmentation fault since the data isn't 16-byte aligned.
Any ideas why it isn't aligned properly, or any other ideas to fix this?
Thanks in advance!
Update
Using
printf("%p\n", ptr);
solved the problem with the memory alignment; the data is indeed properly aligned.
However, I still get a segmentation fault when trying to do an aligned load/store on this data, and I suspect it's a pointer issue.
When allocating the memory:
contents* instance;
instance.values = (float *) _mm_malloc(h*w*sizeof(float),16);
I have a struct with:
typedef struct{
...
float** values;
...
}contents;
In the code i then execute in another function, with a pointer to contents passed as argument:
__m128 tmp = _mm_load_ps(&contents.values);
Do you guys see anything i am missing?
Thanks for all the help so far :)
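Change:
printf("%p\n", &ptr);
to:
printf("%p\n", ptr);
It's the memory that ptr points to that needs to be 16-byte aligned, not the pointer variable itself. &ptr is the address of the variable, which only has to be aligned for a pointer (typically 8 bytes on a 64-bit system), which is why the printed addresses were 8 apart.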
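To check alignment in code rather than by eye, you can test the low bits of the address. The sketch below is only an illustration: it uses the standard C11 aligned_alloc in place of _mm_malloc so that it does not depend on the x86 intrinsics headers, and is_aligned and alloc_floats_16 are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Nonzero if address `p` is a multiple of `alignment` (which must be a power of two). */
int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Allocate room for `count` floats on a 16-byte boundary.
   Stand-in for _mm_malloc(count * sizeof(float), 16); release with free(). */
float *alloc_floats_16(size_t count)
{
    size_t bytes = count * sizeof(float);
    bytes = (bytes + 15) & ~(size_t)15;  /* round up: C11 wants a multiple of the alignment */
    return aligned_alloc(16, bytes);
}
```

With this, is_aligned(ptr, 16) checks the allocated block, which is what the aligned loads care about, whereas is_aligned(&ptr, 16) would check the pointer variable and may go either way.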

Using memcpy to change a jnz to a jmp

I haven't used memcpy much, but here's my code that doesn't work.
memcpy((PVOID)(enginebase+0x74C9D),(void *)0xEB,2);
(enginebase+0x74C9D) is a pointer location to the address of the bytes that I want to patch.
(void *)0xEB is the op code for the kind of jmp that I want.
The only problem is that this crashes the instant the line runs. I don't know what I'm doing wrong; any insight?
The argument (void*)0xEB is saying to copy memory from address 0xEB; presumably you want something more like
unsigned char x = 0xEB;
memcpy((void*)(enginebase+0x74c9d), (void*)&x, 2);
in order to properly copy the value 0xEB to the destination. BTW, is 2 the right value to copy a single byte to program memory? Looks like it should be 1, since you're copying 1 byte. I'm also under the assumption that you can't just do
((char*)enginebase)[0x74c9d] = 0xEB;
for some reason? (I don't have any experience overwriting program memory intentionally)
memcpy() expects two pointers, one for the source buffer and one for the destination buffer. Your second argument is not a pointer but rather the data itself (it is the opcode of the jmp, as you described it). If I understand correctly what you are trying to do, you should set up an array with the opcode as its contents and give memcpy() a pointer to that array.
The program crashes because memcpy() tries to read from a memory location outside your process's assigned space (address 0xEB).
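Putting both answers together, here is a minimal sketch of copying a one-byte opcode from a real source object. It patches an ordinary array, not executable memory, and patch_byte is a hypothetical helper name; whether the target address in the question is writable at all is a separate matter that this does not address.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Overwrite the single byte at base[offset] with `opcode`.
   The source passed to memcpy is the address of a real object holding the value,
   and the length is 1 because exactly one byte is copied. */
void patch_byte(unsigned char *base, size_t offset, unsigned char opcode)
{
    assert(base != NULL);
    memcpy(base + offset, &opcode, 1);
}
```

For example, with a buffer holding 0x75 0x05 (a short jnz followed by its displacement), patch_byte(buf, 0, 0xEB) turns the first byte into a short jmp and leaves the displacement byte untouched.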
