16 byte memory alignment using SSE instructions

I am trying to get rid of unaligned loads and stores for SSE instructions in my application by replacing
_mm_loadu_ps()
with
_mm_load_ps()
and allocating memory with:
float *ptr = (float *) _mm_malloc(h*w*sizeof(float),16);
instead of:
float *ptr = (float *) malloc(h*w*sizeof(float));
However, when I print the pointer addresses using:
printf("%p\n", &ptr);
I get output:
0x2521d20
0x2521d28
0x2521d30
0x2521d38
0x2521d40
0x2521d48
...
This is not 16-byte aligned, even though I used the _mm_malloc function.
And when I use the aligned load/store operations for the SSE instructions, I get a segmentation fault since the data isn't 16-byte aligned.
Any ideas why it isn't aligned properly, or any other ideas to fix this?
Thanks in advance!
Update
Using
printf("%p\n", ptr);
showed that the memory alignment was not the problem: the data is indeed properly aligned.
However, I still get a segmentation fault when trying to do an aligned load/store on this data, and I suspect it's a pointer issue.
When allocating the memory:
contents* instance;
instance.values = (float *) _mm_malloc(h*w*sizeof(float),16);
I have a struct with:
typedef struct{
...
float** values;
...
}contents;
In another function, which receives a pointer to contents as an argument, I then execute:
__m128 tmp = _mm_load_ps(&contents.values);
Do you guys see anything I am missing?
Thanks for all the help so far :)

Change:
printf("%p\n", &ptr)
to:
printf("%p\n", ptr)
It's the memory that ptr points to that needs to be 16-byte aligned, not the pointer variable itself.
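The distinction between the pointer and the memory it points to can be checked directly. The following is a minimal sketch, not the asker's actual code: the Contents struct, is_aligned and contents_alloc are hypothetical names, the field is assumed to be meant as float * rather than float **, and C11 aligned_alloc stands in for _mm_malloc so that the sketch does not depend on x86 headers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical struct modelled on the question's `contents`. The field is
   assumed to be float * (not float **), since it holds the allocation itself. */
typedef struct {
    float *values;
} Contents;

/* Returns 1 if the address p is a multiple of `alignment`, 0 otherwise. */
int is_aligned(const void *p, size_t alignment) {
    return ((uintptr_t)p % alignment) == 0;
}

/* Allocates h*w floats on a 16-byte boundary. aligned_alloc (C11) stands in
   for _mm_malloc here; it wants the size to be a multiple of the alignment.
   Returns 1 on success, 0 on failure. */
int contents_alloc(Contents *c, size_t h, size_t w) {
    size_t bytes = h * w * sizeof(float);
    bytes = (bytes + 15) / 16 * 16;   /* round up to a multiple of 16 */
    c->values = (float *)aligned_alloc(16, bytes);
    return c->values != NULL;
}
```

After contents_alloc(&c, h, w), is_aligned(c.values, 16) checks the allocation, which is what matters; the address of the field, &c.values, is only guaranteed the alignment of a pointer. By the same reasoning an aligned load would take the pointer value, _mm_load_ps(c->values), rather than the address of the field. Memory obtained from _mm_malloc is released with _mm_free, while memory from aligned_alloc is released with free.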

Related

Vector contains data but reports length is 0, can be accessed by some functions

I've written a wrapper for a camera library in Rust that commands and operates a camera, and also saves an image to file using bindgen. Once I command an exposure to start (basically telling the camera to take an image), I can grab the image using a function of the form:
pub fn GetQHYCCDSingleFrame(
handle: *mut qhyccd_handle,
w: *mut u32,
...,
imgdata: &mut [u8],) -> u32 //(u32 is a retval)
In C++, this function was:
uint32_t STDCALL GetQHYCCDSingleFrame(qhyccd_handle *handle, ..., uint8_t *imgdata)
In C++, I could pass in a buffer of the form imgdata = new unsigned char[length_buffer] and the function would fill the buffer with image data from the camera.
In Rust, similarly, I can pass in a buffer in the form of a Vec: let mut buffer: Vec<u8> = Vec::with_capacity(length_buffer).
Currently, the way I have structured the code is that there is a main struct, with settings such as the width and height of image, the camera handle, and others, including the image buffer. The struct has been initialized as a mut as:
let mut main_settings = MainSettings {
width: 9600,
...,
buffer: Vec::with_capacity(length_buffer),
}
There is a separate function I wrote that takes the main struct as a parameter and calls the GetQHYCCDSingleFrame function:
fn grab_image(main_settings: &mut MainSettings) {
let retval = unsafe { GetQHYCCDSingleFrame(main_settings.cam_handle, ..., &mut main_settings.image_buffer) };
}
Immediately after calling this function, if I check the length and capacity of main_settings.image_buffer:
println!("Elements in buffer are {}, capacity of buffer is {}.", main_settings.image_buffer.len(), main_settings.image_buffer.capacity());
I get 0 for length, and length_buffer as the capacity. Similarly, indexing into the buffer, such as main_settings.image_buffer[0] or [1], leads to a panic saying len is 0.
This would make me think that the GetQHYCCDSingleFrame code is not working properly. However, when I save the image_buffer to file using fitsio and hdu.write_region, I use:
let ranges = [&(x_start..(x_start + roi_width)), &(y_start..(y_start+roi_height))];
hdu.write_region(&mut fits_file, &ranges, &main_settings.image_buffer).expect("Could not write to fits file");
This saves an actual image to file with the right size, and it is a perfectly fine image (exactly what it would look like if I took it using the C++ program). However, when I try to print the buffer, it is empty for some reason, yet the hdu.write_region code is somehow able to access the data.
Currently, my (not good) workaround is to create another vector that reads data from the saved file and saves to a buffer, which then has the right number of elements:
main_settings.new_buffer = hdu.read_region(&mut fits_file, &ranges).expect("Couldn't read fits file");
Why can I not access the original buffer at all, and why does it report length 0, when the hdu.write_region function can access data from somewhere? Where exactly is it accessing the data from, and how can I correctly access it as well? I am a bit new to borrowing and referencing, so I believe I might be doing something wrong in borrowing/referencing the buffer, or is it something else?
Sorry for the long story, but the details would probably be important for everything here. Thanks!

Well, first of all, you need to know that Vec<u8> and &mut [u8] are not quite the same as C or C++'s uint8_t *. The main difference is that Vec<u8> and &mut [u8] carry the length of the array or slice with them, while uint8_t * doesn't. The Rust equivalent of a C/C++ pointer is a raw pointer, like *mut u8. Raw pointers are safe to create but require unsafe to dereference. Even though they are different types, a Vec<u8> or a &mut [u8] can hand out a raw pointer to its data without issue (for example with as_mut_ptr()).
Secondly, the capacity of a Vec is different from its length. To perform well, a Vec can allocate more memory than is in use, so that it doesn't have to reallocate every time an element is added. The length, however, is the size of the used part. In your case, you ask the Vec to allocate heap space for length_buffer elements, but you never tell it to treat any of that space as used, so the initial length is 0. Since the C++ side doesn't know about Vec and only sees a raw pointer, it can't change the length stored inside the Vec, which stays at 0. Hence the panic.
To resolve it, I see multiple solutions:
Change Vec::with_capacity(length_buffer) to vec![0; length_buffer], explicitly asking for a length of length_buffer from the start.
Use unsafe code to set the length of the Vec explicitly without touching its contents (using Vec::set_len, or by rebuilding it with Vec::from_raw_parts). This might be faster than the first solution, but I'm not sure.
Use a boxed slice, Box<[u8]> (for example from vec![0; length_buffer].into_boxed_slice()), which is like a Vec but without reallocation and with a length equal to its capacity.
If your length_buffer is constant at compile time, using a [u8; length_buffer] avoids the heap allocation entirely, but it comes with downsides, as you probably know.
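Since the camera library is C/C++, the length-versus-capacity mechanism can be modelled from that side. The following is a rough C sketch, not Rust's actual implementation: ByteVec, bytevec_with_capacity, bytevec_zeroed and fill_frame are hypothetical names invented for illustration, with fill_frame standing in for the camera call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for Rust's Vec<u8>: data pointer, length, capacity. */
typedef struct {
    unsigned char *data;
    size_t len;   /* number of elements considered in use */
    size_t cap;   /* number of elements allocated */
} ByteVec;

/* Models Vec::with_capacity(cap): memory is reserved, length stays 0. */
ByteVec bytevec_with_capacity(size_t cap) {
    ByteVec v = { (unsigned char *)malloc(cap), 0, cap };
    if (v.data == NULL) v.cap = 0;
    return v;
}

/* Models vec![0; n]: memory is reserved and zeroed, length is already n. */
ByteVec bytevec_zeroed(size_t n) {
    ByteVec v = { (unsigned char *)calloc(n, 1), n, n };
    if (v.data == NULL) { v.len = 0; v.cap = 0; }
    return v;
}

/* Hypothetical stand-in for the camera call. It only receives a raw
   pointer, so it can write bytes but cannot update any length field. */
void fill_frame(unsigned char *imgdata, size_t n) {
    memset(imgdata, 0x7F, n);
}
```

After fill_frame(v.data, v.cap) on a buffer from bytevec_with_capacity, the bytes are in memory but v.len is still 0, which mirrors the situation in the question. Starting from bytevec_zeroed instead gives a length that already matches the data the callee writes.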

OpenCL Image reading not working on dynamically allocated array

I'm writing an OpenCL program that applies a convolution matrix to an image. Everything works fine if I store all the pixels in an array image[height*width][4] (line 65, commented out) (sorry, I speak Spanish, and I code mostly in Spanish). But since the images I'm working with are really large, I need to allocate the memory dynamically. When I execute the code, I get a segmentation fault.
After some poor man's debugging, I found out the problem arises after executing the kernel and reading the output image back into the host, storing the data into the dynamically allocated array. I just can't access the data of the array without getting the error.
I think the problem is the way the clEnqueueReadImage function (line 316) writes the image data into the image array. This array was allocated dynamically, so it has no predefined "structure".
But I need a solution, and I can't find one, either on my own or on the Internet.
The C program and the OpenCL kernel are here:
https://gist.github.com/MigAGH/6dd0fddfa09f5aabe7eb0c2934e58cbe

Don't use pointers to pointers (unsigned char**). Use a regular pointer instead:
unsigned char* image = (unsigned char*)malloc(sizeof(unsigned char)*ancho*alto*4);
Then in the for loop:
for(i=0; i<ancho*alto; i++){
unsigned char* pixel = (unsigned char*)malloc(sizeof(unsigned char)*4);
fread (pixel, 4, 1, bmp_entrada);
image[i*4] = pixel[0];
image[i*4+1] = pixel[1];
image[i*4+2] = pixel[2];
image[i*4+3] = pixel[3];
free(pixel);
}
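The reason a single pointer works is that clEnqueueReadImage writes into one contiguous region of host memory, whereas an array of separately allocated rows is not contiguous. A minimal sketch of the flat layout follows, assuming tightly packed 4-byte pixels with no row padding; pixel_index and read_pixels are hypothetical helper names, and width and height correspond to ancho and alto above.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Offset of channel c of pixel (x, y) in a tightly packed 4-channel image. */
size_t pixel_index(size_t x, size_t y, size_t width, size_t c) {
    return (y * width + x) * 4 + c;
}

/* Reads width*height 4-byte pixels from `in` into one contiguous block.
   Returns NULL on allocation failure or a short read; the caller frees. */
unsigned char *read_pixels(FILE *in, size_t width, size_t height) {
    size_t count = width * height;
    unsigned char *image = (unsigned char *)malloc(count * 4);
    if (image == NULL) return NULL;
    /* One fread fills the whole buffer; no per-pixel allocation is needed. */
    if (fread(image, 4, count, in) != count) {
        free(image);
        return NULL;
    }
    return image;
}
```

The same block can then be handed to the OpenCL read and write calls as the host pointer, and individual channels are reached with image[pixel_index(x, y, width, c)] in place of image[i][c].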

OpenCL kernel fails to compile asking for address space qualifier

The following OpenCL code fails to compile.
typedef struct {
double d;
double* da;
long* la;
uint ui;
} MyStruct;
__kernel void MyKernel (__global MyStruct* s) {
}
The error message is as follows.
line 11: error: kernel pointer arguments must point to addrSpace global, local, or constant
__kernel void MyKernel (__global MyStruct* s) {
^
As you can see I have clearly qualified the argument with '__global' as the error suggests I should. What am I doing wrong and how can I resolve this error?
Obviously this happens during kernel compilation so I haven't posted my host code here as it doesn't even get further than this.
Thanks.

I think the problem is that you have pointers in your struct, which is not allowed. You cannot point to host memory from your kernel like that, so pointers in kernel argument structs don't make much sense. Variable-sized arrays are backed in OpenCL by a cl_mem object on the host, and that counts as one whole argument, so as far as I know you can only pass variable-sized arrays directly as kernel arguments (and adjust the number of work units accordingly, of course).
You might prefer to put size information in your struct and pull out the arrays as standalone kernel arguments.

Global variable touched by a passed-in parameter becomes unusable

folks!
I pass a struct full of data to my kernel, and I run into the following difficulty using it (very stripped down):
[edit: mac osx / xcode 3.2 on mac book pro; this compile is obviously for cpu]
typedef struct
{
float xoom;
int sizex;
} varholder;
float zX, xd;
__kernel void Harlan( __global varholder * vh )
{
int X = get_global_id(0), Y = get_global_id(1);
zX = ( ( X - vh->sizex/2 ) / vh->xoom + vh->sizex/2 ); // (a)
xd = zX; // (b) BOOM!!
}
After executing line (a), the line marked (b), a simple assignment, gives "LLVM compiler failed to compile a function".
If, however, we do not execute line (a), then line (b) is fine.
So, through my fiddling around a LOT with this, it seems as if it is the assignment statement (a), which uses a passed-in parameter, that messes up the future access of the variable zX. However, of course I need to be able to use the results of calculations further down the line.
I have zX and xd declared at the file level because my helper functions need them.
Any thoughts?
Thanks!
David
p.s. I'm now registered so will be able to upvote and accept answers, which I am sadly unable to do for the last person who helped me (used same username to register, but can't seem to vote on the old post; sorry!).

No, say it ain't so!
I am sincerely hoping that this is not a "correct" answer to my own question. I found on another forum (though not the same question asked!) the following, and I am afraid that it refers to what I'm trying to do:
(quote)
You're doing something the standard prohibits. Section 6.5 says:
'All program scope variables must be declared in the __constant address space.'
In other words, program scope variables cannot be mutable.
(end quote)
... well, tcha!!!! What an astoundingly inconvenient restriction! I'm sure there's reasoning behind it.
[edit: Not At All inconvenient! it was in fact astonishingly easy to work around, given a fresh start the next morning. (And no alcohol.)]
You guys & dolls all knew this, right, and didn't have the heart to tell me?...

Using memcpy to change a jnz to a jmp

Not used memcpy much but here's my code that doesn't work.
memcpy((PVOID)(enginebase+0x74C9D),(void *)0xEB,2);
(enginebase+0x74C9D) is a pointer location to the address of the bytes that I want to patch.
(void *)0xEB is the op code for the kind of jmp that I want.
The only problem is that this crashes the instant the line runs. I don't know what I'm doing wrong; any insight?

The argument (void*)0xEB is saying to copy memory from address 0xEB; presumably you want something more like
unsigned char x = 0xEB;
memcpy((void*)(enginebase+0x74c9d), (void*)&x, 1);
in order to properly copy the value 0xEB to the destination. Note that the length is 1 here rather than 2: only a single byte is being copied, and copying 2 bytes from a one-byte variable would read past it. I'm also under the assumption that you can't just do
((char*)enginebase)[0x74c9d] = 0xEB;
for some reason? (I don't have any experience overwriting program memory intentionally)

memcpy() expects two pointers, one for the destination buffer and one for the source buffer. Your second argument is not a pointer but rather the data itself (it is the opcode of the jmp, as you described it). If I understand correctly what you are trying to do, you should set up an array with the opcode as its contents and give memcpy() a pointer to that array.
The program crashes because you try to read from a memory location outside your assigned space (address 0xEB).
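Both answers come down to giving memcpy the address of the byte rather than the byte itself. A minimal sketch follows, in which patch_byte is a hypothetical helper and an ordinary byte array stands in for the engine's code. Real code pages are typically not writable, so patching a live module would usually also need its memory protection changed first, which this sketch does not attempt.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Overwrites the single byte at base + offset with `opcode`.
   The source passed to memcpy is the ADDRESS of the byte, and the
   length is 1 because exactly one byte is being replaced. */
void patch_byte(unsigned char *base, size_t offset, unsigned char opcode) {
    memcpy(base + offset, &opcode, sizeof opcode);
}
```

For a short conditional jump such as 0x75 (jnz rel8), replacing only the opcode byte with 0xEB (jmp rel8) keeps the displacement byte that follows it, so the jump target stays the same.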
