Question about pointer alignment

I'm working on a memory pool implementation and I'm a little confused about pointer alignment...
Suppose I have a memory pool that hands out fixed-size memory blocks; at creation time the pool does malloc(size * num_blocks). If the blocks hold objects and the size comes from the sizeof operator, alignment shouldn't be a concern. But if the size is uneven (say the caller wants 100-byte blocks for whatever reason), then when I split the chunk returned by malloc I'd end up with unaligned pointers. My question is: should I always align the blocks to some boundary, and if so, which one?

Proper alignment is at least helpful (performance-wise) on most x86 implementations, and some kind of alignment is actually mandatory on other architectures. You might ask (like calloc does) for a pair of arguments, size of items in bytes and number of items, rather than just one total size in bytes (like malloc does); then you can intrinsically align by rounding block sizes up to the next-higher power of 2 above the item size (but switch to multiples of 16 bytes above 16; don't keep doubling forever, just as #derobert recommends and explains). This way, if a caller just wants N bytes without any alignment or padding, they can always ask for N items of 1 byte each (just as they would with calloc, and for the same reason).
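A minimal sketch of that rounding rule (the helper name is hypothetical; it assumes the pool's backing chunk comes from malloc, which is itself at least 16-byte aligned on typical 64-bit platforms):
#include <cstddef>

// Hypothetical helper: round a requested item size up so that consecutive
// blocks carved out of a malloc'd chunk stay aligned. Sizes up to 16 round
// up to the next power of two; larger sizes round up to a multiple of 16
// (no point doubling forever).
static std::size_t pool_block_size(std::size_t item_size)
{
    if (item_size <= 16) {
        std::size_t p = 1;
        while (p < item_size) p <<= 1;   // 1, 2, 4, 8, 16
        return p;
    }
    return (item_size + 15) & ~static_cast<std::size_t>(15);
}

// Example: a caller asking for 100-byte blocks gets 112-byte blocks, so every
// block handed out stays 16-byte aligned; a caller who wants raw, unpadded
// bytes can ask for N items of 1 byte each instead.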

x86 will work without alignment, but performance is better when data is aligned. The alignment for a type is generally sizeof(type), up to a maximum of 16 bytes.
I wrote this silly test program just to be sure (assuming malloc knows what it's doing), and it prints 16 on my amd64 box. It prints 8 when compiled in 32-bit mode:
#include <stdlib.h>
#include <stdio.h>

int main() {
    int i;
    unsigned long used_bits = 0, alignment;

    for (i = 0; i < 1000; ++i) {
        used_bits |= (unsigned long)malloc(1); /* common sizes */
        used_bits |= (unsigned long)malloc(2);
        used_bits |= (unsigned long)malloc(4);
        used_bits |= (unsigned long)malloc(8);
        used_bits |= (unsigned long)malloc(16);
        used_bits |= (unsigned long)malloc(437); /* random number */
    }

    alignment = 1;
    while (!(used_bits & alignment)) {
        alignment <<= 1;
    }

    printf("Alignment is: %lu\n", alignment);
    return 0;
}

Related

OpenCL 'non-blocking' reads have higher cost than expected

Consider the following code, which enqueues between 1 and 100000 'non-blocking' random access buffer reads and measures the time:
#define __CL_ENABLE_EXCEPTIONS
#include <CL/cl.hpp>
#include <vector>
#include <iostream>
#include <chrono>
#include <stdio.h>

static const int size = 100000;
int host_buf[size];

int main() {
    cl::Context ctx(CL_DEVICE_TYPE_DEFAULT, nullptr, nullptr, nullptr);
    std::vector<cl::Device> devices;
    ctx.getInfo(CL_CONTEXT_DEVICES, &devices);
    printf("Using OpenCL devices: \n");
    for (auto &dev : devices) {
        std::string dev_name = dev.getInfo<CL_DEVICE_NAME>();
        printf("  %s\n", dev_name.c_str());
    }

    cl::CommandQueue queue(ctx);
    cl::Buffer gpu_buf(ctx, CL_MEM_READ_WRITE, sizeof(int) * size, nullptr, nullptr);
    std::vector<int> values(size);

    // Warmup
    queue.enqueueReadBuffer(gpu_buf, false, 0, sizeof(int), &(host_buf[0]));
    queue.finish();

    // Run from 1 to 100000 sized chunks
    for (int k = 1; k <= size; k *= 10) {
        auto cstart = std::chrono::high_resolution_clock::now();
        for (int j = 0; j < k; j++)
            queue.enqueueReadBuffer(gpu_buf, false, sizeof(int) * (j * (size / k)), sizeof(int), &(host_buf[j]));
        queue.finish();
        auto cend = std::chrono::high_resolution_clock::now();
        double time = std::chrono::duration<double>(cend - cstart).count() * 1000000.0;
        printf("%8d: %8.02f us\n", k, time);
    }
    return 0;
}
As always, there is some random variation but the typical output for me is like this:
1: 10.03 us
10: 107.93 us
100: 794.54 us
1000: 8301.35 us
10000: 83741.06 us
100000: 981607.26 us
Whilst I did expect a relatively high latency for a single read, given the need for a PCIe round trip, I am surprised at the high cost of adding subsequent reads to the queue - as if there isn't really a 'queue' at all but each read adds the full latency penalty. This is on a GTX 960 with Linux and driver version 455.45.01.
Is this expected behavior?
Do other GPUs behave the same way?
Is there any workaround other than always doing random-access reads from inside a kernel?
You are using a single in-order command queue, so all enqueued reads are performed sequentially by the hardware/driver.
The 'non-blocking' aspect simply means that the call itself is asynchronous and will not block your host code while the GPU is working.
In your code, you call finish() (clFinish), which blocks until all reads are done.
So yes, this is the expected behavior. You will pay the full time penalty for each DMA transfer.
As long as you create an in-order command queue (the default), other GPUs will behave the same way.
If your hardware/driver supports out-of-order queues, you could use them to potentially overlap DMA transfers. Alternatively, you could use multiple in-order queues. But the performance is of course hardware- and driver-dependent.
Using multiple queues / out-of-order queues is a bit more advanced. You should make sure to properly use events to avoid race conditions or undefined behavior.
To reduce the latency associated with GPU-host DMA transfers, it is recommended to use a pinned host buffer rather than a std::vector. Pinned host buffers are usually created via clCreateBuffer with the CL_MEM_ALLOC_HOST_PTR flag.
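A minimal sketch of that suggestion, reusing ctx, queue, gpu_buf and size from the question's code (the variable names here are illustrative):
cl::Buffer pinned(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, sizeof(int) * size);
// Mapping the buffer yields a host pointer backed by pinned (page-locked)
// memory, which the driver can usually DMA to/from more cheaply than a
// pageable std::vector.
int *host_ptr = static_cast<int *>(
    queue.enqueueMapBuffer(pinned, CL_TRUE, CL_MAP_READ | CL_MAP_WRITE,
                           0, sizeof(int) * size));

// Device-to-host reads can now target the pinned region directly.
queue.enqueueReadBuffer(gpu_buf, CL_FALSE, 0, sizeof(int) * size, host_ptr);
queue.finish();

// Release the mapping when done.
queue.enqueueUnmapMemObject(pinned, host_ptr);
queue.finish();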

QImage data alignment

The documentation of
QImage::QImage(uchar *data, int width, int height, Format format, QImageCleanupFunction cleanupFunction = Q_NULLPTR, void *cleanupInfo = Q_NULLPTR)
describes that the data, referred to by the parameter 'data', must be 32-bit aligned. http://doc.qt.io/qt-5/qimage.html#QImage-3 But it's at least unclear what exactly is meant. I assumed each pixel takes 32 bits, but that is not the case. Constructing an image like this works:
uint8_t* rgb = new uint8_t[3 * height * width];
QImage Img(rgb, width, height, QImage::Format_RGB888);
But this is confusing. When I want to get the pixel values from the image, I thought I needed to do this (since the data is 32-bit aligned and QRgb is 32 bits):
QRgb *rawPixelData = (QRgb*) Img.bits();
for (uint32_t i = 0; i < (Img.width() * Img.height()); ++i)
{
    qDebug() << "Red" << qRed(rawPixelData[i]);
    qDebug() << "Green" << qGreen(rawPixelData[i]);
    qDebug() << "Blue" << qBlue(rawPixelData[i]);
}
But this is not working (it leads to a crash). So I assume the data is not 32-bit aligned. So, isn't the data 32-bit aligned, or am I misunderstanding something?
I assume that by the "data" they mean the array of bytes used, and by alignment they mean that the first byte of the array is 32-bit aligned, i.e. data % 4 is always 0. It is not the internal alignment of every pixel, just the alignment of the memory block that contains the pixel data.
Furthermore, bits() returns a pointer to an unsigned byte, not a pointer to a QRgb. A QRgb is essentially just an integer:
typedef unsigned int QRgb;
I suspect you are getting a crash because the raw data is "compacted". Meaning that if your image has only RGB and no alpha, it will use only 24bits or 3 bytes per pixel, because that would eliminate a 25% memory usage overhead. As a result, you are walking off the actual data and getting a crash.
You should try iterating it as w * h * 3 unsigned chars and incrementing by 3 for each next pixel, and your rgb would be respectively the bytes at i, i+1, i+2.
It could probably work if your image format was RGBA.
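A minimal sketch of that byte-wise iteration (it assumes Format_RGB888; constScanLine() is used per row so any per-row padding is handled for you):
#include <QImage>
#include <QDebug>

// Walk a Format_RGB888 image three bytes per pixel: R, G, B in that order.
void dumpPixels(const QImage &img)
{
    for (int y = 0; y < img.height(); ++y) {
        const uchar *line = img.constScanLine(y); // start of row y's pixel data
        for (int x = 0; x < img.width(); ++x) {
            int r = line[3 * x];
            int g = line[3 * x + 1];
            int b = line[3 * x + 2];
            qDebug() << "Red" << r << "Green" << g << "Blue" << b;
        }
    }
}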
And indeed, if you check byteCount() you'll see that the number of bytes used internally is the minimum for the given format:
QImage img(100, 100, QImage::Format_RGB888);
qDebug() << img.byteCount();  // 30000 - 3 bytes (24 bits) per pixel
QImage img2(100, 100, QImage::Format_RGB555);
qDebug() << img2.byteCount(); // 20000 - 2 bytes (15 bits used) per pixel
QImage img3(100, 100, QImage::Format_RGBA8888);
qDebug() << img3.byteCount(); // 40000 - 4 bytes (32 bits) per pixel
But it's at least unclear what is meant exactly.
The expression is part of the software engineering vernacular and has nothing to do with the specific situation at hand: it doesn't have anything to do with Qt nor images nor pixels.
On platforms where Qt is supported, it has the following strict meaning:
uchar *data = ...;
Q_ASSERT((reinterpret_cast<uintptr_t>(data) & 3) == 0);
Or, on an arbitrary C++17 platform, it has the following strict meaning:
size_t size = ...;
uchar *data = ...;
Q_ASSERT(std::align(4, size, reinterpret_cast<void*&>(data), size) ==
         reinterpret_cast<void*>(data));

Simple Vector Geometric Progression Design in OpenCL

I'm new to OpenCL and in order to get a better grasp of a few concepts I contrived a simple example of a geometric progression as follows (emphasis on contrived):
An array of N values and N coefficients (whose values could be anything, but in the example they are all the same) is allocated.
M steps are performed in sequence, where each value in the values array is multiplied by its corresponding coefficient in the coefficients array and assigned as the new value in the values array. Each step needs to fully complete before the next step can begin. I know this part is a bit contrived, but it is a requirement I want to enforce to help my understanding of OpenCL.
I'm only interested in the values in the values array after the final step has completed.
Here is the very simple OpenCL kernel (MultiplyVectors.cl):
__kernel void MultiplyVectors (__global float4* x, __global float4* y, __global float4* result)
{
    int i = get_global_id(0);
    result[i] = x[i] * y[i];
}
And here is the host program (main.cpp):
#include <CL/cl.hpp>
#include <vector>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main ()
{
    auto context = cl::Context (CL_DEVICE_TYPE_GPU);

    auto *sourceFile = fopen("MultiplyVectors.cl", "r");
    if (sourceFile == nullptr)
    {
        perror("Couldn't open the source file");
        return 1;
    }
    fseek(sourceFile, 0, SEEK_END);
    const auto sourceSize = ftell(sourceFile);
    auto *sourceBuffer = new char [sourceSize + 1];
    sourceBuffer[sourceSize] = '\0';
    rewind(sourceFile);
    fread(sourceBuffer, sizeof(char), sourceSize, sourceFile);
    fclose(sourceFile);

    auto program = cl::Program (context, cl::Program::Sources {std::make_pair (sourceBuffer, sourceSize + 1)});
    delete[] sourceBuffer;

    const auto devices = context.getInfo<CL_CONTEXT_DEVICES> ();
    program.build (devices);
    auto kernel = cl::Kernel (program, "MultiplyVectors");

    const size_t vectorSize = 1024;

    float coeffs[vectorSize] {};
    for (size_t i = 0; i < vectorSize; ++i)
    {
        coeffs[i] = 1.000001;
    }
    auto coeffsBuffer = cl::Buffer (context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof (coeffs), coeffs);

    float values[vectorSize] {};
    for (size_t i = 0; i < vectorSize; ++i)
    {
        values[i] = static_cast<float> (i);
    }
    auto valuesBuffer = cl::Buffer (context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, sizeof (values), values);

    kernel.setArg (0, coeffsBuffer);
    kernel.setArg (1, valuesBuffer);
    kernel.setArg (2, valuesBuffer);

    auto commandQueue = cl::CommandQueue (context, devices[0]);

    for (size_t i = 0; i < 1000000; ++i)
    {
        commandQueue.enqueueNDRangeKernel (kernel, cl::NDRange (0), cl::NDRange (vectorSize / 4), cl::NullRange);
    }

    printf ("All kernels enqueued. Waiting to read buffer after last kernel...");
    commandQueue.enqueueReadBuffer (valuesBuffer, CL_TRUE, 0, sizeof (values), values);
    return 0;
}
What I'm basically asking is for advice on how to best optimize this OpenCL program to run on a GPU. I have the following questions based on my limited OpenCL experience to get the conversation going:
Could I be handling the buffers better? I'd like to minimize any unnecessary ferrying of data between the host and the GPU.
What's the optimal work group configuration (in general at least; I know this can vary by GPU)? I'm not actually sharing any data between work items, and it doesn't seem like I'd benefit much from work groups here, but just in case.
Should I be allocating and loading anything into local memory for a work group (if that would make sense at all)?
I'm currently enqueueing one kernel for each step, which will create a work item for each 4 floats to take advantage of a hypothetical GPU with a SIMD width of 128 bits. I'm attempting to enqueue all of this asynchronously (although I'm noticing the Nvidia implementation I have seems to block on each enqueue until the kernel is complete) and then wait on the final one to complete. Is there a better approach to this that I'm missing?
Is there a design that would allow for only one call to enqueueNDRangeKernel (instead of one call per step) while maintaining the ability for each step to be efficiently processed in parallel?
Obviously I know that the example problem I'm solving can be done in much better ways, but I wanted as simple an example as possible that illustrates a vector of values being operated on in a series of steps, where each step has to be completed fully before the next. Any help and pointers on how to best go about this would be greatly appreciated.
Thanks!

Infrared Arduino Custom Serial

Basically I am looking to make a serial-like system that runs communication between IR LEDs on an Arduino. Below, the code gets to the point of having an array with a collection of 1s and 0s in it. I need to convert this 8-bit array into a single character and output it, but I don't know how to do this. Help would be appreciated.
int IR_serial_read() {
    int output_val;
    int current_byte[7];
    int counter = 0;

    IR_serial_port = digitalRead(4);
    if (IR_serial_port == HIGH) {
        output_val = 1;
    }
    if (IR_serial_port == LOW) {
        output_val = 0;
    }
    current_byte[counter] = output_val;
    counter += 1;
}
This would best be done with bitwise operators. I think the OR operation would be of most use here, as it sets a bit if the input is 1 and leaves it unchanged if it is 0; you can use a loop to go through your array and set the bits.
Looking at your code, are you sure you are receiving all 8 bits? You seem to be saving only 7 bits.
Since you are creating a byte array solely for the purpose of holding 1s and 0s, I suggest setting the bits immediately in the same loop.
Here is the code that I suggest:
byte inputByte = 0;                   // Result from IR transfer. Bits are set progressively.
for (byte bit = 0; bit < 8; bit++) {  // Read the IR receiver once for each bit in the byte
    byte mask = digitalRead(4);       // digitalRead returns 1 or 0 for HIGH and LOW
    mask <<= bit;                     // Shift that 1 or 0 into the bit of the byte we are on
    inputByte |= mask;                // Set the bit of the byte depending on what was received
}
This would also be put inside a loop to read all the bytes in your data stream.
It is designed for readability and can be optimised further. It reads the least significant bit first.
You can also apply the same technique to your array if you wish to keep using an array of bytes just for 1s and 0s: just replace the digitalRead with the array element (current_byte[bit]).
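Applied to the original code's array, the same idea looks roughly like this (assuming the array is extended to hold all 8 samples, least significant bit first):
// Pack an array of 0/1 samples into a single byte, LSB first.
byte packBits(const int bits[8]) {
    byte result = 0;
    for (byte i = 0; i < 8; i++) {
        if (bits[i]) {
            result |= (1 << i);   // set bit i when the sampled value was 1
        }
    }
    return result;
}

// Usage: char c = (char) packBits(current_byte);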

Struct Stuffing Incorrectly

I have the following union of a struct and a byte buffer:
typedef union
{
    struct
    {
        unsigned char  ID;
        unsigned short Vdd;
        unsigned char  B1State;
        unsigned short B1FloatV;
        unsigned short B1ChargeV;
        unsigned short B1Current;
        unsigned short B1TempC;
        unsigned short B1StateTimer;
        unsigned short B1DutyMod;
        unsigned char  B2State;
        unsigned short B2FloatV;
        unsigned short B2ChargeV;
        unsigned short B2Current;
        unsigned short B2TempC;
        unsigned short B2StateTimer;
        unsigned short B2DutyMod;
    } bat_values;
    unsigned char buf[64];
} BATTERY_CHARGE_STATUS;
and I am stuffing it from an array as follows:
for (unsigned char ii = 0; ii < 64; ii++) usb_debug_data.buf[ii] = inBuffer[ii];
I can see that the array has the following (arbitrary) values:
inBuffer[0] = 80;
inBuffer[1] = 128;
inBuffer[2] = 12;
inBuffer[3] = 0;
inBuffer[4] = 23;
...
Now I want to display these values by changing the text of a QLineEdit:
str = QString::number((int)usb_debug_data.bat_values.ID);
ui->batID->setText(str);
str = QString::number((int)usb_debug_data.bat_values.Vdd);
ui->Vdd->setText(str);
str = QString::number((int)usb_debug_data.bat_values.B1State);
ui->B1State->setText(str);
...
However, the QLineEdit text values are not turning out as expected. I see the following:
usb_debug_data.bat_values.ID = 80 (correct)
usb_debug_data.bat_values.Vdd = 12 (incorrect)
usb_debug_data.bat_values.B1State = 23 (incorrect)
It seems like 'usb_debug_data.bat_values.Vdd', which is a short, is not taking its value from inBuffer[1] and inBuffer[2]. Likewise, 'usb_debug_data.bat_values.B1State' should get its value from inBuffer[3] but for some reason is picking up its value from inBuffer[4].
Any idea why this is happening?
C and C++ are free to insert padding between the elements of a structure, and beyond the last element, for whatever purpose the implementation desires (usually efficiency, but sometimes because the underlying architecture does not allow unaligned access at all).
So you'll probably find that items of two-bytes length are aligned to two-byte boundaries, so you'll end up with something like:
unsigned char ID; // 1 byte
// 1 byte filler, aligns following short
unsigned short Vdd; // 2 bytes
unsigned char B1State; // 1 byte
// 3 bytes filler, aligns following int
unsigned int myVar; // 4 bytes
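If you want to see the layout your particular compiler chose, a quick illustrative check is to print the actual member offsets of the union from the question (this assumes the BATTERY_CHARGE_STATUS typedef above is in scope):
#include <stdio.h>

// Prints where each member really starts; with typical 2-byte alignment for
// shorts you would expect Vdd at offset 2 rather than 1.
int main(void) {
    BATTERY_CHARGE_STATUS s;
    unsigned char *base = (unsigned char *)&s;
    printf("ID       at %td\n", (unsigned char *)&s.bat_values.ID       - base);
    printf("Vdd      at %td\n", (unsigned char *)&s.bat_values.Vdd      - base);
    printf("B1State  at %td\n", (unsigned char *)&s.bat_values.B1State  - base);
    printf("B1FloatV at %td\n", (unsigned char *)&s.bat_values.B1FloatV - base);
    printf("sizeof bat_values = %zu\n", sizeof s.bat_values);
    return 0;
}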
Many compilers will allow you to specify how to pack structures, such as with:
#pragma pack(1)
or, for gcc, the
__attribute__((packed))
attribute.
If you don't want to (or can't) pack your structures, you can revert to field-by-field copying (probably best in a function):
void copyData (BATTERY_CHARGE_STATUS *bsc, unsigned char *debugData) {
    memcpy (&(bsc->bat_values.ID), debugData, sizeof (bsc->bat_values.ID));
    debugData += sizeof (bsc->bat_values.ID);

    memcpy (&(bsc->bat_values.Vdd), debugData, sizeof (bsc->bat_values.Vdd));
    debugData += sizeof (bsc->bat_values.Vdd);

    : : :

    memcpy (&(bsc->bat_values.B2DutyMod), debugData, sizeof (bsc->bat_values.B2DutyMod));
    debugData += sizeof (bsc->bat_values.B2DutyMod); // Not really needed
}
It's a pain that you have to keep the structure and function synchronised but hopefully it won't be changing that much.
Structs are not packed by default so the compiler is free to insert padding between members. The most common reason is to ensure some machine dependent alignment. The wikipedia entry on data structure alignment is a pretty good place to start. You essentially have two choices:
insert compiler-specific pragmas to force alignment (e.g., #pragma pack(1) or __attribute__((packed))).
write explicit serialization and deserialization functions to transform your structures into and from byte arrays
I usually prefer the latter since it doesn't litter my code with little compiler-specific adornments everywhere.
The next thing that you are likely to discover is that the byte order for multi-byte integers is also platform specific. Look up endianness for more details.
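A minimal sketch of the second option, assuming the device sends the fields packed and little-endian (the helper names are illustrative):
// Read fixed-width little-endian values from the raw byte stream, advancing
// a cursor as we go, so the host struct's padding never matters.
static unsigned char read_u8(const unsigned char *&p) {
    return *p++;
}
static unsigned short read_u16_le(const unsigned char *&p) {
    unsigned short v = (unsigned short)(p[0] | (p[1] << 8));
    p += 2;
    return v;
}

void deserialize(BATTERY_CHARGE_STATUS &s, const unsigned char *inBuffer) {
    const unsigned char *p = inBuffer;
    s.bat_values.ID       = read_u8(p);      // byte 0
    s.bat_values.Vdd      = read_u16_le(p);  // bytes 1-2
    s.bat_values.B1State  = read_u8(p);      // byte 3
    s.bat_values.B1FloatV = read_u16_le(p);  // bytes 4-5
    // ... continue in wire order for the remaining fields
}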
