I am trying to optimize a working compute shader. Its purpose is to create an image: for each pixel it finds the right color (using a small palette) and calls imageStore(image, ivec2, vec4).
The colors are indexed in an array of uint stored in a uniform buffer (UBO).
Each color in this UBO is packed into one uint, as {0-255, 0-255, 0-255, 0-255}.
Here is the code:
struct Entry
{
    // *some other data*
    uint rgb;
};
layout(binding = 0) uniform SConfiguration
{
    Entry materials[MATERIAL_COUNT];
} configuration;
void main()
{
    Entry material = configuration.materials[currentMaterialId];
    float r = (material.rgb >> 16) / 255.;
    float g = ((material.rgb & G_MASK) >> 8) / 255.;
    float b = (material.rgb & B_MASK) / 255.;
    imageStore(outImage, ivec2(gl_GlobalInvocationID.xy), vec4(r, g, b, 0.0));
}
I would like to clean/optimize a bit, because this color conversion looks bad/useless in the shader (and should be precomputed). My question is:
Is it possible to directly pack a vec4(r, g, b, 0.0) inside the UBO, using 4 bytes (like a R8G8B8A8) ?

Is it possible to do it directly? No.
But GLSL does have a number of functions for packing/unpacking normalized values. In your case, you can pass the value as a single uint uniform, then use unpackUnorm4x8 to convert it to a vec4. So your code becomes:
vec4 color = unpackUnorm4x8(material.rgb);
This is, of course, a memory-vs-performance tradeoff. So if memory isn't an issue, you should probably just pass a vec4 directly (never use vec3 in a uniform block; its alignment rules are a common source of layout bugs).
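On the host side, the precomputation the question asks about could look like the sketch below. It assumes the application fills the UBO from C; the helper names pack_unorm4x8 and unpack_unorm4x8 are made up for illustration and are not part of any API. The one fact it relies on is that GLSL's unpackUnorm4x8 takes its first (x) component from the least significant byte, which differs from the layout in the question's shader, where red sits in bits 16-23.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: packs four 8-bit channels so that GLSL's
 * unpackUnorm4x8 would return them in (r, g, b, a) order. */
static uint32_t pack_unorm4x8(uint8_t r, uint8_t g, uint8_t b, uint8_t a)
{
    return (uint32_t)r
         | ((uint32_t)g << 8)
         | ((uint32_t)b << 16)
         | ((uint32_t)a << 24);
}

/* CPU-side mirror of the unpacking, handy for checking values before upload. */
static void unpack_unorm4x8(uint32_t packed, float out[4])
{
    assert(out != NULL);
    for (int i = 0; i < 4; ++i) {
        /* Byte i (counting from the least significant) becomes component i. */
        out[i] = (float)((packed >> (8 * i)) & 0xFFu) / 255.0f;
    }
}
```

If the existing data must keep red in the high bits, a swizzle in the shader after unpacking is an alternative to repacking on the host.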

Is it possible to directly pack a vec4(r, g, b, 0.0) inside the UBO, using 4 bytes (like a R8G8B8A8) ?
There is no way to express this directly as 4 single-byte values; there is no appropriate data type in the shader that would let you declare this as a byte type.
However, why do you think you need to? Just upload it as 4 floats - it's a uniform so it's not like you are replicating it thousands of times, so the additional size is unlikely to be a problem in practice.


Computing the memory footprint (or byte length) of a map

I want to limit a map to a maximum of X bytes. There seems to be no straightforward way of computing the byte length of a map, though.
"encoding/binary" package has a nice Size function, but it only works for slices or "fixed values", not for maps.
I could try to get all key/value pairs from the map, infer their type (if it's a map[string]interface{}) and compute the length - but that would be both cumbersome and probably incorrect (because that would exclude the "internal" Go cost of the map itself - managing pointers to elements etc).
Any suggested way of doing this? Preferably a code example.
This is the definition for a map header:
// A header for a Go map.
type hmap struct {
// Note: the format of the Hmap is encoded in ../../cmd/gc/reflect.c and
// ../reflect/type.go. Don't change this structure without also changing that code!
count int // # live cells == size of map. Must be first (used by len() builtin)
flags uint32
hash0 uint32 // hash seed
B uint8 // log_2 of # of buckets (can hold up to loadFactor * 2^B items)
buckets unsafe.Pointer // array of 2^B Buckets. may be nil if count==0.
oldbuckets unsafe.Pointer // previous bucket array of half the size, non-nil only when growing
nevacuate uintptr // progress counter for evacuation (buckets less than this have been evacuated)
}
Calculating its size is pretty straightforward (unsafe.Sizeof).
This is the definition for each individual bucket the map points to:
// A bucket for a Go map.
type bmap struct {
tophash [bucketCnt]uint8
// Followed by bucketCnt keys and then bucketCnt values.
// NOTE: packing all the keys together and then all the values together makes the
// code a bit more complicated than alternating key/value/key/value/... but it allows
// us to eliminate padding which would be needed for, e.g., map[int64]int8.
// Followed by an overflow pointer.
}
bucketCnt is a constant defined as:
const (
    bucketCntBits = 3
    bucketCnt = 1 << bucketCntBits // equals decimal 8
)
The final calculation would be:
unsafe.Sizeof(hmap) + (len(theMap) * 8) + (len(theMap) * 8 * unsafe.Sizeof(x)) + (len(theMap) * 8 * unsafe.Sizeof(y))
Where theMap is your map value, x is a value of the map's key type and y a value of the map's value type.
You'll have to share the hmap structure with your package via assembly, analogously to thunk.s in the runtime.
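As a worked example, the sketch below transcribes the estimate above into plain arithmetic. C is used only to show the numbers; the function name and the 48-byte header size are assumptions (the latter being what the hmap fields listed above would occupy on a 64-bit platform). The result is only a rough figure: it ignores overflow buckets and anything the keys or values point to, such as string contents.

```c
#include <assert.h>
#include <stddef.h>

/* Transcribes the estimate above: header + tophash bytes + keys + values.
 * header_size stands in for unsafe.Sizeof of the hmap struct. */
static size_t estimate_map_bytes(size_t header_size, size_t entries,
                                 size_t key_size, size_t value_size)
{
    const size_t bucket_cnt = 8; /* bucketCnt from the runtime source quoted above */
    return header_size
         + entries * bucket_cnt
         + entries * bucket_cnt * key_size
         + entries * bucket_cnt * value_size;
}
```

For 1000 entries with 8-byte keys and 8-byte values, estimate_map_bytes(48, 1000, 8, 8) gives 48 + 8000 + 64000 + 64000 = 136048 bytes.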

QMap Memory Error

I am working on a project in which I define data types like these:
typedef QVector<double> QFilterDataMap1D;
typedef QMap<double, QFilterDataMap1D> QFilterDataMap2D;
Then there is a class named mono_data in which I have defined this variable:
QFilterDataMap2D valid_filters;
mono_data Scan_Data; // instance of the class
Now I am reading a variable from a .mat file and trying to save it into the "valid_filters" QMap above.
for(int i=0;i<1;i++)
for(int j=0;j<1;j++)
The transfer completes successfully, but then I get this run-time error:
Windows has triggered a breakpoint in SpectralDataCollector.exe.
This may be due to a corruption of the heap, and indicates a bug in
SpectralDataCollector.exe or any of the DLLs it has loaded.
The output window may have more diagnostic information
Can anyone help me solve this problem? It would be a great help to me.
Different issues here:
1. Using double as key type for a QMap
Using a QMap<double, Foo> is a very bad idea. The reason is that this is a container that lets you access a Foo given a double. For instance:
map[0.45] = foo1;
map[15.74] = foo2;
This is problematic because, to retrieve the data contained in map[key], the map has to test whether key is equal to, smaller than, or greater than the other keys in the map. In your case the key is a double, and testing whether two doubles are equal is not a "safe" operation.
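To see why exact comparison of doubles is fragile, here is a small sketch in C (used for brevity; the function names are illustrative and have nothing to do with Qt). It contrasts the exact test with the tolerance-based check that is usually used instead.

```c
#include <assert.h>
#include <stdbool.h>

/* Exact comparison: true only if both doubles hold exactly the same value. */
static bool exactly_equal(double a, double b)
{
    return a == b;
}

/* Tolerance-based comparison; eps is an assumption you must pick for your data. */
static bool nearly_equal(double a, double b, double eps)
{
    assert(eps >= 0.0);
    double diff = a - b;
    if (diff < 0.0) {
        diff = -diff; /* absolute difference without needing <math.h> */
    }
    return diff <= eps;
}
```

With IEEE 754 doubles, exactly_equal(0.1 + 0.2, 0.3) is false, while nearly_equal(0.1 + 0.2, 0.3, 1e-9) is true. A key computed one way may therefore fail to find an entry stored under a key computed another way.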
2. Using an int as key while you defined it was double
i is an integer, and you said it should be a double.
3. Your loop only tests (i,j) = (0,0)
Are you aware that
for(int i=0;i<1;i++)
for(int j=0;j<1;j++)
is equivalent to:
4. Accessing a vector with operator[] is not safe
When you do:
You in fact do:
QFilterDataMap1D & v = Scan_Data.valid_filters[i]; // call QMap::operator[](double)
double d = v[j]; // call QVector::operator[](int)
The first one is safe, and creates the entry if it doesn't exist. The second one is not safe: the jth element of your vector must already exist, otherwise it will crash.
It seems you in fact want a 2D array of double (i.e., a matrix). To do this, use:
typedef QVector<double> QFilterDataMap1D;
typedef QVector<QFilterDataMap1D> QFilterDataMap2D;
Then, when you want to transfer one in another, simply use:
Scan_Data.valid_filters = valid_filters;
Or if you want to do it yourself:
for(int i=0;i<n;i++)
{
    Scan_Data.valid_filters << QFilterDataMap1D(); // append an empty row first
    for(int j=0;j<m;j++)
        Scan_Data.valid_filters[i] << valid_filters[i][j];
}
If you want a 3D matrix, you would use:
typedef QVector<QFilterDataMap2D> QFilterDataMap3D;

OpenCL void pointer arithmetic - strange behavior

I have written an OpenCL kernel that uses OpenCL-OpenGL interoperability to read vertices and indices, but this is probably not even important, because I am just doing simple pointer addition in order to get a specific vertex by index.
uint pos = (index + base)*stride;
Here I am calculating the absolute position in bytes. In my example pos is 28,643,328, with a stride of 28, index = 0 and base = 1,022,976. Well, that seems correct.
Unfortunately, I can't use vload3 directly because the offset parameter isn't interpreted as an absolute address in bytes. So I just add pos to the pointer void* vertices_gl:
void* new_addr = vertices_gl+pos;
new_addr is in my example = 0x2f90000 and this is where the strange part begins,
vertices_gl = 0x303f000
The result (new_addr) should be 0x4B90000 (0x303f000 + 28,643,328)
I don't understand why the address vertices_gl is being decreased by 716,800 (0xAF000).
I'm targeting the GPU: AMD Radeon HD5830
Ps: for those wondering, I am using a printf to get these values :) ( couldn't get CodeXL working)
There is no pointer arithmetic for void* pointers. Use char* pointers to perform byte-wise pointer computations.
Or a lot better than that: Use the real type the pointer is pointing to, and don't multiply offsets. Simply write vertex[index+base] assuming vertex points to your type containing 28 bytes of data.
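A minimal sketch of the byte-wise variant in plain C, where the same rule holds (standard C defines no arithmetic on void*). The helper name vertex_ptr is made up for illustration; in an OpenCL kernel the cast would also carry the address-space qualifier, for example __global const char*.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the address of vertex (index + base) inside a
 * buffer whose elements are `stride` bytes apart. */
static const void *vertex_ptr(const void *buffer, size_t index,
                              size_t base, size_t stride)
{
    assert(buffer != NULL);
    /* Cast to a byte pointer first: adding n to it advances by exactly n bytes. */
    const unsigned char *bytes = (const unsigned char *)buffer;
    return bytes + (index + base) * stride;
}
```

With the question's numbers (index = 0, base = 1,022,976, stride = 28) the helper advances the pointer by 28,643,328 bytes, which is the offset the question expected.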
Performance consideration: Align your vertex attributes to a power of two for coalesced memory access. This means, add 4 bytes of padding after each vertex entry. To automatically do this, use float8 as the vertex type if your attributes are all floating point values. I assume you work with position and normal data or something similar, so it might be a good idea to write a custom struct which encapsulates both vectors in a convenient and self-explaining way:
// Defining a type for the vertex data. This is 32 bytes large.
// You can share this code in a header for inclusion in both OpenCL and C / C++!
typedef struct {
float4 pos;
float4 normal;
} VertexData;
// Example kernel
__kernel void computeNormalKernel(__global VertexData *vertex, uint base) {
    uint index = get_global_id(0);
    VertexData thisVertex = vertex[index+base]; // It can't be simpler!
    thisVertex.normal = computeNormal(...); // Like you'd do it in C / C++!
    vertex[index+base] = thisVertex; // Of course also when writing
}
Note: This code doesn't work with your stride of 28 if you just change one of the float4s to a float3, since float3 also consumes 4 floats of memory. But you can write it like this, which will not add padding (but note that this will penalize memory access bandwidth):
typedef struct {
float pos[4];
float normal[3]; // Assuming you want 3 floats here
} VertexData;

CUDA device pointer manipulation

I've used:
float *devptr;
cudaMalloc(&devptr, sizeofarray);
cudaMemcpy(devptr, hostptr, sizeofarray, cudaMemcpyHostToDevice);
in CUDA C to allocate and populate an array.
Now I'm trying to run a cuda kernel, e.g.:
__global__ void kernelname(float *ptr)
in that array but with an offset value.
In C/C++ it would be something like this:
kernelname<<<dimGrid, dimBlock>>>(devptr+offset);
However, this doesn't seem to work.
Is there a way to do this without sending the offset value to the kernel in a separate argument and use that offset in the kernel code?
Any ideas on how to do this?
Pointer arithmetic does work just fine in CUDA. You can add an offset to a CUDA pointer in host code and it will work correctly (remembering the offset isn't a byte offset, it is a plain word or element offset).
EDIT: A simple working example:
#include <cstdio>
int main(void)
{
    const int na = 5, nb = 4;
    float a[na] = { 1.2, 3.4, 5.6, 7.8, 9.0 };
    float *_a, b[nb];
    size_t sza = size_t(na) * sizeof(float);
    size_t szb = size_t(nb) * sizeof(float);
    cudaMalloc((void **)&_a, sza );
    cudaMemcpy( _a, a, sza, cudaMemcpyHostToDevice);
    // _a+1 is an element offset: the copy starts at the second float
    cudaMemcpy( b, _a+1, szb, cudaMemcpyDeviceToHost);
    for(int i=0; i<nb; i++)
        printf("%d %f\n", i, b[i]);
    cudaFree(_a);
    return 0;
}
Here, you can see a word/element offset has been applied to the device pointer in the second cudaMemcpy call to start the copy from the second word, not the first.
Pointer arithmetic does work in host-side code; it's used fairly often in the example code provided by NVIDIA.
"Linear memory exists on the device in a 40-bit address space, so separately allocated entities can reference one another via pointers, for example, in a binary tree."
Read more at:
And from the performance primitives (npp) documentation, a perfect example of pointer arithmetic.
"4.5.1 Select-Channel Source-Image Pointer
This is a pointer to the channel-of-interest within the first pixel of the source image. E.g. if pSrc is the
pointer to the first pixel inside the ROI of a three channel image. Using the appropriate select-channel copy
primitive one could copy the second channel of this source image into the first channel of a destination
image given by pDst by offsetting the pointer by one:
nppiCopy_8u_C3CR(pSrc + 1, nSrcStep, pDst, nDstStep, oSizeROI);"
*Note: this works without multiplying by the number of bytes per data element because the compiler is aware of the data type of the pointer, and calculates the address accordingly.
In C and C++, pointer arithmetic can be written either as above or as &ptr[offset]. Take the address rather than the value: host-side code cannot dereference device memory, so ptr[offset] on its own will not work there. With either notation the size of the data type is handled automatically, and the offset is given as a number of data elements rather than bytes.

Get four 16bit numbers from a 64bit hex value

I have been through these related questions:
How to convert numbers between hexadecimal and decimal in C#?
How to Convert 64bit Long Data Type to 16bit Data Type
Way to get value of this hex number
But I did not get an answer, probably because I do not understand 64-bit or 16-bit values.
I had posted a question on Picasa and face detection, to use the face detection that Picasa does to get individual pics from a photo containing many pictures. Automatic Face detection using API
In an answer, @Joel Martinez linked to an answer on Picasa help which said:
The number encased in rect64() is a 64-bit hexadecimal number.
Break that up into four 16-bit numbers.
Divide each by the maximum unsigned 16-bit number (65535) and you'll have four
numbers between 0 and 1.
the full text
@oedious wrote: This is going to be somewhat technical, so hang on.
* The number encased in rect64() is a 64-bit hexadecimal number.
* Break that up into four 16-bit numbers.
* Divide each by the maximum unsigned 16-bit number (65535) and you'll have four numbers between 0 and 1.
* The four numbers remaining give you relative coordinates for the face rectangle: (left, top, right, bottom).
* If you want to end up with absolute coordinates, multiple the left and right by the image width and the top and bottom by the image height.
A sample picasa.ini file:
How do I get the 4 numbers from the 64 bit hex?
I am sorry, people, but currently I do not understand the answers. I guess I will have to learn some C++ (I am a PHP & Java web developer with a weakness in math) before I can jump in and write something that will cut up an image into multiple images with the help of some coordinates. I am looking into CodeLab and creating plugins for too
If you want basics, say you have this hexadecimal number:
4444333322221111
We split it into your 4 parts on paper, so all that's left is to extract them. This involves using a ffff mask to block out everything else besides our number (f masks nothing, 0 masks everything) and sliding it over each part. So we have:
part 1: 4444333322221111 & ffff = 1111
part 2: 4444333322221111 & ffff0000 = 22220000
part 3: 4444333322221111 & ffff00000000 = 333300000000
part 4: 4444333322221111 & ffff000000000000 = 4444000000000000
All that's left is to remove the 0's at the end. All in all, in C, you'd write this as:
int GetPart(uint64_t pack, int n) // uint64_t comes from <stdint.h>
{                                 // (unsigned __int64 in older msvc)
    // widen 0xffff to 64 bits before shifting; a plain int cannot be shifted by 32 or 48
    return (int)((pack & ((uint64_t)0xffff << (16*n))) >> (16*n));
}
So basically, you calculate the mask as 0xffff (2 bytes) moved to the left by 16*n bits (0 for the first, 16 for the 2nd, 32 for the 3rd and 48 for the 4th), apply it over the number to mask out everything but the part we're interested in, then shift the result back to the right by 16*n bits to clear out those 0's at the end.
Some additional reading: Bitwise operators in C.
Hope that helps!
Here is the algorithm:
The remainder of the division by 0x10000 (65536) will give you the first number.
Take the result then divide by 0x10000 (65536) again, the remainder will give you the second number.
Take the result, then divide by 0x10000 (65536) again; the remainder will give you the third number.
The result is the fourth number.
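The steps above can be sketched in C as follows. The function name split_by_division is made up for illustration, and "the result" in each step is taken to mean the integer quotient of the previous division.

```c
#include <assert.h>
#include <stdint.h>

/* Splits a 64-bit value into four 16-bit parts by repeated division.
 * parts[0] receives the lowest 16 bits, parts[3] the highest. */
static void split_by_division(uint64_t value, uint16_t parts[4])
{
    assert(parts != 0);
    for (int i = 0; i < 3; ++i) {
        parts[i] = (uint16_t)(value % 0x10000u); /* remainder = next 16 bits */
        value /= 0x10000u;                       /* quotient carries the rest */
    }
    parts[3] = (uint16_t)value; /* what is left is the fourth number */
}
```

For 0x4444333322221111 this yields 0x1111, 0x2222, 0x3333, 0x4444 in that order. The parts come out lowest-first, so the order may need reversing depending on which end of the number your format treats as the first value.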
It depends on your programming language. In C#, for example, you can use the BitConverter class, which lets you extract a number based on its byte position within a byte array (the byte order follows the machine's endianness).
UInt64 largeHexNumber = 420404334;
byte[] hexData = BitConverter.GetBytes(largeHexNumber);
UInt16 firstValue = BitConverter.ToUInt16(hexData, 0);
UInt16 secondValue = BitConverter.ToUInt16(hexData, 2);
UInt16 thirdValue = BitConverter.ToUInt16(hexData, 4);
UInt16 fourthValue = BitConverter.ToUInt16(hexData, 6);
It depends on the language. For the C-family of languages, it can be done like this (in C#):
UInt64 number = 0x4444333322221111;
//to get the ones, use a mask
// 0x4444333322221111
const UInt64 mask1 = 0xFFFF;
UInt16 part1 = (UInt16)(number & mask1);
//to get the twos, use a mask then shift
// 0x4444333322221111
const UInt64 mask2 = 0xFFFF0000;
UInt16 part2 = (UInt16)((number & mask2) >> 16);
// 0x4444333322221111
const UInt64 mask3 = 0xFFFF00000000;
UInt16 part3 = (UInt16)((number & mask3) >> 32);
// 0x4444333322221111
const UInt64 mask4 = 0xFFFF000000000000;
UInt16 part4 = (UInt16)((number & mask4) >> 48);
What I think you are being asked to do is take the 64 bits of data you have and treat them as four 16-bit integers. From there you take the 16-bit values and convert them to fractions between 0 and 1. Those fractions, when multiplied by the image height/width, give you 4 coordinates.
How you do this depends on the language you're programming in.
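In C, for instance, the conversion from one 16-bit part to a pixel coordinate could be sketched like this. The function name to_pixels is hypothetical; the divisor 65535 is the one given in the quoted Picasa help text.

```c
#include <assert.h>
#include <stdint.h>

/* Converts one 16-bit part into an absolute coordinate.
 * dimension is the image width (for left/right) or height (for top/bottom). */
static double to_pixels(uint16_t part, unsigned dimension)
{
    assert(dimension > 0);
    double relative = (double)part / 65535.0; /* fraction between 0 and 1 */
    return relative * (double)dimension;
}
```

A part of 0 maps to coordinate 0 and a part of 65535 maps to the full width or height; rounding to whole pixels is left to the caller.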
I needed to convert the crop=rect64() values from picasa.ini file.
I created the following ruby method with the above information.
def coordinates(hex_num)
It works, but I needed to add the .reverse method on the array to achieve the desired result.
