I'm trying to finish a variable-locations parser for HLSL, but I can't find a way to get the size of a structure member. In the question How to reflect information about hlsl struct members? it was recommended to use the offset, but that doesn't give the actual size of a member because of 16-byte packing: if a structure consists of a float2 and a float4, its total size is 32 bytes and the offset of the second member is 16 bytes, but the real size of the first member is 8 bytes, not 16. I understand that in terms of the memory passed there are 16 bytes reserved for that variable and only 8 of them are used, but I have a "safety check" that, when some value is put into the stream at the location of this variable, verifies the value is exactly the same size. (If I only checked that the value is at least the variable's size, placing it in the stream could overwrite the data of the next variables; if I only checked that it is at most the variable's size, it could lead to hard-to-track bugs whenever I forget to pass enough data for the variable.) So, is there a way to get the real size of a structure member?
I checked the storage size, but I'm confused when it comes to storing numbers.
In the case of Bytes, what does "byte length" mean? If I store -128, what's the length? And in the case of 12?
In the case of Floating-point number and Integer, does it matter whether I store 325 or 9.9999999999999? Will it always be 8 bytes?
In the case of Array? Let's say we have ["ab", "bcd"]: what's the size, 2 + 3 = 5 or (2 + 1) + (3 + 1) = 7?
If you store an array of bytes, the size will simply be the length of that array. An array with a single byte value of -128 is still just one byte.
Yes, all numbers occupy the same 8-byte size, even if you don't see a fractional part.
The documentation says it's the sum of the array element sizes, so I would expect 7: the sum of the two individual string sizes, each being its UTF-8 encoded length plus 1.
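If it helps to see that byte counting spelled out, here is a small stand-alone C snippet (just the arithmetic, nothing specific to the database API):

#include <stdio.h>
#include <string.h>

/* Worked example of the arithmetic above: the sum of each string's
   byte length plus one. For these ASCII strings the byte length equals
   the UTF-8 encoded length, so for ["ab", "bcd"] this prints 7. */
int main(void)
{
    const char *items[] = { "ab", "bcd" };
    size_t total = 0;

    for (size_t i = 0; i < sizeof items / sizeof items[0]; i++)
        total += strlen(items[i]) + 1;   /* string bytes + 1 per element */

    printf("%zu\n", total);              /* prints 7 */
    return 0;
}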
Here is a detailed description of the dlmalloc algorithm: http://g.oswego.edu/dl/html/malloc.html
A dlmalloc chunk is bookended by some metadata, which includes information about the amount of space in the chunk. Two contiguous free chunks might look like
[metadata | X bytes free space | metadata][metadata | X bytes free space | metadata]
               Block A                                   Block B
In that case we want to coalesce block B into block A. Now how many bytes of free space should block A report?
I think it should be 2X + 2*size(metadata) bytes, since now the coalesced block looks like:
[metadata | X bytes free space metadata metadata X bytes free space | metadata]
But I'm wondering if this is correct, because I have a textbook that says the metadata will report 2X bytes without including the extra space we get from being able to write over the metadata.
You can see the answer yourself by looking at the source. Begin with line 1876 to verify your diagram. The metadata is just two size_t unsigned integers, accessed by aliasing a struct malloc_chunk (line 1847). Field prev_size is the size of the previous chunk, and size is the size of this one. Both include the size of the struct malloc_chunk itself. This will be 8 or 16 bytes on nearly all machines depending on whether the code is compiled for 32- or 64-bit addressing.
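For reference, here is a simplified sketch of that header layout (not the verbatim dlmalloc declaration; the real struct malloc_chunk also carries free-list pointers used when the chunk is free):

#include <stddef.h>

/* Simplified sketch of the chunk metadata described above.
   Both size fields count the header itself as part of the chunk. */
struct chunk_header {
    size_t prev_size;   /* size of the previous chunk, header included */
    size_t size;        /* size of this chunk, header included         */
};
/* Two size_t fields: 8 bytes on a 32-bit build, 16 bytes on a 64-bit build. */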
The "normal case" coalescing code starts at line 3766. You can see that the size variable it's using to track coalescing is chunk size.
So - yeah - in the code blocks marked /* consolidate backward */ and /* consolidate forward */, when he adds the size of the preceding and succeeding chunks, he's implicitly adding the size of the struct malloc_chunk as you suspected.
This shows that your interpretation is correct. My expectation is that the textbook author just got sloppy about the difference between chunk size (which includes metadata) and the size of the memory block allocated to the user. Incidentally, malloc takes care of this difference at line 3397.
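Put as plain arithmetic, in the questioner's own terms rather than dlmalloc's code: with X bytes of free space per block and M bytes of metadata bracketing each block, coalescing lets the merged block reuse the two interior metadata regions as free space.

#include <stddef.h>

/* Sketch of the size arithmetic from the diagram above, not dlmalloc source. */
size_t coalesced_free_space(size_t x, size_t m)
{
    /* The two X-byte regions plus the two interior metadata regions
       that can now be overwritten: 2X + 2*size(metadata). */
    return 2 * x + 2 * m;
}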
Perhaps the bigger lesson here is that - when you're trying to learn anything - you should never skip an opportunity to go straight to the first-hand source and figure stuff out for yourself.
I was trying to find the best work-group size for a problem, and I ran into something I couldn't explain to myself.
These are my results:
GlobalWorkSize {6400 6400 1}, WorkGroupSize {64 4 1}, Time(Milliseconds) = 44.18
GlobalWorkSize {6400 6400 1}, WorkGroupSize {4 64 1}, Time(Milliseconds) = 24.39
Swapping the axes made execution twice as fast. Why?!
By the way, I was using an AMD GPU.
Thanks :-)
EDIT:
This is the kernel (a simple matrix transposition):
__kernel void transpose(__global float *input, __global float *output, const int size){
    int i = get_global_id(0);
    int j = get_global_id(1);
    output[i*size + j] = input[j*size + i];
}
I agree with @Thomas; it most probably depends on your kernel. Most likely, in the second case you access memory in a coalesced way and/or make full use of each memory transaction.
Coalescence: when threads need to access elements in memory, the hardware tries to access those elements in as few transactions as possible, i.e. if thread 0 and thread 1 have to access contiguous elements, there will be only one transaction.
Full use of a memory transaction: let's say you have a GPU that fetches 32 bytes in one transaction. If you have 4 threads that each need to fetch one int, you are using only half of the data fetched by the transaction and wasting the rest (assuming an int is 4 bytes).
To illustrate this, let's say that you have a n by n matrix to access. Your matrix is in row major, and you use n threads organized in one dimension. You have two possibilities:
1. Each work-item takes care of one column, looping through each column element one at a time.
2. Each work-item takes care of one row, looping through each row element one at a time.
It might be counter-intuitive, but the first solution will be able to make coalesced accesses while the second won't. The reason is that when the first work-item needs to access the first element of the first column, the second work-item accesses the first element of the second column, and so on. These elements are contiguous in memory. This is not the case for the second solution.
Now if you take the same example and apply solution 1, but this time with 4 work-items instead of n and the same GPU as before, you will most probably increase the time by a factor of 2, since you will waste half of each memory transaction.
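As a rough illustration of the two strategies (hypothetical kernels, not your transpose; n is the matrix width, row-major storage, one work-item per column or per row):

// Solution 1: work-item k walks column k. At each loop iteration,
// neighbouring work-items read neighbouring addresses -> coalesced.
__kernel void sum_columns(__global const float *in, __global float *out, const int n)
{
    int k = get_global_id(0);
    float acc = 0.0f;
    for (int row = 0; row < n; ++row)
        acc += in[row * n + k];   // consecutive k -> consecutive addresses
    out[k] = acc;
}

// Solution 2: work-item k walks row k. At each loop iteration,
// neighbouring work-items read addresses n floats apart -> not coalesced.
__kernel void sum_rows(__global const float *in, __global float *out, const int n)
{
    int k = get_global_id(0);
    float acc = 0.0f;
    for (int col = 0; col < n; ++col)
        acc += in[k * n + col];   // consecutive k -> stride of n floats
    out[k] = acc;
}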
EDIT: Now that you posted your kernel I see that I forgot to mention something else.
With your kernel, it seems that choosing a local size of (1, 256) or (256, 1) is always a bad choice. In the first case, 256 transactions are necessary to read a column from input (each fetching 32 bytes of which only 4 are used, keeping the same GPU as in my previous examples), while 32 transactions are necessary to write to output: you can write 8 floats in one transaction, hence 32 transactions for the 256 elements.
It is the same problem with a work-group size of (256, 1), but this time using 32 transactions to read and 256 to write.
So why does the first size work better? Because there is a cache system that can mitigate the bad accesses on the read side. The size (1, 256) is therefore good for the write part, and the cache system handles the not-so-good read part, decreasing the number of read transactions actually needed.
Note that the number of transactions decreases overall (taking into consideration all the work-groups within the NDRange). For example, the first work-group issues 256 transactions to read the first 256 elements of the first column. The second work-group might just go to the cache to retrieve the elements of the second column, because they were already fetched by the 32-byte transactions issued by the first work-group.
Now, I'm almost sure that you can do better than (1, 256); try (8, 32).
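For completeness, the work-group size is just the local_work_size argument of clEnqueueNDRangeKernel, so trying (8, 32) is a one-line change on the host. A sketch, assuming queue and kernel are already created and the device supports work-groups of 256 work-items:

#include <CL/cl.h>

/* Hypothetical helper: enqueue the transpose kernel with an explicit
   work-group size. Only `local` differs between the timed runs above. */
static cl_int run_transpose(cl_command_queue queue, cl_kernel kernel, size_t n)
{
    size_t global[2] = { n, n };
    size_t local[2]  = { 8, 32 };   /* e.g. the suggested (8, 32) */

    return clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                                  global, local, 0, NULL, NULL);
}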
My problem: why does my program take such a long time to execute? The program is supposed to check the user's password. The approach used is:
take the password from the console into an array, and
compare it with the previously saved password;
the comparison is done by the function str_cmp(), which returns zero if the strings are equal and non-zero if they are not.
#include<stdio.h>

char str_cmp(char *,char *);

int main(void)
{
    int i=0;
    char c,cmp[10],org[10]="0123456789";

    printf("\nEnter your account password\ntype 0123456789\n");
    for(i=0;(c=getchar())!=EOF;i++)
        cmp[i]=c;
    if(!str_cmp(org,cmp))
    {
        printf("\nLogin Sucessful");
    }
    else
        printf("\nIncorrect Password");
    return 0;
}
char str_cmp(char *porg,char *pcmp)
{
    int i=0,l=0;

    for(i=0;*porg+i;i++)
    {
        if(!(*porg+i==*pcmp+i))
        {
            l++;
        }
    }
    return l;
}
There are libraries available to do this much more simply, but I will assume that this is an assignment and that either way it is a good learning experience. I think the problem is in the for loop of your str_cmp function. The condition you are using is *porg+i. This is not really doing a comparison: the loop keeps going until that expression is equal to 0, which only happens after i gets so large that *porg + i overflows what an int can store and wraps back around toward zero (this is called overflowing the variable).
Instead, you should pass a size into the str_cmp function corresponding to the length of the strings. In the for loop condition you should make sure that i < str_size.
However, there is a built-in strncmp function (http://www.elook.org/programming/c/strncmp.html) that does this exact thing.
You also have a different problem. You are doing pointer addition like so:
*porg+i
This is going to take the value of the first element of the array and add i to it. Instead you want to do:
*(porg+i)
That will add to the pointer and then dereference it to get the value.
To clarify more fully with the comparison, because this is a very important concept for pointers: porg is defined as a char*. This means that you have a variable that holds the memory address of a char. When you use the dereference operator (*, for example *porg) on the variable, it returns the value stored in that piece of memory. However, you can add a number to the memory location to move to a different memory location. porg + 1 is going to return the memory location after porg. Therefore, when you do *porg + 1 you are getting the value at the memory address and adding 1 to it. On the other hand, when you do *(porg + 1) you are getting the value at the memory address one after where porg is pointing to. This is useful for arrays because arrays store their values one after another. However, a more understandable notation for doing this is: porg[1]. This says "get the value 1 after the beginning of the array", or in other words "get the second element of the array".
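A tiny stand-alone example of that difference (illustrative only):

#include <stdio.h>

int main(void)
{
    char s[] = "AZM";
    char *p = s;

    printf("%c\n", *p + 1);     /* value at p, plus 1: 'A' + 1 = 'B'      */
    printf("%c\n", *(p + 1));   /* value stored one element after p: 'Z'  */
    printf("%c\n", p[1]);       /* same thing as *(p + 1): 'Z'            */
    return 0;
}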
All conditions in C just check whether the value is zero or non-zero: zero means false, and every other value means true. When you use the expression *porg + i as a condition, it is going to compute (value at porg) + i and check whether the result is zero or not.
This leads me to the other very important concept for programming in C. An int can only hold values up to a certain size; if it is incremented beyond that maximum value, it wraps back around (strictly speaking this is undefined behaviour for a signed int, but wrapping is what you will usually observe). So let's say the maximum value of an int were 256: adding 1 to an int holding 256 would wrap around instead of giving 257. In reality the maximum on most compilers today is 2,147,483,647 (a 32-bit int), which is why the program takes so long: the loop has to run for billions of iterations before *porg + i wraps far enough to evaluate to zero again.
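A quick way to check the actual limit on your machine (a side illustration, not part of the original program):

#include <stdio.h>
#include <limits.h>

/* On a typical 32-bit int, INT_MAX is 2,147,483,647, so the loop
   condition *porg + i only reaches zero after the counter has wrapped
   around, billions of iterations later (and formally undefined behaviour). */
int main(void)
{
    printf("INT_MAX = %d\n", INT_MAX);
    printf("'0' = %d, so *porg + i reaches zero only when i == -'0'\n", '0');
    return 0;
}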
Try including string.h:
#include <string.h>
Then use the built-in strcmp() function. The existing string functions have already been written to be as fast as possible in most situations.
Also, I think your for statement is messed up:
for(i=0;*porg+i;i++)
That's going to dereference the pointer, then add i to it. I'm surprised the for loop ever exits.
If you change it to this, it should work:
for(i=0;porg[i];i++)
Your original string is also one longer than you think it is. You allocate 10 bytes, but it's actually 11 bytes long. A string (in quotes) is always ended with a null character. You need to declare 11 bytes for your char array.
Another issue:
if(!(*porg+i==*pcmp+i))
should be changed to
if(!(porg[i]==pcmp[i]))
For the same reasons listed above.
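Putting the fixes from both answers together, here is a sketch of what the hand-rolled comparison could look like (still written out for the exercise; the standard strcmp()/strncmp() remain the better choice):

#include <stdio.h>

/* Sketch applying the suggested fixes: index with porg[i] instead of
   *porg+i, stop at the terminating null, and make sure a shorter input
   cannot appear equal. Returns 0 on a match, 1 otherwise. */
static int my_str_cmp(const char *porg, const char *pcmp)
{
    int i;

    for (i = 0; porg[i] != '\0'; i++) {
        if (porg[i] != pcmp[i])
            return 1;           /* mismatch (also catches a shorter pcmp) */
    }
    return pcmp[i] != '\0';     /* 0 only if pcmp ends here too */
}

int main(void)
{
    /* 11 bytes: ten digits plus the terminating null, as noted above. */
    char org[11] = "0123456789";

    printf("%d\n", my_str_cmp(org, "0123456789"));  /* 0: match    */
    printf("%d\n", my_str_cmp(org, "012345"));      /* 1: mismatch */
    return 0;
}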
Given the following structure
typedef struct
{
    float3 position;
    float8 position1;
} MyStruct;
I'm creating a buffer to pass it as a pointer to the kernel; the buffer will have the layout of the structure above.
I understand that I have to add 4 bytes of padding in the buffer after writing the three floats, to reach the next power of two (16 bytes), but I don't understand why I have to add another 16 extra bytes before writing the bytes of position1. Otherwise I get wrong values in position1.
Can someone explain to me why?
A float8 is a vector of 8 floats, each float being 4 bytes, which makes a size of 32 bytes. As per section 6.1.5 of the OpenCL 1.2 specification, Alignment of Types, types are always aligned to their size, so the float8 must be 32-byte aligned. The same section also tells us that a float3 takes 4 words (16 bytes). Also, since the sizeof of a struct is arranged to allow arrays of the struct, it won't shrink from reordering these particular fields. In more complex structs you can save space by keeping the smaller fields together.
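To make the layout concrete, here is a sketch of the host-side packing those rules imply (an illustration of the byte offsets with hypothetical variable names, not a complete OpenCL host program): the float3 occupies a 16-byte slot (12 bytes used + 4 padding), and the float8 must start on a 32-byte boundary, so the struct ends up 64 bytes.

#include <string.h>
#include <stdio.h>

int main(void)
{
    float position[3]  = { 1.0f, 2.0f, 3.0f };
    float position1[8] = { 0, 1, 2, 3, 4, 5, 6, 7 };

    unsigned char buffer[64] = { 0 };          /* total struct size: 64 bytes */

    memcpy(buffer +  0, position,  sizeof position);   /* bytes  0..11        */
    /* bytes 12..15: padding so the float3 fills a full 16-byte slot          */
    /* bytes 16..31: padding so the float8 starts on a 32-byte boundary       */
    memcpy(buffer + 32, position1, sizeof position1);  /* bytes 32..63        */

    printf("position1 starts at offset 32 of a 64-byte struct\n");
    return 0;
}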