I am studying the pytorch codebase and I saw this function, which supposedly returns the alignment of a pointer with respect to 16 bytes.
I fail to understand how this algorithm does that. In fact, I tried a few addresses that were already divisible by 16, and the function returns 64 for the alignment, which I think is wrong? Shouldn't it be 0?
Formerly, this function was written as
uint8_t getAlignment(const at::Tensor &t) {
  // alignment are in bytes
  uint8_t alignment = 1;
  uintptr_t address = reinterpret_cast<uintptr_t>(t.data_ptr());
  while (address % alignment == 0 && alignment < 16) alignment *= 2;
  return alignment;
}
This one also doesn't seem correct to me, but this is a bit outside my territory, so I am not sure whether I understand it correctly.
Well, obviously it can't return 0, since it can only return alignment. Alignment starts off at 1 and is doubled a few times, so it has to be a power of 2.
Any address that's divisible by 64 (aligned to 64 bytes) is also divisible by 16, since 64 = 16 × 4. E.g. if the address were 17 × 64 = 1088, then it's also equal to 68 × 16. And by that logic, anything is aligned to 1.
So when talking about alignment, you have to consider whether the context means "aligned to exactly N, at least N, or at most N". Alignment requirements are typically at least N, actual pointer values are aligned to at most N.
This particular function is none of those three. It tries to answer two questions at once: is the pointer aligned to at least N = 64, and if it isn't, how aligned is it exactly?
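Judging by the description (the result is capped at 64 rather than 16), the current version presumably does something along these lines. This is an illustrative sketch, not the literal PyTorch source:

#include <cstdint>

// Illustrative sketch only, not the actual PyTorch code: report the largest
// power of two, capped at 64, that divides the address.
uint8_t getAlignmentUpTo64(const void* ptr) {
    const uintptr_t address = reinterpret_cast<uintptr_t>(ptr);
    uint8_t alignment = 1;
    for (; alignment < 64; alignment *= 2) {
        if (address % (alignment * 2) != 0) {
            return alignment;   // the next power of two no longer divides the address
        }
    }
    return alignment;           // divisible by 64 (or more): report 64
}

So an address divisible by 16 reports at least 16, and one divisible by 64 (or 128, 256, ...) reports exactly 64; it can never report 0.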
A program I wrote and ran on the GfG practice environment:
class Solution{
  public:
    int nCr(int n, int r){
        // code here
        const unsigned int M = 1000000007;
        long long dp[n+1]={0}, i, ans;
        if(n<r)
            return 0;
        dp[0]=1;
        dp[1]=1;
        for(i=2;i<=n;i++){
            dp[i]=i*dp[i-1];
        }
        ans=(dp[n])/(dp[n-r]*dp[r]);
        ans=ans%M;
        return ans;
    }
};
It crashes with a SIGFPE, and I don't really understand what is going on; the division is the obvious suspect, yet it seems to be well defined.
You are right to suspect the division as the origin of the SIGFPE. As you know, division is well defined as long as the divisor is not zero. At first glance, one wouldn't expect that dp[n-r]*dp[r] could become zero. But the elements of dp have a limited range of values they can hold. With a 64-bit long long, the maximum representable value is typically 2^63 - 1 = 9223372036854775807. This means that dp[i] has already overflowed for i > 20, though on common processors this overflow is silently ignored. As the factorial computation continues with ever higher values of i, more and more zero bits are "shifted in" from the right, until eventually all 64 bits are zero; on common processors this happens at i = 66, so the exception occurs whenever n-r or r is 66 or greater.
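To see this concretely, here is a small standalone sketch (using an unsigned 64-bit type, so the wrap-around is well defined, unlike the signed overflow in the snippet above) that finds the point where the stored factorial collapses to zero:

#include <cstdio>

int main() {
    // Factorial accumulated in a 64-bit unsigned integer, which on a typical
    // platform wraps modulo 2^64 instead of raising an error on overflow.
    unsigned long long fact = 1;
    for (unsigned long long i = 1; i <= 70; ++i) {
        fact *= i;   // wraps once the true value exceeds 2^64 - 1
        if (fact == 0) {
            // 66! contains 64 factors of 2, so its low 64 bits are all zero.
            std::printf("factorial wraps to exactly 0 at i = %llu\n", i);
            break;
        }
    }
    return 0;
}

This prints i = 66, matching the description above: once r or n-r reaches 66, dp[n-r]*dp[r] is exactly zero and the integer division traps.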
I have a 16-bit sample between -32768 and 32767.
To save space I want to convert it to an 8-bit sample, so I divide the sample by 256 and add 128.
-32768 / 256 = -128, and -128 + 128 = 0
32767 / 256 = 127.99, and 127.99 + 128 = 255.99
Now, the 0 fits perfectly in a byte, but the 255.99 has to be rounded down to 255, causing me to lose precision, because when converting back I'll get 32512 instead of 32767.
How can I do this without losing the original min/max values? I know I'm making a very obvious thinking error, but I can't figure out where the mistake lies.
And yes, of course I'm fully aware I lose precision by dividing, and will not be able to recover the original values from the 8-bit samples, but I just wonder why I don't get back the original maximum.
The answers for down-sampling have already been provided.
This answer relates to up-sampling using the full range. Here is a C99 snippet demonstrating how you can spread the error across the full range of your values:
#include <stdio.h>

int main(void)
{
    for( int i = 0; i < 256; i++ ) {
        // Map [0,255] onto [0,65535]: (i << 8) + i == i * 257
        unsigned short scaledVal = ((unsigned short)i << 8) + (unsigned short)i;
        printf( "%8d%8hu\n", i, scaledVal );
    }
    return 0;
}
It's quite simple. You shift the value left by 8 and then add the original value back. That means every increase by 1 in the [0,255] range corresponds to an increase by 257 in the [0,65535] range.
I would like to point out that this might give worse results than you began with. For example, if you downsampled 65280 (0xff00) you would get 255, but then upsampling that would give 65535 (0xffff), which is a total error of 255. You will have similarly large errors across most of the higher end of your data range.
You might do better to abandon the notion of going back to the full [0,65535] range, and instead offset your values by roughly half a step: shift left and add 127. This makes the error uniform instead of skewed toward one end. Because you don't actually know what the original value was, the best you can do is estimate it with a value right in the centre of the interval it came from.
To summarize, I think this is more mathematically correct:
unsigned short scaledVal = ((unsigned short)i << 8) + 127;
You don't get the original maximum because you can't represent the number 256 as an 8-bit unsigned integer.
If you're trying to compress your 16-bit integer value into an 8-bit integer range, you keep the most significant 8 bits and throw away the least significant 8 bits. Normally this is accomplished by shifting: the >> operator shifts bits from the most significant end towards the least significant end, so shifting right by 8 (>> 8) does the job. You can also mask off the low byte and divide away the zeros, doing any rounding before the division, with something like eightBit = (sixteenBit & 65280) / 256; (65280 a.k.a. 0xFF00).
Every bit you shift off the end of a value halves it, like an integer division by 2 that rounds down.
All of the above is complicated somewhat by the fact that you're dealing with a signed integer.
Finally I'm not 100% certain I got everything right here because really, I haven't tried doing this.
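Putting the signed handling together with the two up-scaling options discussed above, a hedged C++ sketch of the round trip might look like this; the function names are illustrative, and the downsample uses a bias-then-shift instead of the question's truncating division, which only differs for negative samples:

#include <cstdint>
#include <cstdio>

// Sketch only: 16-bit signed sample -> 8-bit unsigned byte and back.
uint8_t downsample(int16_t s) {
    // Bias into [0, 65535] so the shift works on an unsigned value, then keep
    // the top 8 bits. (Close to the question's "/256 + 128", but rounding
    // toward negative infinity instead of toward zero for negative samples.)
    uint16_t u = (uint16_t)(s + 32768);
    return (uint8_t)(u >> 8);
}

int16_t upsampleFullRange(uint8_t b) {
    // "Times 257" spread from the snippet above: 0 -> 0, 255 -> 65535.
    uint16_t u = (uint16_t)((b << 8) + b);
    return (int16_t)(u - 32768);   // un-bias back to signed
}

int16_t upsampleMidpoint(uint8_t b) {
    // Midpoint estimate: shift up and add 127, as suggested above.
    uint16_t u = (uint16_t)((b << 8) + 127);
    return (int16_t)(u - 32768);   // un-bias back to signed
}

int main() {
    const int16_t samples[] = { -32768, -1, 0, 1, 32767 };
    for (int16_t s : samples) {
        uint8_t b = downsample(s);
        std::printf("%6d -> %3u -> full %6d, mid %6d\n",
                    (int)s, (unsigned)b,
                    (int)upsampleFullRange(b), (int)upsampleMidpoint(b));
    }
    return 0;
}

With the full-range variant, -32768 and 32767 both survive the round trip exactly, which is what the question was after; the midpoint variant instead keeps the error bounded and symmetric for every sample.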
On my Windows desktop I multiply two numbers:
var a:Number = 31.05263157894737;
trace(a * 19) // will print '590'
It's obvious that dividing 590 by a leaves a remainder of 0, right? Well, for some reason I get a different result:
trace(590 % a) // will print '31.05263'
My question is: how does this happen? And why does 1 % 0.5 give a correct remainder of 0?
31.05263157894737 * 19 is not exactly 590, it's 590.00000000000003
In other words, 590.00000000000003 % 31.05263157894737 = 0, but since 590 is slightly smaller, it will be just slightly less than required to reach/wrap around to 0.
Either way, even values that look exact in source code will seldom give you exact results in floating-point math, since not all numbers can be represented exactly by single/double types, and even tiny rounding errors can (as in this case) give fairly non-obvious results.
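The question is about ActionScript, but the behaviour comes from IEEE-754 doubles, so the same effect can be reproduced in, say, C++ (a hedged illustration; the exact digits printed depend on the precision requested):

#include <cmath>
#include <cstdio>

int main() {
    const double a = 31.05263157894737;

    // The product is not exactly 590; with 17 significant digits the tiny
    // excess becomes visible.
    std::printf("a * 19       = %.17g\n", a * 19);               // ~590.00000000000003
    std::printf("590 %% a      = %.17g\n", std::fmod(590.0, a)); // ~31.052631578947...

    // 1 and 0.5 are both exactly representable in binary, so this one is exact.
    std::printf("fmod(1, 0.5) = %.17g\n", std::fmod(1.0, 0.5));  // 0
    return 0;
}

That also answers the second part of the question: 1 and 0.5 have exact binary representations, so 1 % 0.5 really is 0, whereas 31.05263157894737 does not.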
I'm writing an emulator for the 6502. Basically, for some instructions an offset is kept in one of the registers (mostly X and Y), and since branch instructions use signed 8-bit integers, I'm wondering: do the registers keep their values as signed 8-bit too? Meaning this:
switch(opcode) {
    // Bunch of opcodes
    case 0xD5:
        // Read the memory area, with the final address being address + x offset
        int tempResult = a - readMemory(address + x);
        // Comparing some things, setting/disabling flags
        // Incrementing program counter and cycles/ticks
        break;
    // More opcodes
}
Let's say in this situation that x = 0xEE. As a plain binary value this would mean x = 238. On the 6502, however, the branch instructions use a signed offset for jumping to memory addresses, so I'm wondering: is the 238 interpreted as -18 in this case, or is it just a regular unsigned 8-bit value?
It varies.
They're not explicitly signed or unsigned for arithmetic, logical, shift, or load and store operations.
The conditional branches (and the unconditional one on the later 6502 descendants) all take the argument as signed; otherwise loops would be extremely awkward.
zero, x addressing is achieved by performing an 8-bit addition of x to the zero page address, ignoring carry, and reading from the zero page. So e.g.
LDX #-126 ; which is +130 if unsigned
LDA 23, x
Would read from address 23 + 130 = 153. But had the base address been 223, then the read would have been from (223 + 130) MOD 256 = 97.
absolute, x/y is unsigned and carry works correctly (but costs an extra cycle)
(zero, x) is much like the direct version in that the offset is signed but the result is always within the zero page. Then the real address is read from there.
(zero), y is unsigned, with the carry working and costing the extra cycle.
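In emulator terms, the difference between the zero-page and absolute indexed modes is just whether the addition wraps to 8 bits. A hedged C++ sketch (function names are illustrative, not from any particular emulator):

#include <cstdint>

// zero,x: 8-bit addition, carry ignored, so the result stays in the zero page.
uint16_t zeroPageX(uint8_t base, uint8_t x) {
    return (uint16_t)((base + x) & 0xFF);
}

// absolute,x / absolute,y: full 16-bit addition; the carry into the high byte works.
uint16_t absoluteX(uint16_t base, uint8_t x) {
    return (uint16_t)(base + x);
}

With a base of 23 and x = 130, both forms give 153; with a base of 223 the zero-page form wraps to (223 + 130) MOD 256 = 97, while the absolute form carries into the high byte and gives 353. The 0xD5 case from the question is CMP zero page,X, so it would use the wrapping form.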
The "sign" is simply the value of the most significant bit (aka bit 7) of an 8-bit byte.
6502 has support for signed values in these ways:
The N bit in .P - but it really just tells you if the last instruction turned on or off bit 7 of a memory location or register. It was common to use BPL/BMI to do stuff based on bit 7 in a memory location for flag or "boolean" like use.
The V bit of .P which is flipped "when the result of adding two positive numbers overflows and ends up negative, and when the result of adding two negative numbers overflows and ends up positive"
And of course obeying the sign bit for relative branch instructions only, e.g. BEQ with a value with bit 7 set will move to a lower memory location, not a higher one.
Beyond that, whether that bit means anything is completely up to you and your program. What really makes numbers signed or unsigned is how you display the numbers.
The linked article above goes into what one's complement and two's complement are and how they make the mathematics work without the 6502 having to care too much about the sign.
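To tie this back to the question's 0xEE example: in the register it is simply 238; it only becomes -18 when an instruction that treats the byte as a relative offset reinterprets bit 7 as the sign. A hedged C++ sketch of that reinterpretation (names are illustrative):

#include <cstdint>

// Two's-complement reinterpretation of an 8-bit operand as a signed offset.
int asSignedOffset(uint8_t value) {
    // Bit 7 set means negative: 0xEE (238) becomes -18.
    return value < 0x80 ? value : value - 256;
}

// Applying a taken branch: pcAfterBranch is assumed to already point at the
// instruction following the 2-byte branch, which is what the offset is relative to.
uint16_t branchTarget(uint16_t pcAfterBranch, uint8_t operand) {
    return (uint16_t)(pcAfterBranch + asSignedOffset(operand));
}

So asSignedOffset(0xEE) returns -18, and a taken branch with that operand moves the program counter 18 bytes backwards.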
I have noticed a number of kernel sources that look like this (found randomly by Googling):
__kernel void fill(__global float* array, unsigned int arrayLength, float val)
{
    if(get_global_id(0) < arrayLength)
    {
        array[get_global_id(0)] = val;
    }
}
My question is whether that if-statement is actually necessary (assuming that "arrayLength" in this example is the same as the global work size).
In some of the more "professional" kernels I have seen, it is not present. It also seems to me that the hardware would do well not to assign work-items to nonsense coordinates.
However, I also know that processors work in groups. Hence, I can imagine that some processors of a group must do nothing (for example, if you have 1 group of size 16 and a work size of 41, then the group would process the first 16 work items, then the next 16, then the next 9, with 7 processors not doing anything--do they get dummy kernels?).
I checked the spec., and the only relevant mention of "get_global_id" is the same as the online documentation, which reads:
The global work-item ID specifies the work-item ID based on the number of global work-items specified to execute the kernel.
. . . based how?
So what is it? Is it safe to omit iff the array's size is a multiple of the work group size? What?
You have the right answer already, I think. If the global size of your kernel execution is the same as the array length, then this if statement is useless.
In general, that type of check is only needed for cases where you've partitioned your data in such a way that you know you might execute extra work items relative to your array size. In my experience, you can almost always avoid such cases.
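For completeness, the usual situation where the check is needed is when the host rounds the global size up to the next multiple of the work-group size so the NDRange divides evenly. A hedged, host-side arithmetic sketch in plain C++ (no OpenCL API calls, just the rounding):

#include <cstdio>

int main() {
    const unsigned int arrayLength = 41;  // work size from the question's example
    const unsigned int localSize   = 16;  // a chosen work-group size

    // Round the global size up to the next multiple of the work-group size.
    const unsigned int globalSize =
        ((arrayLength + localSize - 1) / localSize) * localSize;

    std::printf("global size = %u, extra work-items = %u\n",
                globalSize, globalSize - arrayLength);
    // Prints: global size = 48, extra work-items = 7
    // Those 7 extra work-items run off the end of the array, which is exactly
    // why the kernel then needs: if (get_global_id(0) < arrayLength)
    return 0;
}

When the global size equals arrayLength exactly, as noted above, the guard is redundant.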