Vector Add Scalar Single Precision - Intel

I was reading about AVX scalar floating-point instructions and ran into a doubt. Consider the VADDSS instruction (Vector Add Scalar Single Precision).
We are adding the lower 32 bits of the xmm1 and xmm2 registers and storing the result in the lower 32 bits of xmm0. Now I have a doubt here. Say all the lower 31 bits are 0 and the MSB (of the lower 32 bits) is 1 in both registers, i.e. 1000..00 for xmm1 as well as 1000..00 for xmm2. Now if we add them, xmm0[31:0] becomes all zero, but bit 32 should become 1. Yet the addition does not store that 1; the instruction just replaces xmm0[127:32] with xmm1[127:32]. Isn't that wrong?
Moreover, when we are adding bits in parallel, how is the carry propagated? Do we use a carry-lookahead adder in these cases?
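For what it's worth, the lower 32 bits here are an IEEE-754 single-precision float, not an integer, so 1000..00 is the bit pattern of -0.0f and the addition follows floating-point rules, with no integer carry out of bit 31. A minimal C sketch of the add-low-lane-and-merge semantics, using the SSE intrinsic _mm_add_ss as a stand-in for the three-operand AVX form (the values are made up for illustration):

#include <stdio.h>
#include <xmmintrin.h>  /* SSE intrinsics */

int main(void) {
    /* Bit pattern 1000..00 in a 32-bit lane is -0.0f as a float. */
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, -0.0f); /* plays the role of xmm1 */
    __m128 b = _mm_set_ps(8.0f, 7.0f, 6.0f, -0.0f); /* plays the role of xmm2 */

    /* Adds only the lowest lanes; copies a's upper 96 bits unchanged. */
    __m128 r = _mm_add_ss(a, b);

    float out[4];
    _mm_storeu_ps(out, r);
    /* Prints: -0.000000 2.000000 3.000000 4.000000 */
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
    return 0;
}

Under IEEE-754 rules, -0.0f + -0.0f is simply -0.0f, so the "missing carry" in the example never arises.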

Related

What is the max number that you can multiply by 27 in 32 bits

I am writing some code that multiplies a number by a factor from 1 to 27. I need to make a fail-safe so that no result can go over the 32-bit limit. Rounding to 2^32/2^64 would not work. It needs to be 32 bits so it can support both 32- and 64-bit OSes.
If you want to multiply 3 by 5, but know the maximum allowed result is 10, you can easily tell that 3 is too large because 3 > 10/5. That's all there is to it :)
Since you insist on using a 32-bit type, and I assume your programming language is C, the maximum value int32_t can represent is INT32_MAX - those two come from #include <stdint.h>.
But you may be mistaken in your assumption about being limited to 32-bit types: int64_t works on most if not all major 32-bit platforms :)
(2^32/27 - 1) will not give you the correct upper value. As an integer it is 159072861, which is one too low.
The maximum integer value that can be stored in 32 bits is 2^32 - 1, which works out to 4294967295.
So the maximum value is actually (2^32 - 1) // 27, which is 159072862.
Note the use of integer division, which I assume is what you want.
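A minimal C sketch of the guard both answers describe, using the unsigned flavor from the second answer (names are illustrative):

#include <stdint.h>
#include <stdio.h>

/* Largest x such that x * 27 still fits in an unsigned 32-bit value:
 * (2^32 - 1) / 27 = 159072862 with integer division. */
#define MAX_BEFORE_MUL (UINT32_MAX / 27u)

/* Returns 1 and writes x * factor to *out if it fits in 32 bits, else 0. */
static int checked_mul(uint32_t x, uint32_t factor, uint32_t *out) {
    if (factor != 0 && x > UINT32_MAX / factor)
        return 0;  /* would overflow */
    *out = x * factor;
    return 1;
}

int main(void) {
    uint32_t r;
    printf("limit: %u\n", (uint32_t)MAX_BEFORE_MUL);         /* 159072862 */
    printf("ok:    %d\n", checked_mul(159072862u, 27u, &r)); /* 1 */
    printf("fail:  %d\n", checked_mul(159072863u, 27u, &r)); /* 0 */
    return 0;
}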

Compute sum of bits efficiently with SSE

I have done a calculation using SSE to improve the performance of my code, of which I include a minimal working example. I have included comments and the compilation line to make it as clear as possible, please ask if you need any clarification.
I am trying to sum N bits, bit[0], ..., bit[N-1], and write the result in binary into a vector result[0], ..., result[bits_N-1], where bits_N is the number of bits needed to write N in binary. The sum is performed bit-by-bit: each bit[i] is an unsigned long long int, and its j-th bit stores either 0 or 1. As a result, I perform 64 sums, each of N bits, in parallel.
In lines 80-105 I make this sum by using 64-bit arithmetic.
In lines 107-134 I do it using SSE: I store the first half of the sum, bit[0], ..., bit[N/2-1], in the first 64 bits of __m128i objects BIT[0], ..., BIT[N/2-1], respectively. Similarly, I store bit[N/2], ..., bit[N-1] in the last 64 bits of BIT[0], ..., BIT[N/2-1], respectively, and sum all the BITs. So far everything works fine, and the 128-bit sum takes the same time as the 64-bit one. However, to collect the final result I need to sum the two halves together, see lines 125-132. This takes a long time and makes me lose the gain obtained with SSE.
I am running this on an Intel(R) i7-4980HQ CPU @ 2.80GHz with gcc 7.2.0.
Do you know a way around this?
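For context, here is a minimal plain-C sketch of the bit-by-bit (bit-sliced) summing the question describes; the names and the tiny test are illustrative, not the asker's actual code:

#include <stdint.h>
#include <stdio.h>

/* Add one plane of 64 parallel 1-bit inputs into a bit-sliced counter.
 * result[j] holds bit j of each of the 64 running sums. */
static void add_bit_plane(uint64_t result[], int bits_N, uint64_t bit) {
    uint64_t carry = bit;
    for (int j = 0; j < bits_N && carry; ++j) {
        uint64_t prev = result[j];
        result[j] = prev ^ carry;   /* per-lane sum bit   */
        carry     = prev & carry;   /* per-lane carry out */
    }
}

int main(void) {
    enum { N = 5, BITS_N = 3 };     /* 3 bits suffice to write 5 */
    uint64_t result[BITS_N] = {0};
    /* Lane 0 sees five 1s, lane 1 sees three 1s, other lanes see 0s. */
    uint64_t bit[N] = {3, 3, 3, 1, 1};
    for (int i = 0; i < N; ++i)
        add_bit_plane(result, BITS_N, bit[i]);
    /* Lane 0 sum = 5 (binary 101), lane 1 sum = 3 (binary 011),
     * so this prints result[0] = 3, result[1] = 2, result[2] = 1. */
    for (int j = 0; j < BITS_N; ++j)
        printf("result[%d] = %llu\n", j, (unsigned long long)result[j]);
    return 0;
}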
The low part can be trivially saved with the movq instruction, using either the _mm_storel_epi64(__m128i* mem_addr, __m128i a) intrinsic (store to memory) or _mm_cvtsi128_si64 (move to a general-purpose register).
There is also a _mm_storeh_pd counterpart, which requires a cast to pd and may cause a stall due to mixing floating-point and integer domains.
The top part can of course be moved to the low part with _mm_shuffle_epi32(src, 0x4e) and then saved with movq.
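Putting the answer together, a small sketch (assuming SSE2 and a 64-bit target; the variable names are made up):

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <stdio.h>

int main(void) {
    __m128i v = _mm_set_epi64x(40, 2);  /* high half = 40, low half = 2 */

    /* Low 64 bits: compiles to a single movq. */
    uint64_t lo = (uint64_t)_mm_cvtsi128_si64(v);

    /* High 64 bits: swap the halves (0x4e = 01 00 11 10), then movq. */
    uint64_t hi = (uint64_t)_mm_cvtsi128_si64(_mm_shuffle_epi32(v, 0x4e));

    printf("%llu\n", (unsigned long long)(lo + hi));  /* prints 42 */
    return 0;
}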

OpenCL and AMD GPU architecture understanding

So I was reading the architecture for GCN 1st-generation GPUs described in the paper here, and I'm a bit confused about the size of the vector ALUs and some other things.
1) According to it, each compute unit has 1 scalar unit and 4 SIMDs. Each of these 4 SIMDs has 16 ALUs to perform vector operations. The paper states that the ALUs natively execute single-precision floating-point and 24-bit integer operations at full speed, and double-precision and 32-bit integer operations at reduced speeds.
What I want to know is: why do 32-bit integers execute at reduced speed when 32-bit SP floating point executes at full speed?
2) Secondly, we know that for AMD GCN GPUs, each SIMD array executes one quarter of a wavefront over 4 cycles. When an instruction is assigned to a SIMD unit, is it replicated across all 4, or does it take 4 different cycles for each SIMD unit to get an instruction?
If all 4 SIMD units execute the same instruction, then theoretically this gets us 4 wavefronts per 4 cycles. If it's the second case, then only 1 wavefront gets completed by the 4th cycle.
Although note that according to the GCN whitepaper, the Local Data Share (LDS) coalesces 16 lanes from 2 different SIMD units each cycle, so this gets us 2 complete wavefronts per 4 cycles. This seems to hint that it's the first case, since there is no way to get more than 1 wavefront completed per 4 cycles if the instructions aren't replicated across SIMD units.
3) Lastly I want to ask about a scenario.
Suppose I have a 2D workgroup assigned to a Compute Unit. The workgroup consists of 8x8 = 64 work items. Will the compute unit form 1 wavefront and execute this over 4 cycles in 1 SIMD unit, while the other 3 SIMD units remain idle? Or will something else happen?
why do the 32 bit integers execute at reduced speed when 32 bit SP floating point can execute all right?
If you look at how 32-bit floats are represented, you'll notice there is a sign bit, 8 bits of exponent, and a 23-bit mantissa (24 significant bits, counting the implicit leading 1). Presumably the GPU can use the floating-point ALU's capabilities directly to operate on 24-bit integers stored in what would normally be the mantissa. For operating on larger integers, explicit long multiplication of some kind will need to be done (like 64-bit arithmetic on a 32-bit CPU), slowing things down.
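To illustrate the "explicit long multiplication" point, here is a sketch of how a full 32x32-bit multiply can be assembled from narrower (here 16-bit) multiplies; this shows the flavor of the extra work involved, not AMD's actual hardware sequence:

#include <stdint.h>
#include <stdio.h>

/* Build a 32x32 -> 32-bit multiply from 16x16 -> 32-bit partial
 * products, the way wide multiplies are built on narrow hardware. */
static uint32_t mul32_from_16(uint32_t a, uint32_t b) {
    uint32_t a_lo = a & 0xFFFF, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFF, b_hi = b >> 16;
    /* The a_hi * b_hi term would land entirely above bit 31,
     * so it drops out of a 32-bit result. */
    uint32_t cross = a_lo * b_hi + a_hi * b_lo;
    return a_lo * b_lo + (cross << 16);
}

int main(void) {
    uint32_t a = 123456789u, b = 987654321u;
    printf("%u\n", mul32_from_16(a, b));  /* low 32 bits of a*b */
    printf("%u\n", a * b);                /* check: same value  */
    return 0;
}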

Does the 6502 use signed or unsigned 8-bit registers (Java)?

I'm writing an emulator for the 6502, and basically there are some instructions where an offset is saved in one of the registers (mostly X and Y). Since branch instructions use signed 8-bit integers, I'm wondering: do the registers keep their values as 8-bit signed? Meaning this:
switch (opcode) {
    // Bunch of opcodes
    case 0xD5:
        // Read the memory area, the final address being address + X offset
        int tempResult = a - readMemory(address + x);
        // Comparing some things, setting/disabling flags
        // Incrementing program counter and cycles/ticks
        break;
    // More opcodes
}
Let's say in this situation that x = 0xEE. In regular binary, this would mean that x = 238. In the 6502, however, the branch instruction uses a signed offset for jumping to memory addresses, so I'm wondering: is the 238 interpreted as -18 in this case, or is it just a regular unsigned 8-bit value?
It varies.
They're not explicitly signed or unsigned for arithmetic, logical, shift, or load and store operations.
The conditional branches (and the unconditional one on the later 6502 descendants) all take the argument as signed; otherwise loops would be extremely awkward.
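In emulator terms, the branch case from the question comes down to reinterpreting the very same byte as two's-complement at the point of use; a minimal C sketch (variable names are illustrative):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint8_t x = 0xEE;           /* stored unsigned: 238 */
    uint16_t pc = 0x8000;

    /* Branch: treat the same bits as a signed offset. */
    int8_t offset = (int8_t)x;  /* -18 on two's-complement targets */
    pc = (uint16_t)(pc + offset);

    printf("offset = %d, new pc = 0x%04X\n", offset, pc);  /* -18, 0x7FEE */
    return 0;
}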
zero, x addressing is achieved by performing an 8-bit addition of x to the zero page address, ignoring carry, and reading from the zero page. So e.g.
LDX #-126 ; which is +130 if unsigned
LDA 23, x
would read from address 23 + 130 = 153. But had it been LDA 223, x, the read would have been from (223 + 130) MOD 256 = 97 (a C sketch of this wraparound follows the list below).
absolute, x/y is unsigned and the carry works correctly (but crossing a page boundary costs an extra cycle)
(zero, x) is much like the direct version in that the offset is signed but the result is always within the zero page. Then the real address is read from there.
(zero), y is unsigned, with the carry working correctly and costing the extra cycle.
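A minimal C sketch of the mod-256 address arithmetic described in the list above (assuming the usual 6502 behavior; names are illustrative):

#include <stdint.h>
#include <stdio.h>

/* zero,x: 8-bit add, carry ignored, result stays in the zero page. */
static uint8_t zp_x(uint8_t zp_addr, uint8_t x) {
    return (uint8_t)(zp_addr + x);     /* wraps mod 256 */
}

/* absolute,x: 16-bit add, the carry between bytes works correctly. */
static uint16_t abs_x(uint16_t base, uint8_t x) {
    return (uint16_t)(base + x);
}

int main(void) {
    uint8_t x = 130;                   /* LDX #-126, i.e. +130 unsigned */
    printf("%u\n", zp_x(23, x));       /* 153 */
    printf("%u\n", zp_x(223, x));      /* (223 + 130) mod 256 = 97 */
    printf("%u\n", abs_x(0x1000, x));  /* 4226 = 0x1082, no wrap */
    return 0;
}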
The "sign" is simply the value of the most significant (aka bit 7) in an 8-bit byte.
6502 has support for signed values in these ways:
The N bit in .P - but it really just tells you if the last instruction turned on or off bit 7 of a memory location or register. It was common to use BPL/BMI to do stuff based on bit 7 in a memory location for flag or "boolean" like use.
The V bit of .P which is flipped "when the result of adding two positive numbers overflows and ends up negative, and when the result of adding two negative numbers overflows and ends up positive"
And of course obeying the sign bit for relative branch instructions only, e.g. BEQ with a value with bit 7 set will move to a lower memory location, not a higher one.
Beyond that, whether that bit means anything is completely up to you and your program. What really makes numbers signed or unsigned is how you display the numbers.
The linked article above goes into what one's complement and two's complement are and how they make the mathematics work without the 6502 having to care too much about the sign.

Assembler memory address representation

I'm trying to get into assembler and I often come across numbers in the following form:
org 7c00h
; initialize the stack:
mov ax, 07c0h
mov ss, ax
mov sp, 03feh ; top of the stack.
7c00h, 07c0h, 03feh - What is the name of this number notation? What do they mean? Why are they used over "normal" decimal numbers?
It's hexadecimal, the numeral system with 16 digits: 0-9 and A-F. Memory addresses are given in hex because it's shorter and easier to read, and the numbers that represent memory locations don't mean anything special to humans, so there's no sense in using long decimal numbers. I would guess that somewhere in the past someone had to type in some addresses by hand as well; might as well have started there.
Worth noting also, 0:7C00 is the boot sector load address.
Further worth noting: 07C0:03FE is the same address as 0:7FFE due to the way segmented addressing works.
This guy's left himself a 510-byte stack (he made the very typical off-by-two error in setting up the boot sector's stack).
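The segment arithmetic behind those notes, as a tiny C sketch (real mode: linear = segment * 16 + offset):

#include <stdint.h>
#include <stdio.h>

/* Real-mode 8086 address translation. */
static uint32_t linear(uint16_t seg, uint16_t off) {
    return ((uint32_t)seg << 4) + off;
}

int main(void) {
    /* 07C0:03FE and 0000:7FFE name the same byte. */
    printf("0x%05X\n", (unsigned)linear(0x07C0, 0x03FE));  /* 0x07FFE */
    printf("0x%05X\n", (unsigned)linear(0x0000, 0x7FFE));  /* 0x07FFE */
    return 0;
}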
These are numbers in hexadecimal notation, i.e. in base 16, where A to F have the digit values 10 to 15.
One advantage is that there is a more direct conversion to binary numbers. With a little bit of practice it is easy to see which bits in the number are 1 and which are 0.
Another is that many numbers used internally, such as memory addresses, are round numbers in hexadecimal, i.e. they contain a lot of zeros.
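To make the hex-to-binary correspondence concrete, a small C sketch that prints each hex digit of an address as its 4-bit group (illustrative only):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint16_t addr = 0x7C00;  /* the boot sector load address from above */

    /* Each hex digit maps to exactly one 4-bit group. */
    for (int shift = 12; shift >= 0; shift -= 4) {
        unsigned nibble = (addr >> shift) & 0xF;
        printf("%X = ", nibble);
        for (int b = 3; b >= 0; b--)
            putchar(((nibble >> b) & 1) ? '1' : '0');
        printf("   ");
    }
    printf("\n");  /* 7 = 0111   C = 1100   0 = 0000   0 = 0000 */
    return 0;
}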
