x86-64 Big Integer Representation?

How do high-performance native big-integer libraries on x86-64 represent a big integer in memory? (Or does it vary? Is there a most common way?)
Naively I was thinking about storing them as 0-terminated strings of digits in base 2^64.
For example suppose X is in memory as:
[8 bytes] Dn
.
.
[8 bytes] D2
[8 bytes] D1
[8 bytes] D0
[8 bytes] 0
Let B = 2^64
Then
X = Dn * B^n + ... + D2 * B^2 + D1 * B^1 + D0
The empty string (i.e. 8 bytes of zero) means zero.
Is this a reasonable way? What are the pros and cons of this way? Is there a better way?
How would you handle signedness? Does 2's complement work with this variable length value?
(Found this: http://gmplib.org/manual/Integer-Internals.html What's a limb?)

I would think it would be stored as an array, lowest value to highest. (In GMP's terminology, each of those machine-word digits is a "limb".) I implemented addition of arbitrary-sized numbers in assembler. The CPU provides the carry flag, which allows you to easily perform these sorts of operations. You write a loop that performs the operation in byte-sized chunks; the carry flag is included in the next operation using the "add with carry" instruction (the ADC opcode).
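As a concrete illustration of that limb-array-plus-carry idea, here is a minimal C sketch (the function name and limb layout are my own, not from any particular library); the carry logic in the loop body is what a single adc instruction does in hardware:
#include <stdint.h>
#include <stddef.h>

/* r = a + b, all operands n limbs long, least significant limb first.
   Returns the final carry out (1 means the sum needs one more limb). */
uint64_t bigint_add(uint64_t *r, const uint64_t *a, const uint64_t *b, size_t n)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t sum = a[i] + b[i];
        uint64_t c1  = sum < a[i];     /* carry out of a[i] + b[i] */
        r[i]  = sum + carry;
        carry = c1 | (r[i] < sum);     /* plus carry out of adding the old carry */
    }
    return carry;
}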

Here I have some examples of processing Big Integers.
Addition
The principle is pretty simple. You need to use CF (the carry flag) for any overflow out of a chunk, with adc (add with carry) propagating that carry between chunks. Let's think about adding two 128-bit numbers.
num1_lo: dq 1<<63
num1_hi: dq 1<<63
num2_lo: dq 1<<63
num2_hi: dq 1<<62
;Result of addition should be 0xC0000000 0x00000001 0x00000000 0x00000000
mov eax, dword [num1_lo]
mov ebx, dword [num1_lo+4]
mov ecx, dword [num1_hi]
mov edx, dword [num1_hi+4]
add eax, dword [num2_lo]
adc ebx, dword [num2_lo+4]
adc ecx, dword [num2_hi]
adc edx, dword [num2_hi+4]
; 128-bit integer sum in EDX:ECX:EBX:EAX
jc .overflow ; detect wrapping if you want
You don't need all of it in registers at once; you could store a 32-bit chunk before loading the next, because mov doesn't affect FLAGS. (Looping is trickier, although dec/jnz is usable on modern CPUs which don't have partial-flag stalls for ADC reading CF after dec writes other FLAGS. See Problems with ADC/SBB and INC/DEC in tight loops on some CPUs)
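For comparison, on x86-64 the same 128-bit sum can be written in plain C using the unsigned __int128 extension (a GCC/Clang extension; this is a sketch, not portable code), and the compiler emits a single 64-bit add plus adc instead of the four 32-bit operations above:
#include <stdint.h>

typedef unsigned __int128 u128;

/* Returns (hi1:lo1) + (hi2:lo2) as a 128-bit value; compiles to add + adc. */
u128 add128(uint64_t lo1, uint64_t hi1, uint64_t lo2, uint64_t hi2)
{
    u128 a = ((u128)hi1 << 64) | lo1;
    u128 b = ((u128)hi2 << 64) | lo2;
    return a + b;
}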
Subtraction
Very similar to addition, although CF is now treated as a borrow.
mov eax, dword [num1_lo]
mov ebx, dword [num1_lo+4]
mov ecx, dword [num1_hi]
mov edx, dword [num1_hi+4]
sub eax, dword [num2_lo]
sbb ebx, dword [num2_lo+4]
sbb ecx, dword [num2_hi]
sbb edx, dword [num2_hi+4]
jb .overflow ;or jc
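The same idea in C form, paralleling the sbb chain (again my own sketch, mirroring the addition example earlier rather than any particular library):
#include <stdint.h>
#include <stddef.h>

/* r = a - b, n limbs each, least significant limb first.
   Returns the final borrow (1 means a < b, i.e. the true result is negative). */
uint64_t bigint_sub(uint64_t *r, const uint64_t *a, const uint64_t *b, size_t n)
{
    uint64_t borrow = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t diff = a[i] - b[i];
        uint64_t b1   = a[i] < b[i];    /* borrow out of a[i] - b[i] */
        r[i]   = diff - borrow;
        borrow = b1 | (diff < borrow);  /* plus borrow from subtracting the old borrow */
    }
    return borrow;
}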
Multiplication
Is much more difficult. You need to multiply each part of the first number by each part of the second number and add the partial results at the right offsets. (If you only keep a same-width result, you don't even have to multiply the two highest parts; that partial product only lands in bits that overflow.) Pseudocode:
long long int /*128-bit*/ result = 0;
long long int n1 = ;
long long int n2 = ;
#define PART_WIDTH 32 //so the parts can be handled in 32-bit registers
int i_1 = 0; /*iteration index*/
for(each n-bit wide part of first number : n1_part) {
    int i_2 = 0;
    for(each n-bit wide part of second number : n2_part) {
        result += (n1_part << (i_1*PART_WIDTH)) * (n2_part << (i_2*PART_WIDTH));
        i_2++;
    }
    i_1++;
}
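A rough C sketch of that schoolbook multiplication on 32-bit parts (the names and layout here are mine, not from the pseudocode): each partial product is at most 64 bits wide, its low half is added into place and its high half carries into the next part:
#include <stdint.h>
#include <stddef.h>

/* r (na+nb parts, must be zero-initialized) = a (na parts) * b (nb parts),
   least significant part first. */
void bigint_mul(uint32_t *r, const uint32_t *a, size_t na,
                const uint32_t *b, size_t nb)
{
    for (size_t i = 0; i < na; i++) {
        uint64_t carry = 0;
        for (size_t j = 0; j < nb; j++) {
            uint64_t t = (uint64_t)a[i] * b[j] + r[i + j] + carry;
            r[i + j] = (uint32_t)t;   /* low 32 bits stay at this position */
            carry    = t >> 32;       /* high 32 bits propagate upward */
        }
        r[i + nb] += (uint32_t)carry; /* this slot hasn't been written by row i yet */
    }
}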
Division
is even more complicated. User Brendan on the OSDev.org forum posted example pseudocode for division of n-bit integers; I'm pasting it here because the principle is the same.
result = 0;
count = 0;
remainder = numerator;
while(highest_bit_of_divisor_not_set) {
    divisor = divisor << 1;
    count++;
}
while(remainder != 0) {
    if(remainder >= divisor) {
        remainder = remainder - divisor;
        result = result | (1 << count);
    }
    if(count == 0) {
        break;
    }
    divisor = divisor >> 1;
    count--;
}
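Here is a direct C translation of that pseudocode for a single 64-bit word (assuming a non-zero divisor); for real big integers every compare, subtract and shift would itself be a multi-limb loop:
#include <stdint.h>

/* Returns numerator / divisor and stores the remainder; divisor must be non-zero. */
uint64_t shift_sub_div(uint64_t numerator, uint64_t divisor, uint64_t *remainder_out)
{
    uint64_t result = 0;
    uint64_t remainder = numerator;
    unsigned count = 0;

    while (!(divisor >> 63)) {     /* normalize: highest bit of divisor not set */
        divisor <<= 1;
        count++;
    }
    while (remainder != 0) {
        if (remainder >= divisor) {
            remainder -= divisor;
            result |= (uint64_t)1 << count;
        }
        if (count == 0)
            break;
        divisor >>= 1;
        count--;
    }
    *remainder_out = remainder;
    return result;
}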
Dividing a wide number by a 1-chunk (32 or 64-bit number) can use a sequence of div instructions, using the remainder of the high element as the high half of the dividend for the next lower chunk. See Why should EDX be 0 before using the DIV instruction? for an example of when div is useful with non-zero EDX.
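That div-chain case is easy to model in C as well (a sketch with my own names): dividing part by part from the top, each step's remainder becomes the high half of the next dividend, exactly like reusing EDX between div instructions:
#include <stdint.h>
#include <stddef.h>

/* q = num / d for an n-part number stored most significant part first
   (top-down order makes the remainder chaining obvious); returns num % d. */
uint32_t div_by_single(uint32_t *q, const uint32_t *num, size_t n, uint32_t d)
{
    uint64_t rem = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t cur = (rem << 32) | num[i];  /* remainder:part, like EDX:EAX */
        q[i] = (uint32_t)(cur / d);           /* what div leaves in EAX */
        rem  = cur % d;                       /* what div leaves in EDX */
    }
    return (uint32_t)rem;
}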
But this doesn't generalize to N-chunk / N-chunk division, hence the above manual shift / subtract algorithm.

Related

Segmentation fault in x86 assembly implementation of Collatz conjecture

I'm trying to write a subroutine that takes in a positive integer and returns the number of steps required for that integer to reach 1 by following the Collatz conjecture. If the input is 1, the output should be zero.
My pseudocode for this, which implements it recursively (we can only use recursion), is:
int threexplusone(int x){
    if(x == 1){
        return 0;
    }else{
        if(x % 2 == 0){
            return (threexplusone(x/2)+1);
        }else{
            return (threexplusone(3*x+1)+1);
        }
    }
}
My code is like this
threexplusone:
push rbx ;store the rbx to stack
mov rax, 0 ;store the base case
cmp rdi, 1 ;compare the input with 1
je done ;finished the loop if equal to 1
jmp threexplusone_recursive ;jump to the recursion
threexplusone_recursive:
mov rbx, rdi ;move rdi's value to rbx
sar rbx, 1 ;divide x by 2
sub rbx, rdi ;subtract to get the remainder
cmp rbx, 0 ;compare to check if the remainder is 0
je even ;start the even instruction
jne odd ;start the odd instruction
even:
sar rdi,1 ;divide the input by 2
xor rax,rax
call threexplusone ;do the recursion
inc rax ;add the result by one
jmp done ;
odd:
imul rdi,3 ;multiply x by 3
add rdi, 1 ;add x by 1
xor rax, rax
call threexplusone ;do the recursion
inc rax ;add the result by one
jmp done ;
done:
pop rbx
ret
I'm testing it in a cpp file, which:
1. Asks for an input value, x, which is the positive integer to pass to the subroutine.
2. Asks for an input value, n, which is the number of times to call the subroutine.
3. Runs the subroutine once and stores the result.
4. Runs the subroutine n times with the parameter x as the input.
5. Prints out the number of iterations it took for the integer to converge to 1.
int main(){
    int x;
    cout << "Enter a number: " << endl;
    cin >> x;
    int n;
    cout << "Enter iterations of subroutine: " << endl;
    cin >> n;
    int result = threexplusone(x);
    for(int i = 0; i < n; i++){
        result = threexplusone(x);
    }
    cout << result << endl;
    return 0;
}
The expected result should be
Enter a number:
100
Enter iterations of subroutine:
30
Steps take: 25
Instead, I get a segmentation fault:
Enter a number:
100
Enter iterations of subroutine:
30
Segmentation fault (core dumped)
Stepping through with the debugger, this is what it tells me:
Program received signal SIGSEGV, Segmentation fault.
0x000000000040133a in odd ()
So it seems like the problem is in the odd instruction part:
odd:
imul rdi,3 ;multiply x by 3
add rdi, 1 ;add x by 1
xor rax, rax
call threexplusone ;do the recursion
inc rax ;add the result by one
jmp done
I'm wondering which part of my assembly could be the cause? Thanks

How to bruteforce a lossy AND routine?

I'm wondering whether there are any standard approaches to reversing AND routines by brute force.
For example I have the following transformation:
MOV(eax, 0x5b3e0be0) # Here we move 0x5b3e0be0 to EAX.
MOV(edx, eax) # Here we copy it to EDX as well.
SHL(edx, 0x7) # Bitshift 0x5b3e0be0 with 0x7 which results in 0x9f05f000
AND(edx, 0x9d2c5680) # AND 0x9f05f000 with 0x9d2c5680 which results in 0x9d045000
XOR(edx, eax) # XOR 0x9d045000 with original value 0x5b3e0be0 which results in 0xc63a5be0
My question is how to brute-force and reverse this routine (i.e. transform 0xc63a5be0 back into 0x5b3e0be0).
One idea I had (which didn't work) was this, using a PeachPy implementation:
#Input values
MOV(esi, 0xffffffff) # Initial value to AND with, which will be decreased by 1 in a loop.
MOV(cl, 0x1) # Initial value to SHR with, which will be increased by 1 until 0x1f.
MOV(eax, 0xc63a5be0) # Target result which I'm looking to get using the below loop.
MOV(edx, 0x5b3e0be0) # Input value which will be transformed.
sub_esi = peachpy.x86_64.Label()
with loop:
    #End the loop if ESI = 0x0
    TEST(esi, esi)
    JZ(loop.end)
    #Test the routine and check if it matches the end result.
    MOV(ebx, eax)
    SHR(ebx, cl)
    TEST(ebx, ebx)
    JZ(sub_esi)
    AND(ebx, esi)
    XOR(ebx, eax)
    CMP(ebx, edx)
    JZ(loop.end)
    #Add to the CL register which is used for SHR.
    #Also check if we've reached the last potential value of CL, which is 0x1f.
    ADD(cl, 0x1)
    CMP(cl, 0x1f)
    JNZ(loop.begin)
    #Decrement ESI by 1, reset CL and restart the routine.
    peachpy.x86_64.LABEL(sub_esi)
    SUB(esi, 0x1)
    MOV(cl, 0x1)
    JMP(loop.begin)
#The ESI result here will either be 0x0 or a valid value to AND with and get the necessary result.
RETURN(esi)
Is there maybe an article or a book you can recommend specific to this?
It's not lossy, the final operation is an XOR.
The whole routine can be modeled in C as
#define K 0x9d2c5680

uint32_t hash(uint32_t num)
{
    return num ^ ( (num << 7) & K);
}
Now, if we have two bits x and y and the operation x XOR y, when y is zero the result is x.
So given two numbers n1 and n2 and considering their XOR, the bits of n1 that pair with a zero in n2 make it into the result unchanged (the others are flipped).
So in considering num ^ ( (num << 7) & K) we can identify num with n1 and (num << 7) & K with n2.
Since n2 is an AND, we can tell that it must have at least the same zero bits that K has.
This means that each bit of num that corresponds to a zero bit in the constant K will make it unchanged into the result.
Thus, by extracting those bits from the result we already have a partial inverse function:
/*hash & ~K extracts the bits of hash that pair with a zero bit in K*/
partial_num = hash & ~K
Technically, the factor num << 7 would also introduce other zeros in the result of the AND. We know for sure that the lowest 7 bits must be zero.
However K already has the lowest 7 bits zero, so we cannot exploit this information.
So we will just use K here, but if its value were different you'd need to consider the AND (which, in practice, means to zero the lower 7 bits of K).
This leaves us with 13 bits unknown (the ones corresponding to the bits that are set in K).
If we forget about the AND for a moment, we would have x ^ (x << 7), meaning that
h_i = num_i                 for i from 0 to 6 inclusive
h_i = num_i ^ num_(i-7)     for i from 7 to 31 inclusive
(The first line is due to the fact that the lower 7 bits of the right-hand operand are zero.)
From this, starting from h_7 and going up, we can retrieve num_7 as h_7 ^ num_0 = h_7 ^ h_0.
Going further up, num_(i-7) is no longer equal to h_(i-7) (that shortcut only works while i-7 < 7), so we need the already-recovered num_k (for the suitable k); luckily we have computed its value in a previous step (that's why we go from the lower bits to the higher ones).
What the AND does to this is just restricting the values the index i runs in, specifically only to the bits that are set in K.
So to fill in the thirteen remaining bits one has to do:
part_num_7 = h_7 ^ part_num_0
part_num_9 = h_9 ^ part_num_2
part_num_12 = h_12 ^ part_num_5
...
part_num_31 = h_31 ^ part_num_24
Note that we exploited the fact that part_num_0..6 = h_0..6.
Here's a C program that inverts the function:
#include <stdio.h>
#include <stdint.h>

#define BIT(i, hash, result) ( (((result >> i) ^ (hash >> (i+7))) & 0x1) << (i+7) )
#define K 0x9d2c5680

uint32_t base_candidate(uint32_t hash)
{
    uint32_t result = hash & ~K;
    result |= BIT(0, hash, result);
    result |= BIT(2, hash, result);
    result |= BIT(3, hash, result);
    result |= BIT(5, hash, result);
    result |= BIT(7, hash, result);
    result |= BIT(11, hash, result);
    result |= BIT(12, hash, result);
    result |= BIT(14, hash, result);
    result |= BIT(17, hash, result);
    result |= BIT(19, hash, result);
    result |= BIT(20, hash, result);
    result |= BIT(21, hash, result);
    result |= BIT(24, hash, result);
    return result;
}

uint32_t hash(uint32_t num)
{
    return num ^ ( (num << 7) & K);
}

int main()
{
    uint32_t tester = 0x5b3e0be0;
    uint32_t candidate = base_candidate(hash(tester));
    printf("candidate: %x, tester %x\n", candidate, tester);
    return 0;
}
Since the original question was how to "bruteforce" it rather than solve it, here's something that I eventually came up with which works just as well. Obviously it's prone to errors depending on the input (there might be more than one result).
from peachpy import *
from peachpy.x86_64 import *

input = 0xc63a5be0
x = Argument(uint32_t)

with Function("DotProduct", (x,), uint32_t) as asm_function:
    LOAD.ARGUMENT(edx, x) # EDX = x (the target hash)
    MOV(esi, 0xffffffff)
    with Loop() as loop:
        TEST(esi, esi)
        JZ(loop.end)
        MOV(eax, esi)
        SHL(eax, 0x7)
        AND(eax, 0x9d2c5680)
        XOR(eax, esi)
        CMP(eax, edx)
        JZ(loop.end)
        SUB(esi, 0x1)
        JMP(loop.begin)
    RETURN(esi)

#Read the assembler return value
abi = peachpy.x86_64.abi.detect()
encoded_function = asm_function.finalize(abi).encode()
python_function = encoded_function.load()
print(hex(python_function(input)))

Is it valid to add an entire array of bytes at once by converting them to a larger integer data type?

If I have two arrays that contain u8s, can I convert them into a larger integer type to reduce the number of additions I need to do? For example, if two byte arrays each contain 4 bytes, can I make them each into a u32, do the addition, and then convert them back?
For example:
let a = u32::from_ne_bytes([1, 2, 3, 4]);
let b = u32::from_ne_bytes([5, 6, 7, 8]);
let c = a + b;
let c_bytes = u32::to_ne_bytes(c);
assert_eq!(c_bytes, [6, 8, 10, 12]);
This example results in the correct output.
Does this always result in the right output (assuming there is no overflow)?
Is this faster than just doing the additions individually?
Does it hold true for other integer types? Such as 2 u16s in a u32 added with 2 other u16s in a u32?
If this exists and is common, what is it called?
Does this always result in the right output (assuming there is no overflow)?
Yes. Provided that each sum is less than 256, this will add the bytes as you want. You've specified "ne" in each case, for native endianness. This will work, regardless of the native endianness because the operations are byte-wise.
If you wrote code to actually check that the sums are all in range, then you would almost certainly undo any extra speed-up that you had got (if there was any to begin with).
Is this faster than just doing the additions individually?
Maybe. The only way to know for sure is to test.
Does it hold true for other integer types? Such as 2 u16s in a u32 added with 2 other u16s in a u32?
Yes, but you need to pay attention to byte order.
If this exists and is common, what is it called?
It's not common because it's usually unnecessary. This type of optimisation makes code harder to read and introduces considerable complexity and opportunities for bugs. The Rust compiler and LLVM between them are able to find extremely sophisticated optimisations, that you would never think of, while your code stays readable and maintainable.
If it has a name, it's SIMD, and most modern processors support a form of it natively (SSE, MMX, AVX). You can do this manually using the built-in functions, e.g. core::arch::x86_64::_mm_add_epi8, but LLVM might do it automatically. It's possible that trying to do this manually could interfere with optimisations that LLVM would otherwise do, while making your code more bug-prone at the same time.
I'm not an expert at assembly code by any means, but I took a look at the assembly generated for the following two functions:
#[no_mangle]
#[inline(never)]
pub fn f1(a1: u8, b1: u8, c1: u8, d1: u8, a2: u8, b2: u8, c2: u8, d2: u8) -> [u8; 4] {
    let a = u32::from_le_bytes([a1, b1, c1, d1]);
    let b = u32::from_le_bytes([a2, b2, c2, d2]);
    u32::to_le_bytes(a + b)
}

#[no_mangle]
#[inline(never)]
pub fn f2(a1: u8, b1: u8, c1: u8, d1: u8, a2: u8, b2: u8, c2: u8, d2: u8) -> [u8; 4] {
    [a1 + a2, b1 + b2, c1 + c2, d1 + d2]
}
The assembly for f1:
movzx r10d, byte ptr [rsp + 8]
shl ecx, 24
movzx eax, dl
shl eax, 16
movzx edx, sil
shl edx, 8
movzx esi, dil
or esi, edx
or esi, eax
or esi, ecx
mov ecx, dword ptr [rsp + 16]
shl ecx, 24
shl r10d, 16
movzx edx, r9b
shl edx, 8
movzx eax, r8b
or eax, edx
or eax, r10d
or eax, ecx
add eax, esi
ret
And for f2:
add r8b, dil
add r9b, sil
add dl, byte ptr [rsp + 8]
add cl, byte ptr [rsp + 16]
movzx ecx, cl
shl ecx, 24
movzx edx, dl
shl edx, 16
movzx esi, r9b
shl esi, 8
movzx eax, r8b
or eax, esi
or eax, edx
or eax, ecx
ret
Fewer instructions doesn't necessarily make it faster, but it's not a bad guideline.
Consider this kind of optimisation as a last resort, after careful measurement and testing.

In my assembly program, I am trying to calculate the equation of (((((2^0 + 2^1) * 2^2) + 2^3) * 2^4) + 2^5)

In my 80x86 assembly program, I am trying to calculate the equation of
(((((2^0 + 2^1) * 2^2) + 2^3) * 2^4) + 2^5)...(2^n), where each even exponent is preceded by a multiplication and each odd exponent is preceded by a plus. I have code, but my result is continuously off from the desired result. When 5 is put in for n, the result should be 354, however I get 330.
Any and all advice will be appreciated.
.586
.model flat
include io.h
.stack 4096
.data
number dword ?
prompt byte "enter the power", 0
string byte 40 dup (?), 0
result byte 11 dup (?), 0
lbl_msg byte "answer", 0
bool dword ?
runtot dword ?
.code
_MainProc proc
input prompt, string, 40
atod string
push eax
call power
add esp, 4
dtoa result, eax
output lbl_msg, result
mov eax, 0
ret
_MainProc endp
power proc
push ebp
mov ebp, esp
push ecx
mov bool, 1 ;initial boolean value
mov eax, 1
mov runtot, 2 ;to keep a running total
mov ecx, [ebp + 8]
jecxz done
loop1:
add eax, eax ;power of 2
test bool, ecx ;test case for whether exp is odd/even
jnz oddexp ;if boolean is 1
add runtot, eax ;if boolean is 0
loop loop1
oddexp:
mov ebx, eax ;move eax to separate register for multiplication
mov eax, runtot ;move existing total for multiplication
mul ebx ;multiplication of old eax to new eax/running total
loop loop1
done:
mov eax, runtot ;move final runtotal for print
pop ecx
pop ebp
ret
power endp
end
You're overcomplicating your code with static variables and branching.
These are powers of 2, you can (and should) just left-shift by n instead of actually constructing 2^n and using a mul instruction.
add eax,eax is the best way to multiply by 2 (aka left shift by 1), but it's not clear why you're doing that to the value in EAX at that point. It's either the multiply result (which you probably should have stored back into runtot after mul), or it's that left-shifted by 1 after an even iteration.
If you were trying to make a 2^i variable (with a strength reduction optimization to shift by 1 every iteration instead of shifting by i), then your bug is that you clobber EAX with mul, and its setup, in the oddexp block.
As Jester points out, if the first loop loop1 falls through, it will fall through into oddexp:. When you're doing loop tail duplication, make sure you consider where fall-through will go from each tail if the loop does end there.
There's also no point in having a static variable called bool which holds a 1, which you only use as an operand for test. That implies to human readers that the mask sometimes needs to change; test ecx,1 is a lot clearer as a way to check the low bit for zero / non-zero.
You also don't need static storage for runtot, just use a register (like EAX where you want the result eventually anyway). 32-bit x86 has 7 registers (not including the stack pointer).
This is how I'd do it. Untested, but I simplified a lot by unrolling by 2. Then the test for odd/even goes away because that alternating pattern is hard-coded into the loop structure.
We increment and compare/branch twice in the loop, so unrolling didn't get rid of the loop overhead, just changed one of the loop branches into an if() break that can leave the loop from the middle.
This is not the most efficient way to write this; the increment and early-exit check in the middle of the loop could be optimized away by counting another counter down from n, and leaving the loop if there are less than 2 steps left. (Then sort it out in the epilogue)
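As a reference for the loop structure (this matches the calc definition further down, and is my own sketch rather than a transcription of the asm), the unrolled-by-2 shift/add loop in C might look like this:
/* result starts at 2^0; each odd exponent adds 2^i, each even exponent
   multiplies by 2^i (i.e. shifts left by i). Assumes n is small enough
   that the result fits in 32 bits. */
unsigned shiftadd_power(unsigned n)
{
    unsigned result = 1;            /* 2^0 */
    for (unsigned i = 1; i <= n; i++) {
        result += 1u << i;          /* odd step: + 2^i */
        i++;
        if (i > n)
            break;
        result <<= i;               /* even step: * 2^i */
    }
    return result;
}
For example, shiftadd_power(5) evaluates (((((2^0 + 2^1) * 2^2) + 2^3) * 2^4) + 2^5) = 352, matching the table below.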
;; UNTESTED
power proc ; fastcall calling convention: arg: ECX = unsigned int n
; clobbers: ECX, EDX
; returns: EAX
push ebx ; save a call-preserved register for scratch space
mov eax, 1 ; EAX = 2^0 running total / return value
test ecx,ecx
jz done
mov edx, ecx ; EDX = n
mov ecx, 1 ; ECX = i=1..n loop counter and shift count
loop1: ; do{ // unrolled by 2
; add 2^odd power
mov ebx, 1
shl ebx, cl ; 2^i ; xor ebx, ebx; bts ebx, ecx
add eax, ebx ; total += 2^i
inc ecx
cmp ecx, edx
ja done ; if (++i > n) break;
; multiply by 2^even power
shl eax, cl ; total <<= i; // same as total *= (1<<i)
inc ecx ; ++i
cmp ecx, edx
jbe loop1 ; }while(i<=n);
done:
pop ebx
ret
I didn't check if the adding-odd-power step ever produces a carry into another bit. I think it doesn't, so it could be safe to implement it as bts eax, ecx (setting bit i). Effectively an OR instead of an ADD, but those are equivalent as long as the bit was previously cleared.
To make the asm look more like the source and avoid obscure instructions, I implemented 1<<i with shl to generate 2^i for total += 2^i, instead of a more-efficient-on-Intel xor ebx,ebx / bts ebx, ecx. (Variable-count shifts are 3 uops on Intel Sandybridge-family because of x86 flag-handling legacy baggage: flags have to be untouched if count=0). But that's worse on AMD Ryzen, where bts reg,reg is 2 uops but shl reg,cl is 1.
Update: i=3 does produce a carry when adding, so we can't OR or BTS the bit for that case. But optimizations are possible with more branching.
Using calc:
; define shiftadd_power(n) { local res=1; local i; for(i=1;i<=n;i++){ res+=1<<i; i++; if(i>n)break; res<<=i;} return res;}
shiftadd_power(n) defined
; base2(2)
; shiftadd_power(0)
1 /* 1 */
...
The first few outputs are:
n shiftadd(n) (base2)
0 1
1 11
2 1100
3 10100 ; 1100 + 1000 carries
4 101000000
5 101100000 ; 101000000 + 100000 set a bit that was previously 0
6 101100000000000
7 101100010000000 ; increasing amounts of trailing zero around the bit being flipped by ADD
Peeling the first 3 iterations would enable the BTS optimization, where you just set the bit instead of actually creating 2^n and adding.
Instead of just peeling them, we can just hard-code the starting point for i=3 for larger n, and optimize the code that figures out a return value for the n<3 case. I came up with a branchless formula for that based on right-shifting the 0b1100 bit-pattern by 3, 2, or 0.
Also note that for n>=18, the last shift count is strictly greater than half the width of the register, and the 2^i from odd i has no low bits. So only the last 1 or 2 iterations can affect the result. It boils down to either 1<<n for odd n, or 0 for even n. This simplifies to (n&1) << n.
For n=14..17, there are at most 2 bits set. Starting with result=0 and doing the last 3 or 4 iterations should be enough to get the correct total. In fact, for any n, we only need to do the last k iterations, where k is enough that the total shift count from even i is >= 32. Any bits set by earlier iterations are shifted out. (I didn't add a branch for this special case.)
;; UNTESTED
;; special cases for n<3, and for n>=18
;; enabling an optimization in the main loop (BTS instead of add)
;; funky overflow behaviour for n>31: large odd n gives 1<<(n%32) instead of 0
power_optimized proc
; fastcall calling convention: arg: ECX = unsigned int n <= 31
; clobbers: ECX, EDX
; returns: EAX
mov eax, 14h ; 0b10100 = power(3)
cmp ecx, 3
ja n_gt_3 ; goto main loop or fall through to hard-coded low n
je early_ret
;; n=0, 1, or 2 => 1, 3, 12 (0b1, 0b11, 0b1100)
mov eax, 0ch ; 0b1100 to be right-shifted by 3, 2, or 0
cmp ecx, 1 ; count=0,1,2 => CF,ZF,neither flag set
setbe cl ; count=0,1,2 => cl=1,1,0
adc cl, cl ; 3,2,0 (cl = cl+cl + (count<1) )
shr eax, cl
early_ret:
ret
large_n: ; odd n: result = 1<<n. even n: result = 0
mov eax, ecx
and eax, 1 ; n&1
shl eax, cl ; n>31 will wrap the shift count so this "fails"
ret ; if you need to return 0 for all n>31, add another check
n_gt_3:
;; eax = running total for i=3 already
cmp ecx, 18
jae large_n
mov edx, ecx ; EDX = n
mov ecx, 4 ; ECX = i=4..n loop counter and shift count
loop1: ; do{ // unrolled by 2
; multiply by 2^even power
shl eax, cl ; total <<= i; // same as total *= (1<<i)
inc ecx
cmp ecx, edx
ja done ; if (++i > n) break;
; add 2^odd power. i>3 so it won't already be set (thus no carry)
bts eax, ecx ; total |= 1<<i;
inc ecx ; ++i
cmp ecx, edx
jbe loop1 ; }while(i<=n);
done:
ret
Using BTS to set a bit in EAX avoids needing an extra scratch register to construct 1<<i in, so we don't have to save/restore EBX. That's a minor bonus saving.
Notice that this time the main loop is entered with i=4, which is even, instead of i=1. So I swapped the add vs. shift.
I still didn't get around to pulling the cmp/jae out of the middle of the loop. Something like lea edx, [ecx-2] instead of mov would set the loop-exit condition, but would require a check to not run the loop at all for i=4 or 5. For large-count throughput, many CPUs can sustain 1 taken + 1 not-taken branch every 2 clocks, not creating a worse bottleneck than the loop-carried dep chains (through eax and ecx). But branch-prediction will be different, and it uses more branch-order-buffer entries to record more possible roll-back / fast-recovery points.

How to interpret a binary integer as ternary (base 3)?

My CPU register contains a binary integer 0101, equal to the decimal number 5:
0101 ( 4 + 1 = 5 )
I want the register to contain instead the binary integer equal to decimal 10, as if the original binary number 0101 were ternary (base 3) and every digit happens to be either 0 or 1:
0101 ( 9 + 1 = 10 )
How can I do this on a contemporary CPU or GPU with 1. the fewest memory reads and 2. the fewest hardware instructions?
Use an accumulator. C-ish Pseudocode:
var accumulator = 0
foreach digit in string
    accumulator = accumulator * 3 + (digit - '0')
return accumulator
To speed up the multiply by 3, you might use ((accumulator << 1) + accumulator), but a good compiler will be able to do that for you.
If a large percentage of your numbers are within a relatively small range, you can also pregenerate a lookup table to make the transformation from base2 to base3 instantaneous (using the base2 value as the index). You can also use the lookup table to accelerate lookup of the first N digits, so you only pay for the conversion of the remaining digits.
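A sketch of that lookup-table idea for the "reinterpret the bits" case, assuming an 8-bit table and a 16-bit input (the table name and the byte split are my own): each byte's bits are converted once up front, and the high byte's digits simply carry a weight of 3^8:
#include <stdint.h>

static uint32_t tbl[256];   /* tbl[b] = value of b's bits read as base-3 digits */

static void init_tbl(void)
{
    for (int b = 0; b < 256; b++) {
        uint32_t v = 0, po3 = 1;
        for (int i = 0; i < 8; i++) {
            v += (uint32_t)((b >> i) & 1) * po3;
            po3 *= 3;
        }
        tbl[b] = v;
    }
}

/* The high byte's digits sit at ternary weights 3^8 .. 3^15. */
static uint32_t binary_digits_as_ternary(uint16_t n)
{
    return tbl[n >> 8] * 6561u /* 3^8 */ + tbl[n & 0xFF];
}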
This C program will do it:
#include <stdio.h>
main()
{
    int binary = 5000; //Example
    int ternary = 0;
    int po3 = 1;
    do
    {
        ternary += (binary & 1) * po3;
        po3 *= 3;
    }
    while (binary >>= 1 != 0);
    printf("%d\n",ternary);
}
The loop compiles into this machine code on my 32-bit Intel machine:
do
{
ternary += (binary & 1) * po3;
0041BB33 mov eax,dword ptr [binary]
0041BB36 and eax,1
0041BB39 imul eax,dword ptr [po3]
0041BB3D add eax,dword ptr [ternary]
0041BB40 mov dword ptr [ternary],eax
po3 *= 3;
0041BB43 mov eax,dword ptr [po3]
0041BB46 imul eax,eax,3
0041BB49 mov dword ptr [po3],eax
}
while (binary >>= 1 != 0);
0041BB4C mov eax,dword ptr [binary]
0041BB4F sar eax,1
0041BB51 mov dword ptr [binary],eax
0041BB54 jne main+33h (41BB33h)
For the example value (decimal 5000 = binary 1001110001000), the ternary value it produces is 559899.
