How to declare a global dword pointer in x86? - pointers

I need to declare two global dword pointers that point to two arrays. To my knowledge, global means declare in .data. Is that correct? What is the code to declare these dword pointers so that the are initialized to 0 in x86 assembly with NASM and Intel syntax?

I need to declare two global dword pointers that point to two arrays.
That's simple if i understand correctly, assuming they mean global as in to all files, make a file called pointers.asm or something and type:
[section] .bss
global _pointer1, _pointer2
_pointer1 resd 4 ; reserve 4 dwords
_pointer2 resd 4
We use a .bss section because that memory is set to zero so when you use it your variables are 0 initialized
Or you could just use a .data section if you want and initialize each element to zero yourself:
[section] .data
global _pointer1, _pointer2
_pointer1 dd 0,0,0,0
_pointer2 dd 0,0,0,0
Also still using a .data section, it could be done like this allowing you to specify the size of a buffer like with the .bss section:
[section] .data
global _pointer1, _pointer2
_pointer1 times 4 dd 0
_pointer2 times 4 dd 0
Regardless of the way you decide to do it, to use a pointer to an array declared globally in a separate file:
[section] .text
global _main
extern _pointer1
bytesize equ 4 ; dword = 4 bytes
_main:
; write each element of the array
mov dword [_pointer1 + 0 * bytesize], 0xa
mov dword [_pointer1 + 1 * bytesize], 0xb
mov dword [_pointer1 + 2 * bytesize], 0xc
mov dword [_pointer1 + 3 * bytesize], 0xd
; read each element of the array
mov eax, [_pointer1 + 0 * bytesize]
mov eax, [_pointer1 + 1 * bytesize]
mov eax, [_pointer1 + 2 * bytesize]
mov eax, [_pointer1 + 3 * bytesize]
ret
This main program returns with 0xd or 13 stored in eax, hopefully by looking at this you can get a grasp what is going on.

Related

How to Add Numbers in Assembly Language without using a loop?

I need this program to print out 1 + 3 + 4 + 10 = 18, but I haven't been successful so far. I can make it to print out 18 alone, but this is not what I am asked to do. I am not allowed to use loops. Could anyone help me with that?
INCLUDE Irvine32.inc
.data
y1 DWORD 1
y2 DWORD 3
y3 DWORD 4
y4 DWORD 10
plus byte " + ",0
equal byte " = ",0
.code
main PROC
exit
main ENDP
END main
OK, it took my entire last night to figure out, but this works.
INCLUDE Irvine32.inc ; like import
.data
y1 dword 1
y2 dword 3
y3 dword 4
y4 dword 10
plus byte " + ",0
equal byte " = ",0;
.code
main PROC
mov eax,0
mov edx,offset plus
mov ebx,0
mov eax,y1
call writedec
add ebx,eax
call writestring
mov eax,y2
call writedec
add ebx,eax
call writestring
mov eax,y3
call writedec
add ebx,eax
call writestring
mov eax,y4
call writedec
add ebx,eax
mov edx,offset equal
call writestring
mov eax,ebx
call writedec
exit
main ENDP
end main

In my assembly program, I am trying to calculate the equation of (((((2^0 + 2^1) * 2^2) + 2^3) * 2^4) + 2^5)

In my 80x86 assembly program, I am trying to calculate the equation of
(((((2^0 + 2^1) * 2^2) + 2^3) * 2^4) + 2^5)...(2^n), where each even exponent is preceded by a multiplication and each odd exponent is preceded by a plus. I have code, but my result is continuously off from the desired result. When 5 is put in for n, the result should be 354, however I get 330.
Any and all advice will be appreciated.
.586
.model flat
include io.h
.stack 4096
.data
number dword ?
prompt byte "enter the power", 0
string byte 40 dup (?), 0
result byte 11 dup (?), 0
lbl_msg byte "answer", 0
bool dword ?
runtot dword ?
.code
_MainProc proc
input prompt, string, 40
atod string
push eax
call power
add esp, 4
dtoa result, eax
output lbl_msg, result
mov eax, 0
ret
_MainProc endp
power proc
push ebp
mov ebp, esp
push ecx
mov bool, 1 ;initial boolean value
mov eax, 1
mov runtot, 2 ;to keep a running total
mov ecx, [ebp + 8]
jecxz done
loop1:
add eax, eax ;power of 2
test bool, ecx ;test case for whether exp is odd/even
jnz oddexp ;if boolean is 1
add runtot, eax ;if boolean is 0
loop loop1
oddexp:
mov ebx, eax ;move eax to seperate register for multiplication
mov eax, runtot ;move existing total for multiplication
mul ebx ;multiplication of old eax to new eax/running total
loop loop1
done:
mov eax, runtot ;move final runtotal for print
pop ecx
pop ebp
ret
power endp
end
You're overcomplicating your code with static variables and branching.
These are powers of 2, you can (and should) just left-shift by n instead of actually constructing 2^n and using a mul instruction.
add eax,eax is the best way to multiply by 2 (aka left shift by 1), but it's not clear why you're doing that to the value in EAX at that point. It's either the multiply result (which you probably should have stored back into runtot after mul), or it's that left-shifted by 1 after an even iteration.
If you were trying to make a 2^i variable (with a strength reduction optimization to shift by 1 every iteration instead of shifting by i), then your bug is that you clobber EAX with mul, and its setup, in the oddexp block.
As Jester points out, if the first loop loop1 falls through, it will fall through into oddexp:. When you're doing loop tail duplication, make sure you consider where fall-through will go from each tail if the loop does end there.
There's also no point in having a static variable called bool which holds a 1, which you only use as an operand for test. That implies to human readers that the mask sometimes needs to change; test ecx,1 is a lot clearer as a way to check the low bit for zero / non-zero.
You also don't need static storage for runtot, just use a register (like EAX where you want the result eventually anyway). 32-bit x86 has 7 registers (not including the stack pointer).
This is how I'd do it. Untested, but I simplified a lot by unrolling by 2. Then the test for odd/even goes away because that alternating pattern is hard-coded into the loop structure.
We increment and compare/branch twice in the loop, so unrolling didn't get rid of the loop overhead, just changed one of the loop branches into an an if() break that can leave the loop from the middle.
This is not the most efficient way to write this; the increment and early-exit check in the middle of the loop could be optimized away by counting another counter down from n, and leaving the loop if there are less than 2 steps left. (Then sort it out in the epilogue)
;; UNTESTED
power proc ; fastcall calling convention: arg: ECX = unsigned int n
; clobbers: ECX, EDX
; returns: EAX
push ebx ; save a call-preserved register for scratch space
mov eax, 1 ; EAX = 2^0 running total / return value
test ecx,ecx
jz done
mov edx, ecx ; EDX = n
mov ecx, 1 ; ECX = i=1..n loop counter and shift count
loop1: ; do{ // unrolled by 2
; add 2^odd power
mov ebx, 1
shl ebx, cl ; 2^i ; xor ebx, ebx; bts ebx, ecx
add eax, ebx ; total += 2^i
inc ecx
cmp ecx, edx
jae done ; if (++i >= n) break;
; multiply by 2^even power
shl eax, cl ; total <<= i; // same as total *= (1<<i)
inc ecx ; ++i
cmp ecx, edx
jb loop1 ; }while(i<n);
done:
pop ebx
ret
I didn't check if the adding-odd-power step ever produces a carry into another bit. I think it doesn't, so it could be safe to implement it as bts eax, ecx (setting bit i). Effectively an OR instead of an ADD, but those are equivalent as long as the bit was previously cleared.
To make the asm look more like the source and avoid obscure instructions, I implemented 1<<i with shl to generate 2^i for total += 2^i, instead of a more-efficient-on-Intel xor ebx,ebx / bts ebx, ecx. (Variable-count shifts are 3 uops on Intel Sandybridge-family because of x86 flag-handling legacy baggage: flags have to be untouched if count=0). But that's worse on AMD Ryzen, where bts reg,reg is 2 uops but shl reg,cl is 1.
Update: i=3 does produce a carry when adding, so we can't OR or BTS the bit for that case. But optimizations are possible with more branching.
Using calc:
; define shiftadd_power(n) { local res=1; local i; for(i=1;i<=n;i++){ res+=1<<i; i++; if(i>n)break; res<<=i;} return res;}
shiftadd_power(n) defined
; base2(2)
; shiftadd_power(0)
1 /* 1 */
...
The first few outputs are:
n shiftadd(n) (base2)
0 1
1 11
2 1100
3 10100 ; 1100 + 1000 carries
4 101000000
5 101100000 ; 101000000 + 100000 set a bit that was previously 0
6 101100000000000
7 101100010000000 ; increasing amounts of trailing zero around the bit being flipped by ADD
Peeling the first 3 iterations would enable the BTS optimization, where you just set the bit instead of actually creating 2^n and adding.
Instead of just peeling them, we can just hard-code the starting point for i=3 for larger n, and optimize the code that figures out a return value for the n<3 case. I came up with a branchless formula for that based on right-shifting the 0b1100 bit-pattern by 3, 2, or 0.
Also note that for n>=18, the last shift count is strictly greater than half the width of the register, and the 2^i from odd i has no low bits. So only the last 1 or 2 iterations can affect the result. It boils down to either 1<<n for odd n, or 0 for even n. This simplifies to (n&1) << n.
For n=14..17, there are at most 2 bits set. Starting with result=0 and doing the last 3 or 4 iterations should be enough to get the correct total. In fact, for any n, we only need to do the last k iterations, where k is enough that the total shift count from even i is >= 32. Any bits set by earlier iterations are shifted out. (I didn't add a branch for this special case.)
;; UNTESTED
;; special cases for n<3, and for n>=18
;; enabling an optimization in the main loop (BTS instead of add)
;; funky overflow behaviour for n>31: large odd n gives 1<<(n%32) instead of 0
power_optimized proc
; fastcall calling convention: arg: ECX = unsigned int n <= 31
; clobbers: ECX, EDX
; returns: EAX
mov eax, 14h ; 0b10100 = power(3)
cmp ecx, 3
ja n_gt_3 ; goto main loop or fall through to hard-coded low n
je early_ret
;; n=0, 1, or 2 => 1, 3, 12 (0b1, 0b11, 0b1100)
mov eax, 0ch ; 0b1100 to be right-shifted by 3, 2, or 0
cmp ecx, 1 ; count=0,1,2 => CF,ZF,neither flag set
setbe cl ; count=0,1,2 => cl=1,1,0
adc cl, cl ; 3,2,0 (cl = cl+cl + (count<1) )
shr eax, cl
early_ret:
ret
large_n: ; odd n: result = 1<<n. even n: result = 0
mov eax, ecx
and eax, 1 ; n&1
shl eax, cl ; n>31 will wrap the shift count so this "fails"
ret ; if you need to return 0 for all n>31, add another check
n_gt_3:
;; eax = running total for i=3 already
cmp ecx, 18
jae large_n
mov edx, ecx ; EDX = n
mov ecx, 4 ; ECX = i=4..n loop counter and shift count
loop1: ; do{ // unrolled by 2
; multiply by 2^even power
shl eax, cl ; total <<= i; // same as total *= (1<<i)
inc edx
cmp ecx, edx
jae done ; if (++i >= n) break;
; add 2^odd power. i>3 so it won't already be set (thus no carry)
bts eax, edx ; total |= 1<<i;
inc ecx ; ++i
cmp ecx, edx
jb loop1 ; }while(i<n);
done:
ret
By using BTS to set a bit in EAX avoids needing an extra scratch register to construct 1<<i in, so we don't have to save/restore EBX. So that's a minor bonus saving.
Notice that this time the main loop is entered with i=4, which is even, instead of i=1. So I swapped the add vs. shift.
I still didn't get around to pulling the cmp/jae out of the middle of the loop. Something like lea edx, [ecx-2] instead of mov would set the loop-exit condition, but would require a check to not run the loop at all for i=4 or 5. For large-count throughput, many CPUs can sustain 1 taken + 1 not-taken branch every 2 clocks, not creating a worse bottleneck than the loop-carried dep chains (through eax and ecx). But branch-prediction will be different, and it uses more branch-order-buffer entries to record more possible roll-back / fast-recovery points.

Volume of cone not returning correct value

My code assembles correctly, however it is not returning the volume of a cone. I tried a few other ways to get it to give me the volume of a cone, and it doesn't give me the correct answer. Is there something I am missing?
Edit: I fixed my code to output a number, however its not giving me the right number.
This is the formula I am using for this code:
V= (22)(r)(r)(h)/21
diviser DWORD 21
height WORD 0
radius BYTE 0 ;*** Declare an unsigned, integer variable for the value
product WORD 0
askRadius BYTE "What is the radius?", 0ah, 0dh, 0 ;*** Declare the prompt and message parts
askHeight BYTE "What is the height? ", 0
volumeOutput BYTE "The volume of the cone is: ", 0
lineBreak BYTE " ",0dh, 0ah, 0
.code
main PROC
mov ebx, diviser ;Initializes the diviser
mov edx,OFFSET askRadius ;Asks for Radius
call WriteString
call ReadDec
mov radius, al ;moves radius into al (8-bit)
mul radius ;Multiplies radius * radius(16-bit)
mov edx,OFFSET askHeight ;asks for Height
call writestring
call ReadDec
mov height, dx ;moves height into AX
mov WORD PTR product, ax ;convert 16-bit into 32-bit
mov WORD PTR product+2, dx ;converts 16-bit into 32-bit
mul eax ;multiplies by 22
mov edx, 22
mul edx
div ebx ;divides by 21
mov edx, OFFSET volumeOutput
call WriteString
call WriteDec
call WaitMsg
exit
main ENDP
END main
Something like this should work better. Note you should choose where you want to widen the calculation to 64 bit.
askRadius BYTE "What is the radius?", 0ah, 0dh, 0 ;*** Declare the prompt and message parts
askHeight BYTE "What is the height? ", 0
volumeOutput BYTE "The volume of the cone is: ", 0
lineBreak BYTE " ",0dh, 0ah, 0
.code
main PROC
mov edx,OFFSET askRadius ;Asks for Radius
call WriteString
call ReadDec
imul eax, eax ; radius * radius
mov ecx, eax ; save for later
mov edx,OFFSET askHeight ;asks for Height
call writestring
call ReadDec
imul eax, eax, 22
mul ecx ;multiplies by radius * radius
mov ecx, 21
div ecx ;divides by 21
mov edx, OFFSET volumeOutput
call WriteString
call WriteDec
call WaitMsg
exit
main ENDP
END main

x86-64 Big Integer Representation?

How do hig-performance native big-integer libraries on x86-64 represent a big integer in memory? (or does it vary? Is there a most common way?)
Naively I was thinking about storing them as 0-terminated strings of numbers in base 264.
For example suppose X is in memory as:
[8 bytes] Dn
.
.
[8 bytes] D2
[8 bytes] D1
[8 bytes] D0
[8 bytes] 0
Let B = 264
Then
X = Dn * Bn + ... + D2 * B2 + D1 * B1 + D0
The empty string (i.e. 8 bytes of zero) means zero.
Is this a reasonable way? What are the pros and cons of this way? Is there a better way?
How would you handle signedness? Does 2's complement work with this variable length value?
(Found this: http://gmplib.org/manual/Integer-Internals.html Whats a limb?)
I would think it would be as an array lowest value to highest. I implemented addition of arbitrary sized numbers in assembler. The CPU provides the carry flag that allows you to easily perform these sorts of operations. You write a loop that performs the operation in byte size chunks. The carry flag is included in the next operation using the "Add with carry" instruction (ADC opcode).
Here I have some examples of processing Big Integers.
Addition
Principle is pretty simple. You need to use CF (carry-flag) for any bigger overflow, with adc (add with carry) propagating that carry between chunks. Let's think about two 128-bit number addition.
num1_lo: dq 1<<63
num1_hi: dq 1<<63
num2_lo: dq 1<<63
num2_hi: dq 1<<62
;Result of addition should be 0xC0000000 0x000000001 0x00000000 0x00000000
mov eax, dword [num1_lo]
mov ebx, dword [num1_lo+4]
mov ecx, dword [num1_hi]
mov edx, dword [num1_hi+4]
add eax, dword [num2_lo]
adc ebx, dword [num2_lo+4]
adc ecx, dword [num2_hi]
adc edx, dword [num2_hi+4]
; 128-bit integer sum in EDX:ECX:EBX:EAX
jc .overflow ; detect wrapping if you want
You don't need all of it in registers at once; you could store a 32-bit chunk before loading the next, because mov doesn't affect FLAGS. (Looping is trickier, although dec/jnz is usable on modern CPUs which don't have partial-flag stalls for ADC reading CF after dec writes other FLAGS. See Problems with ADC/SBB and INC/DEC in tight loops on some CPUs)
Subtraction
Very similar to addition, although you CF is now called borrow.
mov eax, dword [num1_lo]
mov ebx, dword [num1_lo+4]
mov ecx, dword [num1_hi]
mov edx, dword [num1_hi+4]
sub eax, dword [num2_lo]
sbb ebx, dword [num2_lo+4]
sbb ecx, dword [num2_hi]
sbb edx, dword [num2_hi+4]
jb .overflow ;or jc
Multiplication
Is much more difficult. You need to multiply each part of first number with each part of second number and add the results. You don't have to multiply only two highest parts that will surely overflow. Pseudocode:
long long int /*128-bit*/ result = 0;
long long int n1 = ;
long long int n2 = ;
#define PART_WIDTH 32 //to be able to manipulate with numbers in 32-bit registers
int i_1 = 0; /*iteration index*/
for(each n-bit wide part of first number : n1_part) {
int i_2 = 0;
for(each n-bit wide part of second number : n2_part) {
result += (n1_part << (i_1*PART_WIDTH))*(n2_part << (i_2*PART_WIDTH));
i_2++;
}
i++;
}
Division
is even more complicated. User Brendan on OsDev.org forum posted example pseudocode for division of n-bit integers. I'm pasting it here because principle is the same.
result = 0;
count = 0;
remainder = numerator;
while(highest_bit_of_divisor_not_set) {
divisor = divisor << 1;
count++;
}
while(remainder != 0) {
if(remainder >= divisor) {
remainder = remainder - divisor;
result = result | (1 << count);
}
if(count == 0) {
break;
}
divisor = divisor >> 1;
count--;
}
Dividing a wide number by a 1-chunk (32 or 64-bit number) can use a sequence of div instructions, using the remainder of the high element as the high half of the dividend for the next lower chunk. See Why should EDX be 0 before using the DIV instruction? for an example of when div is useful with non-zero EDX.
But this doesn't generalize to N-chunk / N-chunk division, hence the above manual shift / subtract algorithm.

How to interpret a binary integer as ternary (base 3)?

My CPU register contains a binary integer 0101, equal to the decimal number 5:
0101 ( 4 + 1 = 5 )
I want the register to contain instead the binary integer equal to decimal 10, as if the original binary number 0101 were ternary (base 3) and every digit happens to be either 0 or 1:
0101 ( 9 + 1 = 10 )
How can i do this on a contemporary CPU or GPU with 1. the fewest memory reads and 2. the fewest hardware instructions?
Use an accumulator. C-ish Pseudocode:
var accumulator = 0
foreach digit in string
accumulator = accumulator * 3 + (digit - '0')
return accumulator
To speed up the multiply by 3, you might use ((accumulator << 1) + accumulator), but a good compiler will be able to do that for you.
If a large percentage of your numbers are within a relatively small range, you can also pregenerate a lookup table to make the transformation from base2 to base3 instantaneous (using the base2 value as the index). You can also use the lookup table to accelerate lookup of the first N digits, so you only pay for the conversion of the remaining digits.
This C program will do it:
#include <stdio.h>
main()
{
int binary = 5000; //Example
int ternary = 0;
int po3 = 1;
do
{
ternary += (binary & 1) * po3;
po3 *= 3;
}
while (binary >>= 1 != 0);
printf("%d\n",ternary);
}
The loop compiles into this machine code on my 32-bit Intel machine:
do
{
ternary += (binary & 1) * po3;
0041BB33 mov eax,dword ptr [binary]
0041BB36 and eax,1
0041BB39 imul eax,dword ptr [po3]
0041BB3D add eax,dword ptr [ternary]
0041BB40 mov dword ptr [ternary],eax
po3 *= 3;
0041BB43 mov eax,dword ptr [po3]
0041BB46 imul eax,eax,3
0041BB49 mov dword ptr [po3],eax
}
while (binary >>= 1 != 0);
0041BB4C mov eax,dword ptr [binary]
0041BB4F sar eax,1
0041BB51 mov dword ptr [binary],eax
0041BB54 jne main+33h (41BB33h)
For the example value (decimal 5000 = binary 1001110001000), the ternary value it produces is 559899.

Resources