Any ideas? I'm using the GCC cross-compiler for a PPC750. Doing a simple multiply operation of two floating-point numbers in a loop and timing it. I declared the variables to be volatile to make sure nothing important was optimized out, and the code sped up!
I've inspected the assembly instructions for both cases and, sure enough, the compiler generated many more instructions to do the same basic job in the non-volatile case. Execution time for 10,000,000 iterations dropped from 800ms to 300ms!
assembly for volatile case:
0x10eeec stwu r1,-32(r1)
0x10eef0 lis r9,0x1d # 29
0x10eef4 lis r11,0x4080 # 16512
0x10eef8 lfs fr0,-18944(r9)
0x10eefc li r0,0x0 # 0
0x10ef00 lis r9,0x98 # 152
0x10ef04 stfs fr0,8(r1)
0x10ef08 mtspr CTR,r9
0x10ef0c stw r11,12(r1)
0x10ef10 stw r0,16(r1)
0x10ef14 ori r9,r9,0x9680
0x10ef18 mtspr CTR,r9
0x10ef1c lfs fr0,8(r1)
0x10ef20 lfs fr13,12(r1)
0x10ef24 fmuls fr0,fr0,fr13
0x10ef28 stfs fr0,16(r1)
0x10ef2c bc 0x10,0, 0x10ef1c # 0x0010ef1c
0x10ef30 addi r1,r1,0x20 # 32
assembly for non-volatile case:
0x10ef04 stwu r1,-48(r1)
0x10ef08 stw r31,44(r1)
0x10ef0c or r31,r1,r1
0x10ef10 lis r9,0x1d # 29
0x10ef14 lfs fr0,-18832(r9)
0x10ef18 stfs fr0,12(r31)
0x10ef1c lis r0,0x4080 # 16512
0x10ef20 stw r0,16(r31)
0x10ef24 li r0,0x0 # 0
0x10ef28 stw r0,20(r31)
0x10ef2c li r0,0x0 # 0
0x10ef30 stw r0,8(r31)
0x10ef34 lwz r0,8(r31)
0x10ef38 lis r9,0x98 # 152
0x10ef3c ori r9,r9,0x967f
0x10ef40 cmpl crf0,0,r0,r9
0x10ef44 bc 0x4,1, 0x10ef4c # 0x0010ef4c
0x10ef48 b 0x10ef6c # 0x0010ef6c
0x10ef4c lfs fr0,12(r31)
0x10ef50 lfs fr13,16(r31)
0x10ef54 fmuls fr0,fr0,fr13
0x10ef58 stfs fr0,20(r31)
0x10ef5c lwz r9,8(r31)
0x10ef60 addi r0,r9,0x1 # 1
0x10ef64 stw r0,8(r31)
0x10ef68 b 0x10ef34 # 0x0010ef34
0x10ef6c lwz r11,0(r1)
0x10ef70 lwz r31,-4(r11)
0x10ef74 or r1,r11,r11
0x10ef78 blr
If I read this correctly, it's loading the values from memory during every iteration in both cases, but it seems to have generated a lot more instructions to do so in the non-volatile case.
Here's the source:
void floatTest()
{
    unsigned long i;
    volatile double d1 = 500.234, d2 = 4.000001, d3 = 0;

    for (i = 0; i < 10000000; i++)
        d3 = d1*d2;
}
Are you sure you didn't also change optimization settings?
The original looks un-optimized - here's the looping part:
0x10ef34 lwz r0,8(r31) //Put 'i' in r0.
0x10ef38 lis r9,0x98 # 152 //Put MSB of 10000000 in r9
0x10ef3c ori r9,r9,0x967f //Put LSB of 9999999 (=10000000-1) in r9
0x10ef40 cmpl crf0,0,r0,r9 //compare r0 to r9
0x10ef44 bc 0x4,1, 0x10ef4c //branch into loop body if r0<=r9
0x10ef48 b 0x10ef6c //else branch to end
0x10ef4c lfs fr0,12(r31) //load d1
0x10ef50 lfs fr13,16(r31) //load d2
0x10ef54 fmuls fr0,fr0,fr13 //multiply
0x10ef58 stfs fr0,20(r31) //save d3
0x10ef5c lwz r9,8(r31) //load i into r9
0x10ef60 addi r0,r9,0x1 //add 1
0x10ef64 stw r0,8(r31) //save i
0x10ef68 b 0x10ef34 //go back to top, must reload r9
The volatile version looks quite optimized: it rearranges instructions and uses the special-purpose count register (CTR) instead of storing i on the stack:
0x10ef00 lis r9,0x98 # 152 //MSB of 10M
//.. 4 initialization instructions here ..
0x10ef14 ori r9,r9,0x9680 //LSB of 10,000,000
0x10ef18 mtspr CTR,r9 // store r9 in Special Purpose CTR register
0x10ef1c lfs fr0,8(r1) // load d1
0x10ef20 lfs fr13,12(r1) // load d2
0x10ef24 fmuls fr0,fr0,fr13 // multiply
0x10ef28 stfs fr0,16(r1) // store result
0x10ef2c bc 0x10,0, 0x10ef1c // decrement counter and branch if not 0.
The CTR optimization reduces the loop to 5 instructions, instead of the 14 in the original code. I don't see any reason 'volatile' by itself would enable that optimization.
I'm trying to implement functions and recursion in an ASM-like simplified language that has no procedures. Only simple jumpz, jump, push, pop, add, mul type commands.
Here are the commands:
(all variables and literals are integers)
set (sets the value of an already existing variable or declares and initializes a new variable) e.g. (set x 3)
push (pushes a value onto the stack. can be a variable or an integer) e.g. (push 3) or (push x)
pop (pops the stack into a variable) e.g. (pop x)
add (adds the second argument to the first argument) e.g. (add x 1) or (add x y)
mul (same as add but for multiplication)
jump (jumps to a specific line of code) e.g. (jump 3) would jump to line 3, or (jump x) would jump to the line number equal to the value of x
jumpz (jumps to a line number if the second argument is equal to zero) e.g. (jumpz 3 x) or (jumpz z x)
The variable 'IP' is the program counter and is equal to the line number of the current line of code being executed.
In this language, functions are blocks of code at the bottom of the program that end by popping a value off the stack and jumping to that value (using the stack as a call stack). Functions can then be called from anywhere else in the program simply by pushing the instruction pointer onto the stack and jumping to the start of the function.
This works fine for non-recursive functions.
How could this be modified to handle recursion?
I've read that implementing recursion with a stack is a matter of pushing parameters and local variables onto the stack (and, at this lower level, I think also the instruction pointer).
I wouldn't be able to write something like x = f(n) directly. Instead I'd have some variable y (that is also used in the body of f), set y equal to n, call f (which assigns its "return value" to y and then jumps control back to where it was called from), and then set x equal to y.
(a function that squares a number whose definition starts at line 36)
1 - set y 3
2 - set returnLine IP
3 - add returnLine 4
4 - push returnLine
5 - jump 36
6 - set x y
...
36 - mul y y
37 - pop returnLine
38 - jump returnLine
This doesn't seem to lend itself to recursion. Arguments and intermediate values would need to go on the stack, and I think recursive calls would result in multiple instances of the same return address on the stack, which is fine.
The following code raises the number "base" to the power "exponent" recursively in "John Smith Assembly":
1 - set base 2 ;RAISE 2 TO ...
2 - set exponent 4 ;... EXPONENT 4 (2^4=16).
3 - set result 1 ;MUST BE 1 IN ORDER TO MULTIPLY.
4 - set returnLine IP ;IP = 4.
5 - add returnLine 4 ;RETURNLINE = 4+4.
6 - push returnLine ;PUSH 8.
7 - jump 36 ;CALL FUNCTION.
.
.
.
;POWER FUNCTION.
36 - jumpz 43 exponent ;FINISH IF EXPONENT IS ZERO.
37 - mul result base ;RESULT = ( RESULT * BASE ).
38 - add exponent -1 ;RECURSIVE CONTROL VARIABLE.
39 - set returnLine IP ;IP = 39.
40 - add returnLine 4 ;RETURN LINE = 39+4.
41 - push returnLine ;PUSH 43.
42 - jump 36 ;RECURSIVE CALL.
43 - pop returnLine
44 - jump returnLine
;POWER END.
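The listing can also be checked mechanically. Below is a minimal Python interpreter for the toy commands; it is a sketch, and the halt convention (jumping to line 0) and the filler used for the unused lines 8-35 are my own assumptions, not part of the language:

```python
# Sketch of an interpreter for the toy language. Assumptions (not from
# the post): jumping to line 0 halts, and unused lines are filled with
# a halt so the program text is contiguous.
def run(program):
    vars, stack = {}, []

    def val(tok):  # integer literal or variable lookup
        return int(tok) if tok.lstrip('-').isdigit() else vars[tok]

    ip = 1  # 1-based line numbers, as in the question
    while 1 <= ip <= len(program):
        vars['IP'] = ip  # IP always holds the current line number
        op, *args = program[ip - 1].split()
        if op == 'set':
            vars[args[0]] = val(args[1])
        elif op == 'push':
            stack.append(val(args[0]))
        elif op == 'pop':
            vars[args[0]] = stack.pop()
        elif op == 'add':
            vars[args[0]] += val(args[1])
        elif op == 'mul':
            vars[args[0]] *= val(args[1])
        elif op == 'jump':
            ip = val(args[0]); continue
        elif op == 'jumpz':
            if val(args[1]) == 0:
                ip = val(args[0]); continue
        ip += 1
    return vars

# The power routine from above; line 8 (the return line) halts.
lines = {
    1: 'set base 2',         2: 'set exponent 4',
    3: 'set result 1',       4: 'set returnLine IP',
    5: 'add returnLine 4',   6: 'push returnLine',
    7: 'jump 36',            8: 'jump 0',
    36: 'jumpz 43 exponent', 37: 'mul result base',
    38: 'add exponent -1',   39: 'set returnLine IP',
    40: 'add returnLine 4',  41: 'push returnLine',
    42: 'jump 36',           43: 'pop returnLine',
    44: 'jump returnLine',
}
program = [lines.get(i, 'jump 0') for i in range(1, 45)]
print(run(program)['result'])  # -> 16, i.e. 2^4
```

Running it prints 16, matching the manual trace.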
In order to test it, let's run it manually:
LINE | BASE EXPONENT RESULT RETURNLINE STACK
------|---------------------------------------
   1  |   2
   2  |         4
   3  |                  1
   4  |                          4
   5  |                          8
   6  |                                     8
   7  |
  36  |
  37  |                  2
  38  |         3
  39  |                         39
  40  |                         43
  41  |                                 43(1)
  42  |
  36  |
  37  |                  4
  38  |         2
  39  |                         39
  40  |                         43
  41  |                                 43(2)
  42  |
  36  |
  37  |                  8
  38  |         1
  39  |                         39
  40  |                         43
  41  |                                 43(3)
  42  |
  36  |
  37  |                 16
  38  |         0
  39  |                         39
  40  |                         43
  41  |                                 43(4)
  42  |
  36  |
  43  |                      43(4)
  44  |
  43  |                      43(3)
  44  |
  43  |                      43(2)
  44  |
  43  |                      43(1)
  44  |
  43  |                          8
  44  |
   8  |
Edit: the function's parameters are now passed on the stack (I didn't run this one manually):
1 - set base 2 ;RAISE 2 TO ...
2 - set exponent 4 ;... EXPONENT 4 (2^4=16).
3 - set result 1 ;MUST BE 1 IN ORDER TO MULTIPLY.
4 - set returnLine IP ;IP = 4.
5 - add returnLine 7 ;RETURNLINE = 4+7.
6 - push returnLine ;PUSH 11.
7 - push base ;FIRST PARAMETER.
8 - push result ;SECOND PARAMETER.
9 - push exponent ;THIRD PARAMETER.
10 - jump 36 ;FUNCTION CALL.
...
;POWER FUNCTION.
36 - pop exponent ;THIRD PARAMETER.
37 - pop result ;SECOND PARAMETER.
38 - pop base ;FIRST PARAMETER.
39 - jumpz 49 exponent ;FINISH IF EXPONENT IS ZERO.
40 - mul result base ;RESULT = ( RESULT * BASE ).
41 - add exponent -1 ;RECURSIVE CONTROL VARIABLE.
42 - set returnLine IP ;IP = 42.
43 - add returnLine 7 ;RETURN LINE = 42+7.
44 - push returnLine ;PUSH 49.
45 - push base
46 - push result
47 - push exponent
48 - jump 36 ;RECURSIVE CALL.
49 - pop returnLine
50 - jump returnLine
;POWER END.
Your asm does provide enough facilities to implement the usual procedure call / return sequence. You can push a return address and jump as a call, and pop a return address (into a scratch location) and do an indirect jump to it as a ret. We can just make call and ret macros. (Except that generating the correct return address is tricky in a macro; we might need a label (push ret_addr), or something like set tmp, IP / add tmp, 4 / push tmp / jump target_function). In short, it's possible and we should wrap it up in some syntactic sugar so we don't get bogged down with that while looking at recursion.
With the right syntactic sugar, you can implement Fibonacci(n) in assembly that will actually assemble for both x86 and your toy machine.
You're thinking in terms of functions that modify static (global) variables. Recursion requires local variables so each nested call to the function has its own copy of local variables. Instead of having registers, your machine has (apparently unlimited) named static variables (like x and y). If you want to program it like MIPS or x86, and copy an existing calling convention, just use some named variables like eax, ebx, ..., or r0 .. r31 the way a register architecture uses registers.
Then you implement recursion the same way you do in a normal calling convention, where either the caller or callee use push / pop to save/restore a register on the stack so it can be reused. Function return values go in a register. Function args should go in registers. An ugly alternative would be to push them after the return address (creating a caller-cleans-the-args-from-the-stack calling convention), because you don't have a stack-relative addressing mode to access them the way x86 does (above the return address on the stack). Or you could pass return addresses in a link register, like most RISC call instructions (usually called bl or similar, for branch-and-link), instead of pushing it like x86's call. (So non-leaf callees have to push the incoming lr onto the stack themselves before making another call)
A (silly and slow) naively-implemented recursive Fibonacci might do something like:
int Fib(int n) {
    if (n <= 1) return n;   // Fib(0) = 0; Fib(1) = 1
    return Fib(n-1) + Fib(n-2);
}
## valid implementation in your toy language *and* x86 (AMD64 System V calling convention)
### Convenience macros for the toy asm implementation
# pretend that the call implementation has some way to make each return_address label unique so you can use it multiple times.
# i.e. just pretend that pushing a return address and jumping is a solved problem, however you want to solve it.
%define call(target) push return_address; jump target; return_address:
%define ret pop rettmp; jump rettmp # dedicate a whole variable just for ret, because we can
# As the first thing in your program, set eax, 0 / set ebx, 0 / ...
global Fib
Fib:
# input: n in edi.
# output: return value in eax
# if (n<=1) return n; // the asm implementation of this part isn't interesting or relevant. We know it's possible with some adds and jumps, so just pseudocode / handwave it:
... set eax, edi and ret if edi <= 1 ... # (not shown because not interesting)
add edi, -1
push edi # save n-1 for use after the recursive call
call Fib # eax = Fib(n-1)
pop edi # restore edi to *our* n-1
push eax # save the Fib(n-1) result across the call
add edi, -1
call Fib # eax = Fib(n-2)
pop edi # use edi as a scratch register to hold Fib(n-1) that we saved earlier
add eax, edi # eax = return value = Fib(n-1) + Fib(n-2)
ret
During a recursive call to Fib(n-1) (with n-1 in edi as the first argument), the n-1 arg is also saved on the stack, to be restored later. So each function's stack frame contains the state that needs to survive the recursive call, and a return address. This is exactly what recursion is all about on a machine with a stack.
Jose's example doesn't demonstrate this as well, IMO, because no state needs to survive the call for pow. So it just ends up pushing a return address and args, then popping the args, building up nothing but return addresses, and at the end it follows the chain of return addresses. It could be extended to save local state across each nested call, but as written it doesn't actually illustrate that.
My implementation is a bit different from how gcc compiles the same C function for x86-64 (using the same calling convention of first arg in edi, ret value in eax). gcc6.1 with -O1 keeps it simple and actually does two recursive calls, as you can see on the Godbolt compiler explorer. (-O2 and especially -O3 do some aggressive transformations). gcc saves/restores rbx across the whole function, and keeps n in ebx so it's available after the Fib(n-1) call (and keeps Fib(n-1) in ebx to survive the second call). The System V calling convention specifies rbx as a call-preserved register, but rdi as call-clobbered (and used for arg-passing).
Obviously you can implement Fib(n) much faster non-recursively, with O(n) time complexity and O(1) space complexity, instead of the O(Fib(n)) time and O(n) stack-space complexity of the naive recursion. Fib makes a terrible example for recursion, but it is trivial.
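For comparison, the non-recursive version is just a loop over a sliding pair; a quick Python sketch:

```python
# Iterative Fibonacci: O(n) time and O(1) space, versus the
# exponential time and O(n) stack depth of the naive recursion.
def fib(n):
    a, b = 0, 1            # Fib(0), Fib(1)
    for _ in range(n):
        a, b = b, a + b    # slide the pair one step forward
    return a

print([fib(i) for i in range(10)])  # -> [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```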
Margaret's pastebin, modified slightly to run in my interpreter for this language (it hits an infinite loop, probably due to a transcription error on my part):
set n 3
push n
set initialCallAddress IP
add initialCallAddress 4
push initialCallAddress
jump fact
set finalValue 0
pop finalValue
print finalValue
jump 100
:fact
set rip 0
pop rip
pop n
push rip
set temp n
add n -1
jumpz end n
push n
set link IP
add link 4
push link
jump fact
pop n
mul temp n
:end
pop rip
push temp
jump rip
Successful transcription of Peter's Fibonacci calculator:
String[] x = new String[] {
//n is our input, which term of the sequence we want to calculate
"set n 5",
//temp variable for use throughout the program
"set temp 0",
//call fib
"set temp IP",
"add temp 4",
"push temp",
"jump fib",
//program is finished, prints return value and jumps to end
"print returnValue",
"jump end",
//the fib function, which gets called recursively
":fib",
//if this is the base case (n <= 1), we return n (so f(0) = 0, f(1) = 1) and return from the call
"jumple base n 1",
"jump notBase",
":base",
"set returnValue n",
"pop temp",
"jump temp",
":notBase",
//we want to calculate f(n-1) and f(n-2)
//this is where we calculate f(n-1)
"add n -1",
"push n",
"set temp IP",
"add temp 4",
"push temp",
"jump fib",
//return from the call that calculated f(n-1)
"pop n",
"push returnValue",
//now we calculate f(n-2)
"add n -1",
"set temp IP",
"add temp 4",
"push temp",
"jump fib",
//return from call that calculated f(n-2)
"pop n",
"add returnValue n",
//this is where the fib function ultimately ends and returns to caller
"pop temp",
"jump temp",
//end label
":end"
};
I have a program for AVR where I would like to use a pointer to a function. But why is calling through a function pointer almost 4 times slower than a normal call? And how do I speed it up?
I have:
void simple_call(){ PORTB |= _BV(1); }
void (*simple)() = &simple_call;
Then if I compile with -O3 and call:
simple_call()
it takes 250ns to complete. If I instead call:
simple()
it takes 960ns to complete!!
How can I make it faster?
why is it slower??
You see a 710 ns increase in time. For a 16 MHz clock, that time is 11 ticks.
It is not really fair to say 4X because the time increase is a constant overhead for the function pointer. In your case, the function body is tiny, so the overhead is relatively large. But if you had a case where the function was large and took 1 ms to execute, the time increase would still be 710 ns and you would be asking why does the function pointer take 0.07% longer?
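The arithmetic is easy to check; a sketch using the figures from the text (16 MHz clock, hypothetical 1 ms body):

```python
# Overhead in ticks: 710 ns at a 16 MHz clock.
overhead_ticks = 710e-9 * 16e6
print(round(overhead_ticks))        # -> 11

# The same 710 ns against a hypothetical 1 ms function body.
print(f"{710e-9 / 1e-3:.2%}")       # -> 0.07%
```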
To see why one approach is faster than another, you need to look at the assembler code. Using build tools such as Eclipse allows you to get an assembler listing from the GCC compiler by adding command-line options not available with the Arduino IDE. This is invaluable for figuring out what is going on.
Here is a section of the assembler listing showing what you think is going on:
simple_call();
308: 0e 94 32 01 call 0x264 ; 0x264 <_Z11simple_callv>
simple();
30c: e0 91 0a 02 lds r30, 0x020A
310: f0 91 0b 02 lds r31, 0x020B
314: 19 95 eicall
These listings show the source code and assembler produced by the compiler. To make sense of that and figure out timing, you need the Atmel AVR instruction reference, which contains descriptions of every instruction and the number of clock ticks they take. The simple_call() is perhaps what you expect and takes 4 ticks. The simple() says:
LDS = load address byte - 2 ticks
LDS = load address byte - 2 ticks
EICALL = indirect call to address loaded - 4 ticks
Those both call the function simple_call():
void simple_call(){ PORTB |= _BV(1); }
264: df 93 push r29
266: cf 93 push r28
268: cd b7 in r28, 0x3d ; 61
26a: de b7 in r29, 0x3e ; 62
26c: a5 e2 ldi r26, 0x25 ; 37
26e: b0 e0 ldi r27, 0x00 ; 0
270: e5 e2 ldi r30, 0x25 ; 37
272: f0 e0 ldi r31, 0x00 ; 0
274: 80 81 ld r24, Z
276: 82 60 ori r24, 0x02 ; 2
278: 8c 93 st X, r24
27a: cf 91 pop r28
27c: df 91 pop r29
27e: 08 95 ret
So the function pointer should take just 4 more ticks, small compared to all the instructions in the function body.
Above, I said should and what you think is going on. I lied a bit: the assembler above is for no optimizations.
You used the optimization -O3 which changes everything.
With the optimizations, the function body gets squashed to almost nothing:
void simple_call(){ PORTB |= _BV(1); }
264: 29 9a sbi 0x05, 1 ; 5
266: 08 95 ret
That is 2 + 4 ticks. The compiler gurus have coded the compiler to figure out a much better way to execute the one line of C++. But wait there is more. When you "call" your function the compiler says "why do that? it is just one assembler instruction". The compiler decides your call is pointless and puts the instructions inline:
void simple_call(){ PORTB |= _BV(1); }
2d6: 29 9a sbi 0x05, 1 ; 5
But with the optimizations, the function pointer call remains a call:
simple();
2d8: e0 91 0a 02 lds r30, 0x020A
2dc: f0 91 0b 02 lds r31, 0x020B
2e0: 19 95 eicall
So let's see if the math adds up. With the inline, the "call" is 3 ticks. The indirect call is 8 + 6 = 14. The difference is 11 ticks! (I can add!)
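Tallying the ticks from the listings (the per-instruction counts are the ones quoted above from the AVR instruction reference):

```python
lds, eicall = 2, 4      # two LDS loads, then the indirect call
sbi, ret = 2, 4         # the optimized function body
indirect = lds + lds + eicall + sbi + ret   # 8 + 6 = 14 ticks
inline = 3              # the inlined direct version, as stated above
print(indirect - inline)                    # -> 11 ticks
```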
So that is **why**.
how do I speed it up?
You don't need to: It is only 4 clock ticks more to make a function pointer call. Except for the most trivial functions, it does not matter.
You can't: Even if you try to inline the functions, you still need a conditional branch. A bunch of load, compare, and conditional jumps will take more than the indirect call. In other words, the function pointer is a better method of branching than any conditional.
I'm reading source code in package syscall now, and met some problems:
Since I'm totally a noob at syscalls and assembly, don't hesitate to share anything you know about it :)
First, about func RawSyscall(trap, a1, a2, a3 uintptr) (r1, r2 uintptr, err Errno): what do its parameters trap, a1, a2, a3 and return values r1, r2 mean? I've searched the documentation and this site, but there seems to be little description of them.
Second, since I'm using darwin/amd64 I searched source code and find it here:
http://golang.org/src/pkg/syscall/asm_darwin_amd64.s?h=RawSyscall
It seems to be written in assembly (which I can't understand). Can you explain what happens in lines 61-80, and what's the meaning of the ok1: part at line 76?
I also found some code in http://golang.org/src/pkg/syscall/zsyscall_darwin_amd64.go. What does zsyscall mean in its filename?
What's the difference between syscall & rawsyscall?
How and when should I use them if I want to write my own syscall function? (Yes, the os package gives many choices, but there are still some situations it doesn't cover.)
So many noob questions, thanks for your patience to read and answer :)
I'll share my reduced assembly knowledge with you:
61 TEXT ·RawSyscall(SB),7,$0
62 MOVQ 16(SP), DI
63 MOVQ 24(SP), SI
64 MOVQ 32(SP), DX
65 MOVQ $0, R10
66 MOVQ $0, R8
67 MOVQ $0, R9
68 MOVQ 8(SP), AX // syscall entry
69 ADDQ $0x2000000, AX
70 SYSCALL
71 JCC ok1
72 MOVQ $-1, 40(SP) // r1
73 MOVQ $0, 48(SP) // r2
74 MOVQ AX, 56(SP) // errno
75 RET
76 ok1:
77 MOVQ AX, 40(SP) // r1
78 MOVQ DX, 48(SP) // r2
79 MOVQ $0, 56(SP) // errno
80 RET
81
Line 61 is the routine entry point
Line 76 is a label called ok1
Line 71 is a conditional jump to label ok1.
The short names you see on every line on the left side are called mnemonics and stand for assembly instructions:
MOVQ means Move Quadword (64 bits of data).
ADDQ is Add Quadword.
SYSCALL is kinda obvious
JCC is Jump if Carry Clear (on Darwin the kernel sets the carry flag to signal an error, so this jumps to ok1 on success)
RET is return
On the right side of the mnemonics you'll find each instruction's arguments which are basically constants and registers.
SP is the Stack Pointer
AX is the Accumulator
BX is the Base register
Each register can hold a certain amount of data; on 64-bit CPU architectures I believe it's 64 bits per register.
The only difference between Syscall and RawSyscall is on lines 14, 28 and 34, where Syscall calls runtime·entersyscall(SB) and runtime·exitsyscall(SB) whereas RawSyscall does not. I assume this means that Syscall notifies the runtime that it has entered a blocking syscall operation and can yield CPU time to another goroutine/thread, whereas RawSyscall will just block.
I am trying to write a program that raises a number to a power using ARM-C interworking. I am using an LPC1769 (Cortex-M3) for debugging. The following is the code:
/*here is the main.c file*/
#include<stdio.h>
#include<stdlib.h>
extern int Start (void);
extern int Exponentiatecore(int *m,int *n);
void print(int i);
int Exponentiate(int *m,int *n);
int main()
{
Start();
return 0;
}
int Exponentiate(int *m,int *n)
{
if (*n==0)
return 1;
else
{
int result;
result=Exponentiatecore(m,n);
return (result);
}
}
void print(int i)
{
printf("value=%d\n",i);
}
This is the assembly code that complements the above C code:
.syntax unified
.cpu cortex-m3
.thumb
.align
.global Start
.global Exponentiatecore
.thumb
.thumb_func
Start:
mov r10,lr
ldr r0,=label1
ldr r1,=label2
bl Exponentiate
bl print
mov lr,r10
mov pc,lr
Exponentiatecore: // r0-&m, r1-&n
mov r9,lr
ldr r4,[r0]
ldr r2,[r1]
loop:
mul r4,r4
sub r2,#1
bne loop
mov r0,r4
mov lr,r9
mov pc,lr
label1:
.word 0x02
label2:
.word 0x03
However, during the debug session I encounter a HardFault error on the execution of Exponentiatecore(m,n), as seen in the debug window:
Name : HardFault_Handler
Details:{void (void)} 0x21c <HardFault_Handler>
Default:{void (void)} 0x21c <HardFault_Handler>
Decimal:<error reading variable>
Hex:<error reading variable>
Binary:<error reading variable>
Octal:<error reading variable>
Am I causing some stack corruption through misalignment, or is there a mistake in my interpretation?
Please kindly help. Thank you in advance.
There are several problems with your code. The first is that you have an infinite loop because your SUB instruction is not setting the flags. Change it to SUBS. The next problem is that you're manipulating the LR register unnecessarily. You don't call other functions from Exponentiatecore, so don't touch LR. The last instruction of the function should be "BX LR" to return to the caller. Problem #3 is that your multiply instruction is wrong. Besides taking 3 parameters, if you multiplied the number by itself, it would grow too quickly. For example:
ExponentiateCore(10, 4);
Values through each loop:
R4 = 10, n = 4
R4 = 100, n = 3
R4 = 10000, n = 2
R4 = 100000000, n = 1
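The difference between squaring on every pass and multiplying by the original base is easy to see in a Python sketch (using the same example values, 10 raised to the 4th):

```python
base, n = 10, 4

# What "mul r4,r4" does: square the accumulator each pass.
x, trace = base, [base]
for _ in range(n - 1):
    x *= x
    trace.append(x)
print(trace)        # -> [10, 100, 10000, 100000000]

# What the loop should do: multiply by the original base each pass.
result = 1
for _ in range(n):
    result *= base
print(result)       # -> 10000, i.e. 10**4
```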
Problem #4 is that you're changing a non-volatile register (R4). Unless you save/restore them, you're only allowed to trash R0-R3. Try this instead:
Start:
stmfd sp!,{lr}
ldr r0,=label1
ldr r1,=label2
bl Exponentiatecore // no need to call C again
bl print
ldmfd sp!,{pc}
Exponentiatecore: // r0-&m, r1-&n
ldr r0,[r0]
mov r2,r0
ldr r1,[r1]
cmp r1,#0 // special case for exponent value of 0
moveq r0,#1
moveq pc,lr // early exit
loop:
mul r0,r0,r2 // multiply the original value by itself n times
subs r1,r1,#1
bne loop
bx lr
I just added
Start:
push {r4-r11,lr}
...
pop {r4-r11,pc}
Exponentiatecore: # r0-&m, r1-&n
push {r4-r11,lr}
...
pop {r4-r11,pc}
and cleaned up the bl print in Start, and everything works fine.
I am new to programming in asm. I am trying to create a floppy that prints a message on the boot sector, jumps to sector 35 and prints the date, then jumps back to the boot sector and prints a prompt. I am having trouble (I think) jumping between the sectors... I had everything printing fine when it was all on the boot sector and I haven't changed the actual printing code. What I am getting currently is the first line of the message and then the date and prompt never print. The code is below; I am using NASM:
For the boot sector:
org 0x7c00 ;load to appropriate MBR location
start:
call cls ;call routine to clear screen
call dspmsg ;call routine to display message
mov ah,02h ;read disk sectors into memory
mov al,1 ;number of sectors to read/write (must be nonzero)
mov ch,1 ;cylinder number (0...79)
mov cl,18 ;sector number (1...18)
mov dh,0 ;head number (0...1)
mov dl,0 ;drive number (0...3, 0 for floppy)
mov bx, 0x1000
mov es,bx
mov bx,0x0000
int 13h
call word 0x1000:0x0000
push cs
pop ds
call dspmsg2
jmp $
%macro dsp 3
mov ah,13h ;function 13h (Display String)
mov al,1 ;Write mode is one
mov bh,0 ;Use video page of zero
mov bl,0AH ;Attribute
mov cx,%1 ;Character string length
mov dh,%2 ;position on row
mov dl,0 ;and column 28
push ds ;put ds register on stack
pop es ;pop it into es register
lea bp,%3 ;load the offset address of string into BP
int 10H
%endmacro
cls:
mov ah,06h ;function 06h (Scroll Screen)
mov al,0 ;scroll all lines
mov bh,0AH ;Attribute (light green on black)
mov ch,0 ;Upper left row is zero
mov cl,0 ;Upper left column is zero
mov dh,24 ;Lower left row is 24
mov dl,79 ;Lower left column is 79
int 10H ;BIOS Interrupt 10h (video services)
ret
msg: db 'OS321, made by CHRISTINE MCGINN (c) 2011'
dspmsg:
dsp 40,0,[msg]
ret
msg2: db '$'
dspmsg2:
;Display a message
dsp 1,2,[msg2]
ret
times 510-($-$$) db 0 ;Pad remainder of boot sector with 0s
dw 0xAA55 ;done setting the MBR
Then on sector 35:
org 0x0000
push cs
pop ds
call date
call cvtmo
call cvtday
call cvtcent
call cvtyear
call time
call cvthrs
call cvtmin
call cvtsec
call dsptimedate
retf
%macro dsp 3
mov ah,13h ;function 13h (Display String)
mov al,1 ;Write mode is one
mov bh,0 ;Use video page of zero
mov bl,0AH ;Attribute
mov cx,%1 ;Character string length
mov dh,%2 ;position on row
mov dl,0 ;and column 28
push ds ;put ds register on stack
pop es ;pop it into es register
lea bp,%3 ;load the offset address of string into BP
int 10H
%endmacro
%macro cvt 3
mov bh,%1 ;copy contents of %1 to bh
shr bh,1
shr bh,1
shr bh,1
shr bh,1
add bh,30h ;add 30h to convert to ascii
mov [tmdtfld + %2],bh
mov bh,%1
and bh,0fh
add bh,30h
mov [tmdtfld + %3],bh
%endmacro
date:
;Get date from the system
mov ah,04h ;function 04h (get RTC date)
int 1Ah ;BIOS Interrupt 1Ah (Read Real Time Clock)
ret
;CH - Century
;CL - Year
;DH - Month
;DL - Day
cvtmo:
;Converts the system date from BCD to ASCII
cvt dh,9,10
ret
cvtday:
cvt dl,12,13
ret
cvtcent:
cvt ch,15,16
ret
cvtyear:
cvt cl,17,18
ret
time:
;Get time from the system
mov ah,02h
int 1Ah
ret
;CH - Hours
;CL - Minutes
;DH - Seconds
cvthrs:
;Converts the system time from BCD to ASCII
cvt ch,0,1
ret
cvtmin:
cvt cl,3,4
ret
cvtsec:
cvt dh,6,7
ret
tmdtfld: db '00:00:00 00/00/0000'
dsptimedate:
;Display the system time
dsp 19,1,[tmdtfld]
ret
times 512-($-$$) db 0 ;Pad remainder of sector with 0s
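For reference, the nibble arithmetic the cvt macro performs (shift out the high BCD digit, mask the low one, add 30h to each) looks like this in Python:

```python
# Split a packed-BCD byte (as returned by the RTC) into two ASCII digits.
def bcd_to_ascii(b):
    high = chr((b >> 4) + 0x30)    # the four shr instructions, then add 30h
    low = chr((b & 0x0F) + 0x30)   # and 0fh, then add 30h
    return high, low

print(bcd_to_ascii(0x25))  # -> ('2', '5')
```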
Thank you for any help you can offer!
You've posed the question in a confusing way. You seem to have two problems:
1) Reading arbitrary sectors from the floppy
2) Having programs in each sector that do something (e.g., print a string)
I'd organize my program as a floppy driver ("call floppy_read_sector(x)"), which might use the BIOS to do most of the dirty work (but that's an implementation detail), and a set of separate position-independent code blocks that do the various tasks as subroutines.
Your boot sector code should contain the floppy driver and the high-level logic to read sector(n), call the subroutine in the buffer you read the sector into, and then do the next sector. (You don't have a lot of room, so I don't know if you can squeeze all this into the boot sector. Welcome to assembly language programming, where counting bytes is important.)
Then you have to organize the construction of the floppy disk somehow.
Usually in a world in which one creates bootable floppies, you are allowed to build
much more complicated programs to fill them up. Exercise left to the reader.
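If it helps with the driver, the LBA-to-CHS mapping a "floppy_read_sector(x)" routine needs for int 13h can be sketched as follows (the standard 1.44 MB geometry of 2 heads and 18 sectors per track is assumed, not taken from your code):

```python
HEADS, SECTORS_PER_TRACK = 2, 18   # standard 1.44 MB floppy geometry

def lba_to_chs(lba):
    """Map a 0-based logical sector number to (cylinder, head, sector)."""
    cylinder, rem = divmod(lba, HEADS * SECTORS_PER_TRACK)
    head, sector = divmod(rem, SECTORS_PER_TRACK)
    return cylinder, head, sector + 1   # BIOS int 13h sectors are 1-based

print(lba_to_chs(0))    # -> (0, 0, 1), the boot sector
print(lba_to_chs(34))   # -> (0, 1, 17), the 35th sector on the disk
```

The returned triple goes into CH (cylinder), DH (head), and CL (sector) before the int 13h AH=02h read.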