Compiling an Assembly Program using avr - pointers

Why do we need to initialize the stack pointer at the beginning of an AVR assembly program?

Your assembly program is calling a subroutine. When you do that, the return address is stored on the stack using the stack pointer, so it's important to initialize it to point to an appropriate place in RAM. The ATmega328P datasheet says:
During interrupts and subroutine calls, the return address Program Counter (PC) is stored on the Stack. The Stack is effectively allocated in the general data SRAM, and consequently the Stack size is only limited by the total SRAM size and the usage of the SRAM. All user programs must initialize the SP in the Reset routine (before subroutines or interrupts are executed). The Stack Pointer (SP) is read/write accessible in the I/O space. The data SRAM can easily be accessed through the five different addressing modes supported in the AVR architecture.

Very simple, the answer comes straight from the datasheet: look for the Stack Pointer section. On older parts the stack pointer's initial value is 0x0000, meaning it would point to register R0 (whose address is 0x0000) if not initialized. You would not want that, as you use R0 and the other registers to perform operations. That is why you want to point the stack to some other memory area, specifically to the internal SRAM (a general-purpose RAM area).

It depends on the microcontroller you are using. Older AVRs had the stack pointer initialized by hardware to 0x0000. You had to change that to something sensible (most often RAMEND) before using subroutines or interrupts. Newer AVRs have the stack pointer initialized by hardware to RAMEND, so you do not need software initialization.
You will have to check the datasheet to see whether your particular MCU needs that software initialization or not. When in doubt, do it anyway: it doesn't hurt (it takes only 4 CPU cycles) and it can make your code more portable. Also, a bootloader may have altered the stack pointer.
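As a minimal sketch of what that initialization amounts to (assuming an ATmega328P and the avr-gcc/avr-libc toolchain, whose <avr/io.h> header provides the SP and RAMEND macros), the C equivalent looks like the following; note that avr-gcc's own startup code already does this before main() runs, so the explicit assignment is only illustrative:

#include <avr/io.h>

int main(void)
{
    /* Point the stack pointer at the top of internal SRAM; the stack grows
       downward from there. In hand-written assembly this is the usual
       ldi/out sequence loading SPL/SPH with lo8(RAMEND)/hi8(RAMEND). */
    SP = RAMEND;

    for (;;) {
        /* main loop; subroutine calls and interrupts are now safe to use */
    }
}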

Related

int32_t for stack in a 32-bit microcontroller

In the example below, why should we use int32_t instead of uint32_t? (The platform is a 32-bit ARM microcontroller.)
struct tcb {
    int32_t *stackPt;
    struct tcb *nextPt;
};
It's part of an RTOS tutorial, and tcb stands for thread control block.
Why should we use int32_t* for the stack?
There is no particular reason that you should use a pointer to signed rather than unsigned.
You are probably never going to dereference this pointer directly to access the words on the stack. If you do want to access the data on the stack, some of it will be signed, some unsigned, and some neither (strings etc) so the pointer type will not help you with that.
When you want to pass around a pointer but never dereference it then one convention is to use a pointer to void, but that convention isn't so popular in embedded system code.
One reason to use a pointer to a 32-bit integer is to suggest that the pointer is at least word-aligned. If you intend to comply with the ARM EABI (which you should), then the stack should be doubleword (64-bit) aligned at the entry to every EABI-compliant function. To hint that that is the case, you might even want to use a (u)int64_t pointer. This might be misleading, though, because not everything on the stack is 64-bit or 32-bit aligned, just the whole frames.
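To make the alignment point concrete, here is a minimal sketch (the osStack array, the OS_STACK_WORDS size and the init_tcb helper are made-up names for illustration) of the tcb from the question paired with stack storage forced to the doubleword alignment the EABI expects, using a gcc-style attribute:

#include <stdint.h>

#define NUM_THREADS    3
#define OS_STACK_WORDS 128                       /* 128 x 4 bytes per thread  */

struct tcb {
    int32_t    *stackPt;                         /* saved stack pointer       */
    struct tcb *nextPt;                          /* next TCB in the run queue */
};

/* Doubleword (64-bit) aligned stacks, one per thread; stacks grow downward. */
static int32_t osStack[NUM_THREADS][OS_STACK_WORDS] __attribute__((aligned(8)));

static struct tcb tcbs[NUM_THREADS];

static void init_tcb(int i)
{
    /* Start just past the top of the region; a real RTOS would also pre-fill
       a fake exception/return frame here before the first context switch. */
    tcbs[i].stackPt = &osStack[i][OS_STACK_WORDS];
    tcbs[i].nextPt  = &tcbs[(i + 1) % NUM_THREADS];
}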

What is a compiled stack?

I came across said terminology and already picked up that it's somehow used on microcontrollers, but have not found any explanation. What is a compiled stack, what is it used for and why?
A compiled stack is a technique used by compilers for the PIC range of microcontrollers.
From MPLAB XC8 C Compiler User's Guide:
A compiled stack is a static allocation of memory for stack-based objects that can be built up in multiple data banks. See Section 5.5.2.2.1 “Compiled Stack Operation” for information about how objects are allocated to this stack. Objects in the stack are in fixed locations and can be accessed using an identifier (hence it is a static allocation). Thus, there is no stack pointer. The size of the compiled stack is known at compile time, and so available space can be confirmed by the compiler. The compiled stack is allocated to psects that use the basename cstack; for example, cstackCOMMON, cstackBANK0. See Section 5.15.2 “Compiler-Generated Psects” for more information on the naming convention for compiler-generated psects.
By contrast, the software stack has a size that is dynamic and varies as the program is executed. The maximum size of the stack is not exactly known at compile time and the compiler typically reserves as much space as possible for the stack to grow during program execution. The stack is always allocated a single memory range, which may cross bank boundaries, but within this range it may be segregated into one area for main-line code and an area for each interrupt routine, if required. A stack pointer is used to indicate the current position in the stack. This pointer is permanently allocated to FSR1.
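A short way to picture the difference (illustrative only, not taken from the XC8 manual): with a compiled stack, the compiler effectively turns each function's parameters and locals into statically placed variables with fixed addresses, which is also why such functions are not reentrant unless the compiler duplicates them.

/* With a software stack, x and tmp would live in a frame created at call
   time and addressed through a stack pointer. With a compiled stack, the
   compiler instead reserves fixed RAM locations for them at build time,
   roughly as if they had been declared static, so no stack pointer is
   needed -- at the cost of reentrancy. */
int scale(int x)
{
    int tmp = x * 2;
    return tmp + 1;
}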

Does the processor use more than one stack to separate the call stack from the expression/register stack?

I was reading some basic articles about memory manipulation by the processor, and I was confused as to how the processor handles what comes next.
The concept of the call stack is clear, but I was wondering whether the expression stack/register stack (used to do the calculations) is the same stack, or even whether the stack for the local variables of a subroutine (a function) in a program is the same call stack.
If anyone could explain to me how the processor operates regarding its stack(s), that'd help me a lot.
All the processors I've worked on have just used a single stack for these.
If you think about what the processor is doing, you only need a single stack. During calculations you can use the same stack as the call stack, because once the calculation is complete the stack is 'clean' again. The same goes for local variables: just before you go out of their scope, the stack is cleaned up, allowing the call to return correctly.
You can change the stack by simply setting the SS:SP segment and pointer registers (just save the current values first).
Procedure call parameters and local variables live on the stack, while dynamically created objects live on the heap (DS:DI). On a procedure call, the SS:SP register pair is shifted down by the right number of bytes to reserve the needed memory, and on return SS:SP is set back to its pre-call state.
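As a tiny illustration of the single-stack idea (a sketch; exactly what ends up in registers versus on the stack depends on the compiler and calling convention):

int square_plus(int x, int y)      /* parameters arrive in registers or on the stack */
{
    int t = x * x;                 /* locals and temporaries live in this call's     */
    return t + y;                  /* frame on the same single stack                 */
}

int caller(void)
{
    /* The call pushes a return address onto that same stack; once square_plus
       returns, its whole frame is discarded and the stack is "clean" again,
       so the caller's own locals and return address are undisturbed. */
    return square_plus(3, 4);
}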

x86 Instruction Encoding C++ Pointer as Address

I'm trying to encode x86 instructions to take an address specified by a C++ pointer. For example, I have a 32-bit number that I would like to move into the eax register. In assembly, this is easy.
mov eax, number
I would like to encode in binary the x86 instruction to do this using a C++ pointer to the number. The opcode for mov (memory to 32-bit register) is 0x8B. I am not sure what to use for the Mod-Reg-R/M byte and the displacement. Is there a way to encode the address from the pointer directly, or do I have to do some displacement math? Also, I don't really understand how displacement works.
The instructions will be stored and called in dynamically allocated memory. This is for a dynamic recompiler for an emulator. I have tried all sorts of combinations with Mod-Reg-R/M and the pointer, with no luck. I also haven't been able to find anything online explaining how this works.
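For 32-bit mode, one common way to do this is opcode 0x8B followed by a ModRM byte of 0x05 (mod = 00, reg = 000 for eax, r/m = 101, which selects a plain 32-bit absolute displacement), followed by the 4-byte little-endian address taken from the pointer; the shorter moffs form 0xA1 + address does the same thing. Note that in 64-bit mode, mod = 00 with r/m = 101 means RIP-relative instead, so the encoding there is different. A rough sketch (the emit_mov_eax_abs helper is hypothetical):

#include <stdint.h>
#include <string.h>

/* Emit "mov eax, [number]" into buf for 32-bit x86 and return its length.
   The displacement is simply the absolute address of the C variable. */
static size_t emit_mov_eax_abs(uint8_t *buf, const uint32_t *number)
{
    uint32_t addr = (uint32_t)(uintptr_t)number; /* absolute address of the operand */
    buf[0] = 0x8B;                               /* mov r32, r/m32                  */
    buf[1] = 0x05;                               /* ModRM: eax <- [disp32]          */
    memcpy(&buf[2], &addr, 4);                   /* little-endian displacement      */
    return 6;
}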

Multitasking using setjmp, longjmp

Is there a way to implement multitasking using the setjmp and longjmp functions?
You can indeed. There are a couple of ways to accomplish it. The difficult part is initially getting the jmp_bufs to point to other stacks. longjmp is only defined for jmp_buf arguments that were created by setjmp, so there's no way to do this without either using assembly or exploiting undefined behavior. User-level threads are inherently non-portable, so portability isn't really a strong argument against doing it.
Step 1
You need a place to store the contexts of the different threads, so make a queue of jmp_buf structures for however many threads you want.
Step 2
You need to malloc a stack for each of these threads.
Step 3
You need to get some jmp_buf contexts which have stack pointers in the memory locations you just allocated. You could inspect the jmp_buf structure on your machine and find out where it stores the stack pointer. Call setjmp and then modify its contents so that the stack pointer is in one of your allocated stacks. Stacks usually grow down, so you probably want your stack pointer somewhere near the highest memory location. If you write a basic C program and use a debugger to disassemble it, and then find the instructions it executes when you return from a function, you can find out what the offset ought to be. For example, with System V calling conventions on x86, you'll see that it pops %ebp (the frame pointer) and then calls ret, which pops the return address off the stack. So on entry into a function, it pushes the return address and frame pointer. Each push moves the stack pointer down by 4 bytes, so you want the stack pointer to start at the high address of the allocated region minus 8 bytes (as if you just called a function to get there). We will fill those 8 bytes next.
The other thing you can do is write some very small (one line) inline assembly to manipulate the stack pointer and then call setjmp. This is actually more portable, because on many systems the pointers in a jmp_buf are mangled for security, so you can't easily modify them.
I haven't tried it, but you might be able to avoid the asm by just deliberately overflowing the stack by declaring a very large array and thus moving the stack pointer.
Step 4
You need exiting threads to return the system to some safe state. If you don't do this, and one of the threads returns, it will take the address right above your allocated stack as a return address and jump to some garbage location and likely segfault. So first you need a safe place to return to. Get this by calling setjmp in the main thread and storing the jmpbuf in a globally accessible location. Define a function which takes no arguments and just calls longjmp with the saved global jmpbuf. Get the address of that function and copy it to your allocated stacks where you left room for the return address. You can leave the frame pointer empty. Now, when a thread returns, it will go to that function which calls longjmp, and jump right back into the main thread where you called setjmp, every time.
Step 5
Right after the main thread's setjmp, you want to have some code that determines which thread to jump to next, pulling the appropriate jmpbuf off the queue and calling longjmp to go there. When there are no threads left in that queue, the program is done.
Step 6
Write a context switch function which calls setjmp and stores the current state back on the queue, and then longjmp on another jmpbuf from the queue.
Conclusion
That's the basics. As long as threads keep calling context switch, the queue keeps getting repopulated, and different threads run. When a thread returns, if there are any left to run, one is chosen by the main thread, and if none are left, the process terminates. With relatively little code you can have a pretty basic cooperative multitasking setup. There are more things you probably want to do, like implement a cleanup function to free the stack of a dead thread, etc. You can also implement preemption using signals, but that is much more difficult because setjmp doesn't save the floating point register state or the flags registers, which are necessary when the program is interrupted asynchronously.
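To tie the steps together, here is one rough and deliberately non-portable sketch of the whole scheme, assuming x86-64 System V, gcc or clang, and a build without _FORTIFY_SOURCE (for example plain gcc -O0), since glibc's fortified longjmp refuses jumps between different stacks. It simplifies Steps 3 and 4 by entering each thread directly from the scheduler the first time (so no jmp_buf has to be forged by hand) and only uses setjmp/longjmp once a thread is already running on its own malloc'd stack; all of the names are made up, and the one-line rsp switch relies on the compiler not touching the old stack afterwards, which is exactly the kind of fragility discussed above.

#include <setjmp.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_THREADS 4
#define STACK_SIZE  (64 * 1024)

/* Step 1: storage for the saved contexts of the scheduler and each thread. */
static jmp_buf sched_ctx;
static jmp_buf thread_ctx[MAX_THREADS];

static void (*thread_fn[MAX_THREADS])(void);
static void  *thread_stack[MAX_THREADS];
static int    thread_started[MAX_THREADS];
static int    thread_live[MAX_THREADS];
static int    nthreads;
static int    current = -1;

/* Step 2: give each thread its own malloc'd stack. */
static void spawn(void (*fn)(void))
{
    thread_fn[nthreads]    = fn;
    thread_stack[nthreads] = malloc(STACK_SIZE);
    thread_live[nthreads]  = 1;
    nthreads++;
}

/* Step 6: cooperative context switch -- save where we are, hop to the scheduler. */
static void yield(void)
{
    if (setjmp(thread_ctx[current]) == 0)
        longjmp(sched_ctx, 1);
    /* a longjmp from the scheduler resumes execution right here next time */
}

/* Steps 3 and 4, simplified: the first time a thread runs, switch rsp onto its
   own stack with one line of inline asm and call the thread body directly; when
   the body returns, mark the thread dead and jump back to the scheduler, so we
   never "return" into a stale frame. */
static void (*boot_fn)(void);
static void  *boot_sp;

__attribute__((noinline)) static void start_thread(void)
{
    __asm__ volatile ("mov %0, %%rsp" : : "r"(boot_sp) : "memory");
    boot_fn();                       /* runs entirely on the thread's own stack */
    thread_live[current] = 0;        /* Step 4: the thread has finished         */
    longjmp(sched_ctx, 1);
}

/* Step 5: a round-robin scheduler running on the original (main) stack. */
static void run(void)
{
    for (;;) {
        int any = 0;
        for (int i = 0; i < nthreads; i++) {
            if (!thread_live[i])
                continue;
            any = 1;
            current = i;
            if (setjmp(sched_ctx) == 0) {
                if (!thread_started[i]) {
                    thread_started[i] = 1;
                    boot_fn = thread_fn[i];
                    /* stacks grow down: start near the top, 16-byte aligned */
                    boot_sp = (char *)thread_stack[i] + STACK_SIZE - 64;
                    start_thread();
                } else {
                    longjmp(thread_ctx[i], 1);   /* resume where it yielded */
                }
            }
            /* every yield or thread exit lands back here */
        }
        if (!any)
            break;                               /* all threads have finished */
    }
    for (int i = 0; i < nthreads; i++)           /* cleanup, as noted above */
        free(thread_stack[i]);
}

static void worker_a(void) { for (int n = 0; n < 3; n++) { printf("A%d\n", n); yield(); } }
static void worker_b(void) { for (int n = 0; n < 3; n++) { printf("B%d\n", n); yield(); } }

int main(void)
{
    spawn(worker_a);
    spawn(worker_b);
    run();
    puts("all threads done");
    return 0;
}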
It may be bending the rules a little, but GNU pth does this. It's possible, but you probably shouldn't try it yourself except as an academic proof-of-concept exercise; use the pth implementation if you want to do it seriously and in a remotely portable fashion -- you'll understand why when you read the pth thread-creation code.
(Essentially it uses a signal handler to trick the OS into creating a fresh stack, then longjmp's out of there and keeps the stack around. It works, evidently, but it's sketchy as hell.)
In production code, if your OS supports makecontext/swapcontext, use those instead. If it supports CreateFiber/SwitchToFiber, use those instead. And be aware of the disappointing truth that one of the most compelling uses of coroutines -- that is, inverting control by yielding out of event handlers called by foreign code -- is unsafe, because the calling module has to be reentrant, and you generally can't prove that. This is why fibers still aren't supported in .NET...
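For reference, a minimal sketch of the makecontext/swapcontext route (POSIX ucontext, now obsolescent but still shipped by glibc and the BSDs; the co_body name and the yield pattern are just for illustration):

#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)

static ucontext_t main_ctx, co_ctx;

/* Coroutine body: runs on its own stack and yields back to main once. */
static void co_body(void)
{
    puts("coroutine: first run");
    swapcontext(&co_ctx, &main_ctx);    /* yield back to main */
    puts("coroutine: resumed");
    /* returning here continues in main_ctx because of uc_link below */
}

int main(void)
{
    getcontext(&co_ctx);                           /* start from a valid context  */
    co_ctx.uc_stack.ss_sp   = malloc(STACK_SIZE);  /* the coroutine's own stack   */
    co_ctx.uc_stack.ss_size = STACK_SIZE;
    co_ctx.uc_link          = &main_ctx;           /* where to go when it returns */
    makecontext(&co_ctx, co_body, 0);

    puts("main: starting coroutine");
    swapcontext(&main_ctx, &co_ctx);               /* run it until it yields */
    puts("main: coroutine yielded, resuming it");
    swapcontext(&main_ctx, &co_ctx);               /* run it to completion   */
    puts("main: done");

    free(co_ctx.uc_stack.ss_sp);
    return 0;
}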
This is a form of what is known as userspace context switching.
It's possible but error-prone, especially if you use the default implementations of setjmp and longjmp. One problem with these functions is that on many operating systems they only save a subset of the 64-bit register state, rather than the entire context. This is often not enough, e.g. when dealing with system libraries (my experience here is with a custom implementation for amd64/windows, which worked pretty stably, all things considered).
That said, if you're not trying to work with complex external codebases or event handlers, if you know what you're doing, and (especially) if you write your own version in assembler that saves more of the current context (if you're using 32-bit Windows or Linux this might not be necessary; on some versions of BSD I imagine it almost definitely is), and you debug it paying careful attention to the disassembly output, then you may be able to achieve what you want.
I did something like this for my studies:
https://github.com/Kraego/STM32L476_MiniOS/blob/main/Usercode/Concurrency/scheduler.c
The context/thread switching is done by setjmp/longjmp. The difficult part was getting the allocated stack correct (see allocateStack()); this depends on your platform.
This is just a demonstration how this could work, I would never use this in production.
As was already mentioned by Sean Ogden, longjmp() is not good for multitasking, as it can only move the stack upward and can't jump between different stacks. No go with that.
As mentioned by user414736, you can use the getcontext/makecontext/swapcontext functions, but the problem with those is that they are not fully in user space. They actually call the sigprocmask() syscall because they switch the signal mask as part of the context switching. This makes swapcontext() much slower than longjmp(), and you likely don't want slow co-routines.
To my knowledge there is no POSIX-standard solution to this problem, so I compiled my own from different available sources. You can find the context-manipulating functions extracted from libtask here:
https://github.com/dosemu2/dosemu2/tree/devel/src/base/lib/mcontext
The functions are getmcontext(), setmcontext(), makemcontext() and swapmcontext(). They have semantics similar to the standard functions with similar names, but they also mimic the setjmp() semantics in that getmcontext() returns 1 (instead of 0) when jumped to by setmcontext().
On top of that you can use a port of libpcl, the coroutine library:
https://github.com/dosemu2/dosemu2/tree/devel/src/base/lib/libpcl
With this, it is possible to implement fast cooperative user-space threading. It works on Linux, on the i386 and x86_64 architectures.
