I came across this term and gathered that it is somehow used on microcontrollers, but I have not found any explanation. What is a compiled stack, what is it used for, and why?
A compiled stack is a technique used in the PIC range of microcontrollers.
From MPLAB XC8 C Compiler User's Guide:
A compiled stack is a static allocation of memory for stack-based objects that can be built up in multiple data banks. See Section 5.5.2.2.1 “Compiled Stack Operation” for information about how objects are allocated to this stack. Objects in the stack are in fixed locations and can be accessed using an identifier (hence it is a static allocation). Thus, there is no stack pointer. The size of the compiled stack is known at compile time, and so available space can be confirmed by the compiler. The compiled stack is allocated to psects that use the basename cstack; for example, cstackCOMMON, cstackBANK0. See Section 5.15.2 “Compiler-Generated Psects” for more information on the naming convention for compiler-generated psects.
By contrast, the software stack has a size that is dynamic and varies as the program is executed. The maximum size of the stack is not exactly known at compile time and the compiler typically reserves as much space as possible for the stack to grow during program execution. The stack is always allocated a single memory range, which may cross bank boundaries, but within this range it may be segregated into one area for main-line code and an area for each interrupt routine, if required. A stack pointer is used to indicate the current position in the stack. This pointer is permanently allocated to FSR1.
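Though not actual XC8 output, a host-runnable C analogue can make the consequence concrete: with a compiled stack, each function's locals behave like statically allocated variables at fixed addresses, so reentering a function clobbers the in-flight call's state. Everything below (names included) is an illustrative sketch, not compiler-generated code.

```c
/* Analogue of a compiled-stack local: the "local" n of count_down
 * lives at one fixed static address instead of in a stack frame. */
static int n_slot;

int count_down(int n) {
    n_slot = n;                 /* every call overwrites the same slot */
    if (n_slot <= 0)
        return 0;
    count_down(n_slot - 1);     /* recursion clobbers n_slot ... */
    return n_slot;              /* ... so this no longer returns n */
}
```

With a genuine software stack this function would return its argument; with the statically allocated "local" it returns 0 for any positive input, which is why compilers using a compiled stack reject or special-case recursion and reentrancy.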
Does Frama-C provide any tools for proving the run-time characteristics of a function such as execution time (possibly as instruction count) and heap memory space (counted as bytes allocated)?
Concerning execution time estimation
Frama-C works at the C level. The Metrics plug-in can provide a few metrics (such as statement count) on a version of the source very close to the original one (-metrics -metrics-ast cabs), or on the normalized source (often referred to as Cil code) that it uses. However, it does not have any knowledge of assembly code, therefore it cannot provide precise information about execution time at this level.
Since compiler optimizations impact code generation, the numbers given by Frama-C may or may not be close to what will be produced by a compiler, depending on which optimizations are enabled, what is known about the compiler and the target architecture, etc. In the general case, Frama-C cannot give any guarantees; in specific situations, it is possible to develop plug-ins to provide some of this information (e.g. the Cost plug-in, mentioned here, uses annotations to try to maintain some correspondence between source and compiled code, and then uses them to provide some execution time information).
Concerning memory size estimation
There is an option, -metrics-locals-size, which does a rough estimation of the stack memory usage by a function. As in the previous case, this is only an estimation based on the source code. Compilers are likely to stack-allocate temporary variables for computing temporary subexpressions, or for register spilling, so the numbers given by Frama-C cannot be used in a worst-case stack estimation.
Dynamic memory allocation is supported in ACSL, so in theory it is possible to write annotations concerning it. However, current plug-ins do not provide a direct way to handle this precisely; it might require writing a new plug-in or, at least, an abstract domain for Eva.
Eva currently handles dynamic allocation, but probably not precisely enough for estimating heap size in an interesting way. It is possible to develop an abstract domain for Eva that would keep track of this information (adding mallocs and subtracting frees) and compute an overapproximation of the heap memory space, but this would require being able to bound the number of iterations of loops containing allocations (otherwise the upper bound would be infinite). Precision would depend on the complexity of the program.
For runtime verification, the E-ACSL plug-in already tracks some stack/heap usage information (even though it is not currently exported to the user), so in theory one could write an assertion similar to //# assert heap_size <= \old(heap_size) + 42;, and have it checked at runtime, when running the instrumented program.
To complement anol's answer, the PathCrawler plug-in (the online version can be used freely, but the plug-in itself is proprietary) has been used to generate sets of test cases covering all paths of C functions. This article explains under which assumptions this can be used as the basis for WCET measurement, but basically the issues are the ones already mentioned by anol: without a precise knowledge of the work done by the compiler and of the underlying hardware, which is not something Frama-C provides natively, things are going to be quite rough.
There has apparently been some recent work taking the same route of using PathCrawler to generate execution traces covering a sufficiently large proportion of the search space, as a bachelor project in Amsterdam.
why do we need to initialize the stack pointer at the beginning of an AVR assembly program
Your assembly program is calling a subroutine. When you do that, the return address is stored on the stack using the stack pointer, so it's important to initialize it to point to an appropriate place in RAM. The ATmega328P datasheet says:
During interrupts and subroutine calls, the return address Program Counter (PC) is stored on the Stack. The Stack is effectively allocated in the general data SRAM, and consequently the Stack size is only limited by the total SRAM size and the usage of the SRAM. All user programs must initialize the SP in the Reset routine (before subroutines or interrupts are executed). The Stack Pointer (SP) is read/write accessible in the I/O space. The data SRAM can easily be accessed through the five different addressing modes supported in the AVR architecture.
Very simple: the answer comes straight from the datasheet (look for Stack Pointer). The stack pointer's initial value is 0x0000, meaning it would point to register R0 (whose address is 0x0000) if not initialized. You would not want that, as you use R0 and the other registers to perform operations. That is why you want to point the stack to some other memory area, specifically the internal SRAM (a general-purpose RAM area).
It depends on the microcontroller you are using. Older AVRs had the stack pointer initialized by hardware to 0x0000. You had to change that to something sensible (most often RAMEND) before using subroutines or interrupts. Newer AVRs have the stack pointer initialized by hardware to RAMEND, so you do not need software initialization.
You will have to check the datasheet to see whether your particular MCU needs that software initialization or not. When in doubt, do it anyway: it doesn't hurt (it takes only 4 CPU cycles) and it can make your code more portable. Also, a bootloader may have altered the stack pointer.
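As a sketch, the initialization that the startup code (or your reset routine) performs looks like the following. SPL/SPH here are plain variables standing in for the real AVR I/O registers, and RAMEND is the ATmega328P value (0x08FF), so treat the details as assumptions for your particular part.

```c
#include <stdint.h>

#define RAMEND 0x08FFu  /* last SRAM address on an ATmega328P */

/* Stand-ins for the real SPL/SPH I/O registers of the AVR. */
volatile uint8_t SPL, SPH;

/* What the reset routine does before any call or interrupt: point
 * the stack pointer at the top of SRAM so pushes grow downward. */
void init_stack_pointer(void) {
    SPH = (uint8_t)(RAMEND >> 8);
    SPL = (uint8_t)(RAMEND & 0xFF);
}
```

On real hardware the same thing is a handful of `ldi`/`out` instructions; in C projects the compiler's startup code normally does it for you.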
I was reading some basic articles about memory manipulation by the processor, and I was confused as to how the processor handles what comes next.
The concept of the call stack is clear, but I was wondering if the expression stack/register stack (used to make the calculations) is the same stack, or even if the stack for the local variables of a subroutine (a function) in a program is the same call stack.
If anyone could explain to me how the processor operates regarding its stack(s), that'd help me a lot.
All the processors I've worked on have just used a single stack for these.
If you think about what the processor is doing, you only need a single stack. During calculations you can use the same stack as the call stack, since by the time the calculation is complete the stack will be 'clean' again. The same goes for local variables: just before you go out of the scope of the local variables, your stack will be clean again, allowing the call to return correctly.
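A quick way to see the single stack at work is to compare the addresses of locals in nested calls; the exact distance is ABI- and optimizer-dependent, so the 4096-byte figure mentioned below is just a generous assumption.

```c
#include <stdint.h>
#include <stdlib.h>

/* Address of a local in the callee's frame. */
static intptr_t inner(void) {
    int x = 0;
    return (intptr_t)&x;
}

/* Distance between a caller's local and a callee's local: on every
 * mainstream ABI both live on the same machine stack, so the gap is
 * small (a few dozen to a few hundred bytes), never a separate region. */
long frame_distance(void) {
    int y = 0;
    intptr_t callee_local = inner();
    return labs((long)((intptr_t)&y - callee_local));
}
```

Calling `frame_distance()` typically yields a value well under a page; locals, temporaries, and return addresses are all interleaved on that one stack.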
You can change the stack by setting the SS:SP segment and pointer registers (just save the current values first).
The procedure call parameters and local variables are placed on the stack, and dynamically created objects are placed on the heap (DS:DI). The SS:SP register pair is shifted by the right number of bytes to reserve the needed memory on a procedure call, and on return SS:SP is set back to its pre-call state.
Exactly what parts of a recursive method call contribute to the stack--say, the returned object, arguments, local variables, etc.?
I'm trying to optimize the levels of recursion that an Android application can do on limited memory before running into a StackOverflowException.
Thanks in advance.
If you run out of stack space, don't optimize your stack usage. Doing that just means the same problem will come back later, with a slightly larger input set or called from somewhere else. And at some point you have reached the theoretical or practical minimum of space you can consume for the problem you're solving. Instead, convert the offending code to use a collection other than the machine stack (e.g. a heap-allocated stack or queue). Doing so sometimes results in very ugly code, but at least it won't crash.
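As an illustrative sketch of that conversion (the tree type and names are hypothetical), here is a depth-first sum that keeps its pending work on a malloc'ed array instead of in recursive frames, so deep structures exhaust the heap gracefully rather than overflowing the machine stack:

```c
#include <stdlib.h>

struct node {
    int value;
    struct node *left, *right;
};

/* Depth-first sum with an explicit heap-allocated stack; max_nodes
 * bounds the worklist so one allocation up front suffices. */
long tree_sum(const struct node *root, size_t max_nodes) {
    const struct node **stack = malloc(max_nodes * sizeof *stack);
    size_t top = 0;
    long sum = 0;

    if (!stack)
        return -1;              /* allocation failure is recoverable */
    if (root)
        stack[top++] = root;
    while (top > 0) {
        const struct node *n = stack[--top];
        sum += n->value;
        if (n->left)
            stack[top++] = n->left;
        if (n->right)
            stack[top++] = n->right;
    }
    free(stack);
    return sum;
}
```

The recursive version would need one machine-stack frame per tree level; this version's memory use is bounded, visible, and fails with a checkable error instead of a crash.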
But to answer the question: Generally all the things you name can take stack space, and temporary values take space too (so nesting expressions like crazy just to save local variables won't help). Some of these will be stored in registers, depending on the calling convention, but may have to be spilled(*) anyway. But regardless of the calling convention, this only saves you a few bytes, and everything will have to be spilled for calls, as the callee is usually given free rein over registers during the call. So at the time your stack overflows, the stack is indeed crowded with the parameters, local variables, and temporaries of earlier calls. Some may be optimized away altogether or share a stack slot if they aren't needed at the same time. Ultimately this is up to the JIT compiler.
(*) Spilling: Moving a value from a register to memory (i.e., the stack) because the register is needed for something else.
Each method has two stack frame sizes associated with it: the stack required for arguments and local variables, and the stack required for expression evaluation. The return value only counts as part of the stack required for expression evaluation. The JVM is able to verify that the method does not exceed these sizes as it executes.
Exactly how much stack is required for variables and expression evaluation is down to the bytecode compiler. For instance it is often able to share local variable slots among variables with non-overlapping lifetimes.
is there a way to implement multitasking using setjmp and longjmp functions
You can indeed. There are a couple of ways to accomplish it. The difficult part is initially getting the jmpbufs which point to other stacks. Longjmp is only defined for jmpbuf arguments which were created by setjmp, so there's no way to do this without either using assembly or exploiting undefined behavior. User-level threads are inherently non-portable anyway, so portability isn't a strong argument against doing it.
Step 1
You need a place to store the contexts of different threads, so make a queue of jmpbuf structures for however many threads you want.
Step 2
You need to malloc a stack for each of these threads.
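Steps 1 and 2 together amount to something like the following sketch (`thread_t`, `STACK_SIZE`, and the 64 KiB figure are assumptions for illustration, not a fixed API):

```c
#include <setjmp.h>
#include <stdlib.h>

#define STACK_SIZE (64 * 1024)   /* per-thread stack; tune to taste */

typedef struct {
    jmp_buf ctx;      /* saved context for this thread */
    char   *stack;    /* heap memory that will serve as its stack */
} thread_t;

/* Allocate one thread record plus the stack it will run on. */
thread_t *thread_alloc(void) {
    thread_t *t = malloc(sizeof *t);
    if (!t)
        return NULL;
    t->stack = malloc(STACK_SIZE);
    if (!t->stack) {
        free(t);
        return NULL;
    }
    return t;
}
```

A real scheduler would keep these records in the queue from Step 1 and free both allocations when a thread dies.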
Step 3
You need to get some jmpbuf contexts which have stack pointers in the memory locations you just allocated. You could inspect the jmpbuf structure on your machine to find out where it stores the stack pointer. Call setjmp and then modify its contents so that the stack pointer is in one of your allocated stacks. Stacks usually grow down, so you probably want your stack pointer somewhere near the highest memory location. If you write a basic C program and use a debugger to disassemble it, and then find the instructions it executes when you return from a function, you can find out what the offset ought to be. For example, with the System V calling convention on x86, you'll see that it pops %ebp (the frame pointer) and then calls ret, which pops the return address off the stack. So on entry into a function, it pushes the return address and frame pointer. Each push moves the stack pointer down by 4 bytes, so you want the stack pointer to start at the high address of the allocated region, minus 8 bytes (as if you just called a function to get there). We will fill the 8 bytes next.
The other thing you can do is write some very small (one line) inline assembly to manipulate the stack pointer, and then call setjmp. This is actually more portable, because in many systems the pointers in a jmpbuf are mangled for security, so you can't easily modify them.
I haven't tried it, but you might be able to avoid the asm by just deliberately overflowing the stack by declaring a very large array and thus moving the stack pointer.
Step 4
You need exiting threads to return the system to some safe state. If you don't do this, and one of the threads returns, it will take the address right above your allocated stack as a return address and jump to some garbage location and likely segfault. So first you need a safe place to return to. Get this by calling setjmp in the main thread and storing the jmpbuf in a globally accessible location. Define a function which takes no arguments and just calls longjmp with the saved global jmpbuf. Get the address of that function and copy it to your allocated stacks where you left room for the return address. You can leave the frame pointer empty. Now, when a thread returns, it will go to that function which calls longjmp, and jump right back into the main thread where you called setjmp, every time.
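The "safe place to return to" can be demonstrated on its own, without any stack switching; everything here is a minimal sketch of Step 4 with hypothetical names.

```c
#include <setjmp.h>

static jmp_buf scheduler_ctx;    /* the globally accessible jmpbuf */

/* Step 4's trampoline: a returning thread lands here instead of
 * falling off the top of its allocated stack. */
static void thread_exit(void) {
    longjmp(scheduler_ctx, 1);   /* never returns normally */
}

/* Returns 1: control comes back via longjmp, not via the call. */
int exit_demo(void) {
    if (setjmp(scheduler_ctx) == 0) {
        thread_exit();           /* simulate a thread's final return */
        return -1;               /* unreachable */
    }
    return 1;                    /* reached only through longjmp */
}
```

In the full scheme, the address of `thread_exit` is what you copy into the return-address slot you reserved at the top of each allocated stack.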
Step 5
Right after the main thread's setjmp, you want to have some code that determines which thread to jump to next, pulling the appropriate jmpbuf off the queue and calling longjmp to go there. When there are no threads left in that queue, the program is done.
Step 6
Write a context switch function which calls setjmp and stores the current state back on the queue, and then longjmp on another jmpbuf from the queue.
Conclusion
That's the basics. As long as threads keep calling context switch, the queue keeps getting repopulated, and different threads run. When a thread returns, if there are any left to run, one is chosen by the main thread, and if none are left, the process terminates. With relatively little code you can have a pretty basic cooperative multitasking setup. There are more things you probably want to do, like implement a cleanup function to free the stack of a dead thread, etc. You can also implement preemption using signals, but that is much more difficult because setjmp doesn't save the floating point register state or the flags registers, which are necessary when the program is interrupted asynchronously.
It may be bending the rules a little, but GNU pth does this. It's possible, but you probably shouldn't try it yourself except as an academic proof-of-concept exercise; use the pth implementation if you want to do it seriously and in a remotely portable fashion -- you'll understand why when you read the pth thread creation code.
(Essentially it uses a signal handler to trick the OS into creating a fresh stack, then longjmp's out of there and keeps the stack around. It works, evidently, but it's sketchy as hell.)
In production code, if your OS supports makecontext/swapcontext, use those instead. If it supports CreateFiber/SwitchToFiber, use those instead. And be aware of the disappointing truth that one of the most compelling uses of coroutines -- that is, inverting control by yielding out of event handlers called by foreign code -- is unsafe, because the calling module has to be reentrant, and you generally can't prove that. This is why fibers still aren't supported in .NET...
This is a form of what is known as userspace context switching.
It's possible but error-prone, especially if you use the default implementation of setjmp and longjmp. One problem with these functions is that in many operating systems they'll only save a subset of the 64-bit registers, rather than the entire context. This is often not enough, e.g. when dealing with system libraries (my experience here is with a custom implementation for amd64/windows, which worked pretty stably, all things considered).
That said, if you're not trying to work with complex external codebases or event handlers, and you know what you're doing, and (especially) if you write your own version in assembler that saves more of the current context (if you're using 32-bit windows or linux this might not be necessary, if you use some versions of BSD I imagine it almost definitely is), and you debug it paying careful attention to the disassembly output, then you may be able to achieve what you want.
I did something like this for my studies:
https://github.com/Kraego/STM32L476_MiniOS/blob/main/Usercode/Concurrency/scheduler.c
The context/thread switching is done by setjmp/longjmp. The difficult part was to get the allocated stack correct (see allocateStack()); this depends on your platform.
This is just a demonstration how this could work, I would never use this in production.
As was already mentioned by Sean Ogden, longjmp() is not good for multitasking, as it can only move the stack upward and can't jump between different stacks. No go with that.

As mentioned by user414736, you can use the getcontext/makecontext/swapcontext functions, but the problem with those is that they are not fully in user space. They actually call the sigprocmask() syscall because they switch the signal mask as part of the context switching. This makes swapcontext() much slower than longjmp(), and you likely don't want slow co-routines.

To my knowledge there is no POSIX-standard solution to this problem, so I compiled my own from different available sources. You can find the context-manipulating functions extracted from libtask here:
https://github.com/dosemu2/dosemu2/tree/devel/src/base/lib/mcontext

The functions are getmcontext(), setmcontext(), makemcontext() and swapmcontext(). They have similar semantics to the standard functions with similar names, but they also mimic the setjmp() semantics in that getmcontext() returns 1 (instead of 0) when jumped to by setmcontext().

On top of that you can use a port of libpcl, the coroutine library:
https://github.com/dosemu2/dosemu2/tree/devel/src/base/lib/libpcl

With this, it is possible to implement fast cooperative user-space threading. It works on Linux, on the i386 and x86_64 arches.