Understanding STM8 pipelining - opcode

I’m trying to understand STM8 pipelining to be able to predict how many cycles my code will need.
I have this example, where I toggle a GPIO pin so that it is high for 4 cycles and low for 4 cycles.
If, and only if, the loop is aligned at a 4-byte boundary + 3, the pin stays active for 5 cycles (i.e. one more than it should). I wonder why?
// Switches port D2, 5 cycles high, 4 cycles low
void main(void)
{
    __asm
        bset 0x5011, #2 ; output mode
        bset 0x5012, #2 ; push-pull
        bset 0x5013, #2 ; fast switching
        jra _loop
        .bndry 4
        nop
        nop
        nop
    _loop:
        nop
        bset 0x500f, #2
        nop
        nop
        nop
        bres 0x500f, #2
        jra _loop
    __endasm;
}
A bit more context:
bset/bres are 4 byte instructions, nop 1 byte.
The nop/bset/bres instructions take 1 cycle each.
The jra instruction takes two cycles. I think in the first cycle, the instruction cache is filled with the next 32bit value, i.e. in this case the nop instruction only. And the 2nd cycle is actually just the CPU being stalled while decoding the next instruction.
So in cycles:
1. bres clears the pin
2. jra, pipeline flush, nop fetch
3. nop decode, bset fetch
4. nop execute, bset decode, next nop fetch
5. bset execute sets the pin
6. nop, bres fetch
7. nop
8. nop, bres decode
9. bres execute clears the pin
According to this, the pin should stay LOW for 4 cycles and HIGH for 4 cycles, but it’s staying HIGH for 5 cycles.
In any other alignment case, the pin is LOW/HIGH for 4 cycles as expected.
I think that if the pin stays high for an extra cycle, the execution pipeline must be stalled after the bset instruction (the nops thereafter provide enough time to make sure that the later bres is ready to execute immediately). But according to my understanding, the nop (for 6.) would already have been fetched in 4.
Any idea how this behavior can be explained? I couldn’t find any hints in the manual.

This is explained in section 5.4 of the programming manual, which basically says that "a simplified convention providing a good match with reality" is used throughout. In my experience, this simplified convention is indeed a good approximation for longer sequences, but it is unusable for exact per-instruction timing, even if you work at assembly level and control alignment. Take "SLA addr" as an example. It is documented as taking 1 cycle. Put three of them in sequence to implement the C equivalent of "*(addr) <<= 3", and you'll clock up 5-6 cycles.
The actual cycles used for decoding and execution are undocumented. Apart from the obvious causes, there is no comprehensive documentation of what causes pipeline stalls. I was able to get some insight into this by configuring TIM2 with a prescaler of /1 and a reload value of 0xFFFF while using an ST-LINK/V2 to step through my code. You can then keep a watch on TIM2_CNTRL to see the cycles consumed (== the aggregate of executing the previous and decoding the current instruction).
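The same free-running counter can also be read from code instead of from the debugger. The following is only a sketch under assumptions: the helper names are made up for illustration, the register addresses are deliberately left to the caller because they differ between STM8 devices, and the high-byte-first read order is how I understand the STM8 16-bit timers to work (check the reference manual for your part). The subtraction relies on unsigned 16-bit wraparound, so it stays correct across one overflow of the counter.

```c
#include <stdint.h>

/* Read a 16-bit timer counter exposed as two 8-bit registers.
 * Assumption: the high byte must be read first so that the low byte
 * is latched consistently (verify against your device's manual). */
static inline uint16_t timer_read16(const volatile uint8_t *cntr_high,
                                    const volatile uint8_t *cntr_low)
{
    uint16_t high = *cntr_high;   /* read high byte first */
    uint16_t low  = *cntr_low;    /* then the low byte */
    return (uint16_t)((high << 8) | low);
}

/* Cycles elapsed between two readings of a free-running up-counter with
 * reload value 0xFFFF and prescaler /1. Unsigned wraparound makes this
 * correct as long as the counter overflowed at most once in between. */
static inline uint16_t cycles_between(uint16_t before, uint16_t after)
{
    return (uint16_t)(after - before);
}
```

The result includes the cost of the two counter reads themselves, so measure an empty region first and subtract that overhead.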
Things to keep an eye on are obviously instructions spanning 32-bit boundaries. There were also cases where loading instructions from the next 32-bit word caused an unexpected additional cycle in a sequence of NOPs, suggesting that any fetch (even if not necessary for the current or next instruction) costs 1 cycle? I've seen CALLs to targets aligned to 32-bit boundaries taking 4-7 cycles, suggesting that the CPU was still busy executing the previous instruction or was stalling the call for an unknown reason. Modifying the SP (push/pop or direct add/sub) seems to cause stalls under certain conditions.
Any additional insight appreciated!

Related

Why is this jump instruction so expensive when performing pointer chasing?

I have a program that performs pointer chasing and I'm trying to optimize the pointer chasing loop as much as possible.
I noticed that perf record detects that ~20% of execution time in function myFunction() is spent executing the jump instruction (used to exit out of the loop after a specific value has been read).
Some things to take note:
the pointer chasing path can comfortably fit in the L1 data cache
using __builtin_expect to avoid the cost of branch misprediction had no noticeable effect
perf record has the following output:
Samples: 153K of event 'cycles', 10000 Hz, Event count (approx.): 35559166926
myFunction /tmp/foobar [Percent: local hits]
Percent│ endbr64
...
80.09 │20: mov (%rdx,%rbx,1),%ebx
0.07 │ add $0x1,%rax
│ cmp $0xffffffff,%ebx
19.84 │ ↑ jne 20
...
I would expect that most of the cycles spent in this loop are used for reading the value from memory, which is confirmed by perf.
I would also expect the remaining cycles to be somewhat evenly spent executing the remaining instructions in the loop. Instead, perf is reporting that a large chunk of the remaining cycles are spent executing the jump.
I suspect that I can better understand these costs by understanding the micro-ops used to execute these instructions, but I'm a bit lost on where to start.
Remember that the cycles event has to pick an instruction to blame, even if both mov-load and the macro-fused cmp-and-branch uops are waiting for the result. It's not a matter of one or the other "costing cycles" while it's running; they're both waiting in parallel. (See "Modern Microprocessors: A 90-Minute Guide!" and https://agner.org/optimize/)
But when the "cycles" event counter overflows, it has to pick one specific instruction to "blame", since you're using statistical-sampling. This is where an inaccurate picture of reality has to be invented by a CPU that has hundreds of uops in flight. Often it's the one waiting for a slow input that gets blamed, I think because it's often the oldest in the ROB or RS and blocking allocation of new uops by the front-end.
The details of exactly which instruction gets picked might tell us something about the internals of the CPU, but only very indirectly. Like perhaps something to do with how it retires groups of 4(?) uops, and this loop has 3, so which uop is oldest when the perf event exception is taken.
The 4:1 split is probably significant for some reason, perhaps because 4+1 = 5 cycle latency of a load with a non-simple addressing mode. (I assume this is an Intel Sandybridge-family CPU, perhaps Skylake-derived?) Like maybe if data arrives from cache on the same cycle as the perf event overflows (and chooses to sample), the mov doesn't get the blame because it can actually execute and get out of the way?
IIRC, BeeOnRope or someone else found experimentally that Skylake CPUs would tend to let the oldest un-retired instruction retire after an exception arrives, at least if it's not a cache miss. In your case, that would be the cmp/jne at the bottom of the loop, which in program order appears before the load at the top of the next iteration.
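To make the dependency chain concrete, here is a hypothetical reconstruction of a loop that could compile to the four instructions shown in the question. The question does not include the source, so the names (chase, CHASE_END, chase_demo) and the byte-offset layout are assumptions inferred from the `mov (%rdx,%rbx,1),%ebx` addressing mode. The point is that each load address depends on the previous load result, and the compare-and-branch depends on that same result, so both wait on the same data.

```c
#include <stdint.h>
#include <string.h>

#define CHASE_END 0xFFFFFFFFu   /* sentinel, matches cmp $0xffffffff,%ebx */

/* Follow a chain of 32-bit byte offsets stored in the buffer at `base`,
 * starting at `offset`, until the sentinel is read. Returns the number
 * of loads performed (the add $0x1,%rax in the disassembly). */
static inline uint64_t chase(const unsigned char *base, uint32_t offset)
{
    uint64_t steps = 0;
    do {
        /* The next address depends on this load's result. */
        memcpy(&offset, base + offset, sizeof offset);
        steps++;
    } while (offset != CHASE_END);   /* branch waits on the same load */
    return steps;
}

/* Tiny usage example: offset 0 -> 8 -> 4 -> end of chain. */
static inline uint64_t chase_demo(void)
{
    const uint32_t table[3] = { 8, CHASE_END, 4 };
    return chase((const unsigned char *)table, 0);
}
```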

Counting cycles on Cortex M0+

I have a Cortex M0+ (SAML21) board that I'm using for performance testing. I'd like to measure how many cycles a given piece of code takes. I tried using DWT (DWT_CONTROL), but it never produced a result; it returned 0 cycles regardless of what code ran.
// enable the use of DWT
*DEMCR = *DEMCR | 0x01000000;
// Reset cycle counter
*DWT_CYCCNT = 0;
// enable cycle counter
*DWT_CONTROL = *DWT_CONTROL | 1 ;
// some code here
// .....
// number of cycles stored in count variable
count = *DWT_CYCCNT;
Is there a way to count cycles (perhaps with an interrupt and counter?) much like I can query for milliseconds (eg. millis() on Arduino)?
I cannot find any mention of the cycle counter register in the ARMv6-M Architecture Reference Manual.
So I'd say, this is not possible with an internal counter like it is in the bigger siblings like the M3, M4 and so on.
This is also stated in this knowledge base article:
This article was written for Cortex-M3 and Cortex-M4, but the same points apply to Cortex-M7, Cortex-M33 and Cortex-M55. Newer Cortex-M processors at the higher end of performance, such as Cortex-M55, may include an extended Performance Monitoring Unit that provides additional performance measuring capabilities, but these are outside the scope of this article. The smaller Cortex-M processors such as Cortex-M0, Cortex-M0+ and Cortex-M23 do not include the DWT capabilities described here, and, other than the Cortex-M23, do not include ETM instruction trace, but all Cortex-M processors provide the "tarmac" capability for the chip designers.
(Emphasis mine)
So other means have to be used:
some debuggers can measure the time between hitting two breakpoints (or between two stops), the accuracy of this is usually limited by interacting with the OS, so can easily be in the order of 20 ms
use an internal timer with high enough clock frequency to give reasonable results and start / stop it before and after the interesting region
toggle a pin and measure the time with a logic analyzer / oscilloscope
According to the CMSIS header file for the M0+ (core_cm0plus.h), the Core Debug Registers are only accessible over the Debug Access Port and not via the processor. I can only suggest using some free running timer (maybe SysTick) or perhaps your debugger can be of some help to get access to the required registers.
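A free-running SysTick measurement could look roughly like the sketch below. It assumes the SysTick timer is present and not already used by an RTOS or by a millis()-style tick. The struct and function names are hypothetical; the register layout and control bits follow the ARMv6-M SysTick definition (base address 0xE000E010), and with CMSIS you would normally use the provided `SysTick` object instead. SysTick is a 24-bit down-counter, so the elapsed value is only valid for regions shorter than 2^24 cycles.

```c
#include <stdint.h>

/* SysTick register block (ARMv6-M, base address 0xE000E010).
 * Field names follow the architecture manual; CMSIS calls them
 * CTRL, LOAD, VAL and CALIB. */
typedef struct {
    volatile uint32_t CSR;    /* control and status */
    volatile uint32_t RVR;    /* reload value, 24 bit */
    volatile uint32_t CVR;    /* current value, 24 bit, counts down */
    volatile uint32_t CALIB;  /* calibration value */
} systick_regs_t;

#define SYSTICK_MAX 0x00FFFFFFu

/* Start SysTick free-running on the processor clock, without interrupt. */
static inline void systick_start(systick_regs_t *st)
{
    st->RVR = SYSTICK_MAX;            /* longest possible period */
    st->CVR = 0;                      /* any write clears the counter */
    st->CSR = (1u << 2) | (1u << 0);  /* CLKSOURCE = core clock, ENABLE */
}

/* Cycles elapsed between two CVR readings. The counter counts DOWN,
 * so the earlier reading is the larger one unless it wrapped. */
static inline uint32_t systick_elapsed(uint32_t start, uint32_t end)
{
    return (start - end) & SYSTICK_MAX;
}
```

On the target, this would be used as `systick_start((systick_regs_t *)0xE000E010u);`, then reading `CVR` before and after the region of interest and passing both readings to `systick_elapsed`.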

Not able to read the pin value from Arduino Mega using PINxn

Using the registers of an Arduino Mega 2560, I am trying to read the state of PORTA. I referred to the datasheet (pages 69-72) and understood that I have to use PINxn (PINA) for this. But all I am getting as output is 0. I have connected the pin to an LED.
The code and the output are mentioned below.
CODE
#define F_CPU 16000000
#include <avr/io.h>
int main(void) {
DDRA = (1 << DDA0); // sets the pin OUTPUT
__asm__("nop\n\t");
PORTA = 0x01; // Sets it HIGH
unsigned int i = PINA;
Serial.println(i);
}
OUTPUT
0
Thanks in advance for your time – if I’ve missed out anything, over- or under-emphasised a specific point let me know in the comments.
If you want to read back the value previously written to the output, I recommend reading it from the register you wrote to, i.e. PORTA.
However, according to the provided documentation (bold by me):
13.2.4 Independent of the setting of Data Direction bit DDxn, the port pin can be read through the PINxn Register bit.
A possible explanation for reading back the old value immediately after writing a different one is the part that follows shortly after in the same chapter:
PINxn Register bit and the preceding latch constitute a synchronizer. This is
needed to avoid metastability if the physical pin changes value near the edge of the internal clock, but it also introduces
a delay.
So you will have to account for that delay.
Have a look at timing features provided e.g. by available libraries and at available timer hardware.
As a proof of concept, I suggest demonstrating this as follows:
print the value of PINA before writing the inverted value
write the inverted value to PORTA (inverting only the relevant bit of course)
read and print the value of PINA afterwards (hoping that your header uses volatile here) many times (say 1000)
I expect that you will see several old values, but then the new value.
Depending on how the printing is done (busy waiting?), once might be sufficient.
Your NOP (__asm__("nop\n\t");) might be designed to do the appropriate waiting. But I think it is misplaced (should be after writing new value) and it might be too short. If it is from example code, it should be sufficient. Move it, and maybe do it twice, to be sure for first try. That is likely to be effective.
You should put the "nop" between the "PORTA = " assignment and the "PINA" read. The instruction that writes to the PORTx register updates the state of the output pins only at the end of the system clock cycle, at the rising edge of the clock, whereas reading from the PINx register returns information that was latched in an intermediate buffer. The buffer latches in the middle of the previous clock cycle (i.e. at the falling edge of the clock).
So, reading from PINx is always delayed by 0.5 to 1.5 clock cycles.
If the logic level changed in some system clock cycle just before its middle (i.e. before the falling edge of the clock), then this value is latched immediately and is available through the PINx register in the next system clock cycle. Thus, the delay is 0.5 cycles.
If the logic level changed just after that latching moment, then it is latched only in the next cycle and is available for reading in the cycle after that, which introduces a delay of 1.5 cycles.
Writing to the PORTx register updates the output value at the end of the clock cycle, so it is latched only in the next cycle and is available for reading only in the cycle after that.
The C compiler is pretty good at optimization, so the two consecutive lines with the PORTA assignment and the PINA read were compiled to just two consecutive instructions, out PORTA, rxx and in ryy, PINA, which causes that effect.
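The timing described above can be checked with a toy model of the synchronizer. This is a simplified simulation, not AVR code: the type and function names are invented for illustration, every instruction is assumed to take exactly one clock cycle, and the pin is assumed to follow the PORTx bit without delay once the cycle ends. It models an `out` that sets the pin high, followed by zero or more `nop`s and then an `in` from PINx.

```c
#include <stdint.h>

/* Toy model of one I/O pin with its read synchronizer. */
typedef struct {
    uint8_t port;    /* PORTx bit written by software */
    uint8_t pin;     /* physical pin level */
    uint8_t latch;   /* synchronizer latch, samples the pin mid-cycle */
    uint8_t pinreg;  /* PINx bit, loaded from the latch at the clock edge */
} io_model_t;

typedef enum { OP_NOP, OP_OUT_HIGH, OP_IN } op_t;

/* Simulate one single-cycle instruction. Returns the value read by
 * OP_IN, and 0 for the other operations. */
static inline uint8_t io_step(io_model_t *m, op_t op)
{
    uint8_t value = 0;
    m->pinreg = m->latch;      /* start of cycle: PINx takes the latch */
    if (op == OP_IN)
        value = m->pinreg;     /* "in" sees what was latched last cycle */
    m->latch = m->pin;         /* middle of cycle: latch samples the pin */
    if (op == OP_OUT_HIGH)
        m->port = 1;
    m->pin = m->port;          /* end of cycle: output reaches the pin */
    return value;
}

/* Write the pin high, execute `nops` nop cycles, then read PINx. */
static inline uint8_t read_after_write(unsigned nops)
{
    io_model_t m = { 0, 0, 0, 0 };
    (void)io_step(&m, OP_OUT_HIGH);
    for (unsigned i = 0; i < nops; i++)
        (void)io_step(&m, OP_NOP);
    return io_step(&m, OP_IN);
}
```

In this model, `read_after_write(0)` returns the old value 0 and `read_after_write(1)` returns 1, which corresponds to moving the nop between the PORTA write and the PINA read.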

Instruction cycle (PIC18)

I'm trying to understand the steps that it takes to go through an instruction and their relationship with each oscillator cycle. The datasheet of the PIC18F4321 seems to divide this process into 2 basic steps: fetch and execution. But it does not seem to be consistent when saying which step belongs to which oscillator cycle. For example, it says:
Internally, the program counter is incremented on every Q1; the
instruction is fetched from the program memory and latched into
the Instruction Register (IR) during Q4.
This sounds odd, because it didn't mention Q2 and Q3. From this alone I would almost be led into thinking that fetching takes 1 oscillator cycle, since it happens in Q4. But reading just a little further, it says that:
The instruction fetch and execute are pipelined in such a manner that
a fetch takes one instruction cycle, while the decode and execute
take another instruction cycle. However, due to the pipelining, each
instruction effectively executes in one cycle.
So now it is telling me that fetching takes Q1 through Q4. Based on that, I would assume that if it were not for pipelining, instructions would take 2 instruction cycles to go through, since a full instruction cycle is used for fetching alone. But I understand how, in practice, pipelining makes it seem like it only takes 1 instruction cycle to go through an instruction.
Still a little bit further, and I believe this is the most confusing part, it says that:
In the execution cycle, the fetched instruction is latched into the
Instruction Register (IR) in cycle Q1. This instruction is then
decoded and executed during the Q2, Q3 and Q4 cycles. Data memory is
read during Q2 (operand read) and written during Q4
(destination write).
Based on this and other sources I have read, it seems like it divides the execution part into decoding, reading, processing and writing (it confuses me because it keeps using the word execution when I don't think it's actually referring to the execution portion of "fetch and execution").
1) Now, when does it do each? It is very clear when it says that read/write will happen in Q2/Q4. So Q3 should be processing?
2) What is the oscillator cycle for decoding?
3) Why do you have to latch the instruction to IR again in Q1 if you just did that in Q4 when you fetched for this same instruction?
disclaimer: I've never written PIC asm code, let alone done any performance analysis of a PIC. I mostly know about more powerful CPUs, like x86, from reading http://agner.org/optimize/, and stuff on http://realworldtech.com/. This answer is just based on the snippets of the manual you put in your question, because they do make sense to me. I might be completely misinterpreting something.
So in terms of the external clock, it's a 2 cycle pipeline (fetch|execute), with a quad-pumped clock in the execution core. The execution stage is subdivided into 4 pipelined stages. A bit like how Pentium4 had double-pumped execution units (i.e. one pipeline stage that uses a faster clock).
It sounds like yes, instruction execution happens in Q3.
2) What is the oscillator cycle for decoding?
I don't understand the question. It decodes one instruction per input clock, using the unmultiplied clock.
3) Why do you have to latch the instruction to IR again in Q1 if you
just did that in Q4 when you fetched for this same instruction?
It sounds like the PC is incremented in Q1, so during instruction execution it points to the next instruction. In Q4, that next instruction is done being fetched into IR in preparation for executing it next cycle. This is the instruction data itself (i.e. what PC is pointing to). I'm not sure about this part, but this makes sense.

How to recognize a start bit in asynchronous serial bit stream

I am writing some code for a microprocessor to communicate with an external device via asynchronous serial communication over a single wire.
I can recognize a transition on the wire from low/high (either way), so I can find the bit boundaries. Given that I know the baud rate the device is using I can then start clocking off bits, so I can read the stream of bits coming from the device.
What I'm struggling with conceptually is recognizing a start bit - finding the start of a byte frame (assuming I'm getting 8 bit, no parity, 1 start bit, 1 stop bit). I understand that each frame begins with a start bit and ends with a stop bit, but it is my understanding that start and stop bits look like any other bits - so there's nothing special about them that identifies them as start or stop bits (other than their position).
The only way I can think of to identify a start bit is that it will be the first high bit after a sustained idle period - that is, since I'm expecting 8 bits no parity, if I get 9 or more low bits then the line is idle, and the next high bit will be a start bit. That's all fine, but what if I start listening to the device mid-bitstream and there is no idle time of 9 bits or more on the wire? I am clocking off bits, but how do I recognize which bit is a start bit so I can read off a byte? If I'm clocking off bits, then anything in between frames can only be integer multiples of bits (so a stop "bit" can't be 1.5 bits for example), so everything just looks like bits.
I hope I'm making sense... thanks for any help.
The start bit is what gets your code to receive a byte going. Best explained with a state machine. You've got 4 basic states:
State "wait": sample the data line. When you see the start bit then start a timer at 1.5 * bit-time and move to state "data"
State "data": wait for the timer then sample the data line to record a bit. Restart the timer to 1.0 * bit-time. Repeat as long as you haven't received all bits. Move to state "stop" when all bits received
State "stop": wait for the timer and sample the data line to check the stop bit. Move to state "error" if it is wrong, add a byte to the receive buffer if it is not. Back to state "wait".
State "error": complain. Wait for deus ex machina to go back to state "wait".
So the basic insight from this is that you need the start bit to get the code going that receives a byte, and that the stop bit is important so you can reliably see the start bit of the next byte.
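The state machine above can be sketched in C roughly as follows. This is a minimal sketch under assumptions that are not in the question: the line idles high and the start bit is low (the usual UART convention; invert the tests if your line idles low), data is 8N1 sent LSB first, and the receiver is called once per timer tick with TICKS_PER_BIT ticks per bit time. All names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define TICKS_PER_BIT 8   /* assumed: timer ticks per bit time */

typedef enum { RX_WAIT, RX_DATA, RX_STOP, RX_ERROR } rx_state_t;

typedef struct {
    rx_state_t state;
    uint8_t timer;   /* ticks left until the next sample point */
    uint8_t bits;    /* data bits received so far */
    uint8_t shift;   /* received data, LSB first */
} uart_rx_t;

static inline void uart_rx_init(uart_rx_t *rx)
{
    rx->state = RX_WAIT;
    rx->timer = 0;
    rx->bits = 0;
    rx->shift = 0;
}

/* Call once per tick with the current line level.
 * Returns true and stores the byte in *out when a frame completes. */
static inline bool uart_rx_tick(uart_rx_t *rx, bool line, uint8_t *out)
{
    switch (rx->state) {
    case RX_WAIT:
        if (!line) {  /* start bit seen: aim at the middle of data bit 0 */
            rx->timer = TICKS_PER_BIT + TICKS_PER_BIT / 2;  /* 1.5 bit times */
            rx->bits = 0;
            rx->shift = 0;
            rx->state = RX_DATA;
        }
        return false;
    case RX_DATA:
        if (--rx->timer != 0)
            return false;
        if (line)
            rx->shift |= (uint8_t)(1u << rx->bits);
        rx->timer = TICKS_PER_BIT;                          /* 1.0 bit time */
        if (++rx->bits == 8)
            rx->state = RX_STOP;
        return false;
    case RX_STOP:
        if (--rx->timer != 0)
            return false;
        if (line) {   /* valid stop bit: deliver the byte */
            *out = rx->shift;
            rx->state = RX_WAIT;
            return true;
        }
        rx->state = RX_ERROR;  /* framing error */
        return false;
    case RX_ERROR:
        return false;  /* stays here until uart_rx_init() is called again */
    }
    return false;
}

/* Usage example: feed one idle bit plus one frame, return the decoded
 * byte or -1 if no valid frame was received. */
static inline int uart_rx_demo(uint8_t byte, bool valid_stop)
{
    bool frame[11];
    uart_rx_t rx;
    uint8_t out = 0;
    int result = -1;

    frame[0] = true;    /* idle */
    frame[1] = false;   /* start bit */
    for (int i = 0; i < 8; i++)
        frame[2 + i] = ((byte >> i) & 1u) != 0;
    frame[10] = valid_stop;

    uart_rx_init(&rx);
    for (int b = 0; b < 11; b++)
        for (int t = 0; t < TICKS_PER_BIT; t++)
            if (uart_rx_tick(&rx, frame[b], &out))
                result = out;
    return result;
}
```

If the receiver starts listening mid-stream and mistakes a data bit for a start bit, it will usually hit a wrong stop bit soon and land in the error state; re-initializing there and waiting for the next idle-to-start transition is one way to resynchronize.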
I would make your program read the bit sequences and look for patterns: at one alignment or another, shifting by up to 9 bits back and forth, the data will make sense, depending on what is being transmitted. Once a pattern is recognized, for example if the data contains a period, I would check whether at any point a run of bits that makes up a byte matches the code for a period (ASCII 46), then count bits back and forth from there to establish the positions of the start and stop bits.
