high-speed case construct assembler + load DPTR fast - 8051 - pointers

I'm implementing a serial routine for an 8051 part (specifically the AT89C4051) and I don't have much stack space or memory left. To achieve a decent baud rate on the serial port (38.4k or better), I need a high-speed case construct, because my serial interrupt routine builds a packet and checks it for validity.
Assume we are in the serial interrupt and R0 holds the address of the memory area where received data is stored. Let's assume the start address is 40h.
So here we go with a bunch of compares:
Branching via many compares
serial:
mov A,SBUF
mov @R0,A ;store the received byte at the current buffer position
mov A,R0
anl A,#07h ;our packet is 8 bytes so get current packet # based on what we stored so far
cjne A,#0h,nCheckMe ;this gets scanned if A=7... waste 2 clock cycles
//We're working on first byte
ajmp theend
nCheckMe:
cjne A,#1h,nCheckThem ;this gets scanned if A=7... waste 2 clock cycles
//We're working on second byte
ajmp theend
nCheckThem:
...
cjne A,#7h,nCheckEnd
//We're working on last byte
ajmp theend
nCheckEnd:
theend:
inc R0
reti
The above code might look practical at first, but as the byte index within the packet increases, the routine gets 2 clock cycles slower per position because of each extra "cjne" it has to fall through. For example, on the last byte every preceding "cjne" executes (2 cycles apiece) before the right case is reached, which adds noticeable delay.
Branching via jump
Now I thought of using a computed jump instead, but I can't figure out how to load DPTR quickly, because the interrupt can fire while some other code is using DPTR, so its value has to be preserved.
I thought of this code:
serial:
mov A,SBUF
mov @R0,A ;store the received byte at the current buffer position
mov A,R0
anl A,#07h ;our packet is 8 bytes so get current packet # based on what we stored so far
swap A ;multiply A by 16 and...
rr A ;...divide by 2, so A = packet index * 8, since each case block below uses 8 bytes of code space.
mov R3,DPH ;save DPTR high byte without breaking stack
mov R6,DPL ;save DPTR low byte
mov dptr,#table
jmp @A+DPTR
theend:
mov DPL,R6 ;restore DPTR low byte
mov DPH,R3 ;restore DPTR high byte
inc R0 ;move on to next position
reti
table:
;insert 8 bytes worth of code for 1st case
;insert 8 bytes worth of code for 2nd case
;insert 8 bytes worth of code for 3rd case
...
;insert unlimited bytes worth of code for last case
In my code R3 and R6 were free, so I used them to store the old DPTR value, but those mov instructions, as well as loading the new DPTR value, take 2 cycles each, for 10 cycles total (including restoring the old value).
Is there a faster way to implement a case construct in 8051 assembly so that my serial routine runs faster?

Don't run logic in the ISR if you can avoid it. If you insist, you might be able to dedicate DPTR to the ISR and only use it in normal code inside very short sections with interrupts disabled. Alternatively, a PUSH+RET trick could work.
Here is a chained approach, where each processed character just sets the address of the next step. If you can ensure the steps are within the same 256-byte block, you only ever need to update the low byte. The total overhead is 8 cycles, but you also save the 4 cycles of arithmetic, so it's a win of 6 cycles.
.EQU PTR, 0x20 ; two bytes of SRAM
; Initialize PTR to address of step1
serial:
push PTR ;low byte of the current step's address (pushed first)
push PTR+1 ;high byte (pushed last, so RET pops it into PCH first)
ret ;"return" into the current step
step1:
; do stuff
mov PTR, #low8(step2) ;next received byte is handled by step2
reti
last_step:
; do stuff
mov PTR, #low8(step1) ;wrap back around to the first step
reti
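For example, the "; do stuff" of step1 could hold the question's receive code for byte 0. A minimal sketch, assuming R0 still points at the 40h..47h packet buffer; the validity check is a placeholder, step2 is the handler for the next byte, and low8 stands for whatever low-byte operator your assembler provides:
step1:
mov A,SBUF ;grab the received byte
mov @R0,A ;store it in the packet buffer
inc R0
;byte-0-specific validity checks go here
mov PTR, #low8(step2) ;the next received byte is handled by step2
reti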

Related

SPI transaction terminates early - ESP-IDF

An ESP32 app using ESP-IDF (ESP32 SDK) communicates with two SPI slaves on the same SPI bus (ILI9341 TFT driver, NRF24L01+ RF transceiver). Overall, it works great. However, some of the data received from the RF transceiver is truncated, i.e. only the first few bytes are correct and the rest is garbage.
The problem is more or less reproducible and only occurs if there is SPI communication with the other slave (TFT driver) immediately before receiving the truncated data.
The problematic SPI transaction is a full-duplex transaction that sends a command byte and 10 dummy bytes while receiving 10 bytes. It uses the VSPI bus and DMA channel 1. If the problem occurs, only the first few bytes are correct while the last 2 to 6 bytes are invalid (0 or the value of the dummy bytes).
I dug into the SDK code (spi_master.c), added debug code and observed a surprising value in the DMA's lldesc_t struct:
At transaction start, it is initialized with length = 0x0c and size = 0x0c. 0x0c is 12 bytes, i.e. the 10 bytes rounded to the next word.
At transaction end, the values are length = 0x07 and size = 0x0c (length can vary slightly). So the transaction only reads 7 bytes and then somehow terminates. Or rather the DMA operations terminates.
Would you agree that the data indicates an early termination?
What could be the cause for the early termination?
Are there some registers that could indicate the cause of the problem?
The code is pretty straightforward:
uint8_t* buffer = heap_caps_malloc(32, MALLOC_CAP_DMA);
...
memset(buffer, CMD_NOP, len);
spi_transaction_t trx;
memset(&trx, 0, sizeof(spi_transaction_t));
trx.cmd = 0x61;
trx.tx_buffer = buffer;
trx.length = 8 * 10;
trx.rx_buffer = buffer;
trx.rxlength = 8 * 10;
esp_err_t ret = spi_device_transmit(spi_device, &trx);
It seems that the following warning – found in the SPI Slave driver documentation – also applies to an SPI master receiving data from a slave:
Warning: Due to a design peculiarity in the ESP32, if the amount of
bytes sent by the master or the length of the transmission queues in
the slave driver, in bytes, is not both larger than eight and
dividable by four, the SPI hardware can fail to write the last one to
seven bytes to the receive buffer.
I've now changed the sender side to send at least 12 bytes and multiples of 4 and the problem is gone.
Let me know if you think it only works by luck and my assumption is wrong.
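For reference, that rule can be captured in a small helper when sizing the data the slave sends. This is a hypothetical sketch of my own, not an ESP-IDF API; pad_spi_len is a made-up name:
#include <stddef.h>

// Pad a byte count so it is at least 12 and a multiple of 4, per the
// warning quoted above; the extra bytes are dummy/padding bytes.
static size_t pad_spi_len(size_t len)
{
    if (len < 12)
        len = 12;
    return (len + 3) & ~(size_t)3;  // round up to the next multiple of 4
}
// e.g. pad_spi_len(10) == 12, pad_spi_len(13) == 16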

how to get address of variable and dereference it in nasm x86 assembly?

In C we use & to get the address of a variable and * to dereference a pointer.
int variable=10;
int *pointer;
pointer = &variable;
How do I do this in NASM x86 assembly?
I read the NASM manual and found that [variable_address] works like dereferencing (I may be wrong).
section .data
variable db 'A'
section .text
global _start
_start:
mov eax , 4
mov ebx , 1
mov ecx , [variable]
mov edx , 8
int 0x80
mov eax ,1
int 0x80
I executed this code and it prints nothing. I can't understand what is wrong with my code.
I need your help to understand pointers and dereferencing in NASM x86 assembly.
There are no variables in assembly. (*)
variable db 'A'
This does several things. It defines the assembly-time symbol variable, which is like a bookmark into memory, holding the address of *here* at the time of assembly. It's the same thing as putting a label on an empty line, like:
variable:
The db 'A' directive is "define byte", and you give it a single byte value, so it produces a single byte in the resulting machine code with the value 0x41 (65 in decimal). That's the value of the capital letter A in ASCII encoding.
Then:
mov ecx , [variable]
This loads 4 bytes from memory starting at address variable, which means the low 8 bits of ecx will contain the value 65 and the upper 24 bits will contain whatever junk happened to reside in the 3 bytes following the 'A'. (Had you used db 'ABCD', ecx would end up holding 0x44434241 – the letters 'D' 'C' 'B' 'A', "reversed" because of the little-endian encoding of dword values on x86.)
But sys_write expects ecx to hold the address of the memory where the bytes to write are stored, so you need instead:
mov ecx, variable
That will, in NASM, load the address of the data into ecx.
(In MASM/TASM this would instead assemble as mov ecx,[variable], and to get the address you have to use mov ecx, OFFSET variable. If you happen to find some MASM/TASM example, be aware of the syntax difference.)
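Putting it together, a corrected version of the question's program might look like this (a sketch; edx is set to 1 because only a single byte was defined):
section .data
variable db 'A'
section .text
global _start
_start:
mov eax, 4          ; sys_write
mov ebx, 1          ; fd 1 = stdout
mov ecx, variable   ; pass the ADDRESS of the byte, not its contents
mov edx, 1          ; write exactly the one byte that was defined
int 0x80
mov eax, 1          ; sys_exit
xor ebx, ebx        ; exit code 0
int 0x80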
*) Some more info about "no variables". Keep in mind that in assembly you are at the machine level. At the machine level there is computer memory, which is addressable by bytes (on the x86 platform! There are platforms where memory is addressable in other sizes; they are not common, but in the micro-controller world you may find some). So by using a memory address you can access particular byte(s) in the physical memory (which physical place in the memory chip is addressed depends on your platform; a modern OS will usually give a user application a virtual address space, translated to physical addresses by the CPU on the fly, transparently, without bothering user code about that translation).
All the advanced logical concepts like "variables", "arrays", "strings", etc. are just bunches of byte values in memory, and all that logical meaning is given to the data by the instructions being executed. When you look at the data without the context of those instructions, they are just byte values in memory, nothing more.
So if you are not precise with your code and you access a single-byte "variable" with an instruction that fetches a dword, as you did in your mov ecx,[variable] example, there's nothing wrong with that from the machine's point of view: it will happily fetch 4 bytes of memory into the ecx register, and NASM won't bother to warn you that you are probably reading out of bounds, beyond your original variable definition. This looks like stupid behaviour if you think in terms of "variables" and other high-level language concepts, but assembly is not intended for that; having full control over the machine is the main purpose of assembly, and if you want to fetch 4 bytes, you can – it's all up to the programmer. It just requires a tremendous amount of precision and attention to detail: staying aware of your memory layout and using the correct instructions with the desired memory operand sizes, like movzx ecx, byte [variable] to load only a single byte from memory and zero-extend it into the full 32-bit value in the target ecx register.
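To make the distinction explicit, here are the three forms side by side (NASM syntax, using the variable from the question):
mov ecx, variable          ; ecx = the address of the 'A' byte
mov ecx, [variable]        ; ecx = 4 bytes starting at that address ('A' plus 3 junk bytes)
movzx ecx, byte [variable] ; ecx = 0x41, zero-extended to 32 bits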

Understanding STM8 pipelining

I'm trying to understand STM8 pipelining so that I can predict how many cycles my code will need.
I have this example, which should toggle a GPIO pin every 4 cycles.
If (and only if) the loop is aligned at a 4-byte boundary + 3, the pin stays active for 5 cycles (i.e. one more than it should). I wonder why?
// Switches port D2, 5 cycles high, 4 cycles low
void main(void)
{
__asm
bset 0x5011, #2 ; output mode
bset 0x5012, #2 ; push-pull
bset 0x5013, #2 ; fast switching
jra _loop
.bndry 4
nop
nop
nop
_loop:
nop
bset 0x500f, #2
nop
nop
nop
bres 0x500f, #2
jra _loop
__endasm;
}
A bit more context:
bset/bres are 4-byte instructions, nop is 1 byte.
The nop/bset/bres instructions take 1 cycle each.
The jra instruction takes two cycles. I think in the first cycle the instruction cache is filled with the next 32-bit value, i.e. in this case just the nop instruction, and the 2nd cycle is actually just the CPU being stalled while decoding the next instruction.
So in cycles:
1. bres clears the pin
2. jra, pipeline flush, nop fetch
3. nop decode, bset fetch
4. nop execute, bset decode, next nop fetch
5. bset execute sets the pin
6. nop, bres fetch
7. nop
8. nop, bres decode
9. bres execute clears the pin
According to this, the pin should stay LOW for 4 cycles and HIGH for 4 cycles, but it’s staying HIGH for 5 cycles.
In any other alignment case, the pin is LOW/HIGH for 4 cycles as expected.
I think that if the pin stays high for an extra cycle, it must mean that the execution pipeline is stalled after the bset instruction (the nops thereafter provide enough time to make sure that the later bres is ready to execute immediately). But according to my understanding, the nop (for step 6) would already have been fetched in step 4.
Any idea how this behavior can be explained? I couldn’t find any hints in the manual.
It is explained in section 5.4, which basically says that throughout the programming manual "a simplified convention providing a good match with reality" will be used. In my experience this simplified convention is indeed a good approximation for a longer sequence, but unusable for exact per-instruction timing, even if you're working at the assembly level and control alignment. Take "SLA addr" as an example. It is documented to take 1 cycle. Put three of them in sequence to implement the C equivalent of "*(addr) <<= 3", and you'll clock up 5-6 cycles.
The actual cycles used for decoding and execution are undocumented, and apart from the obvious cases there is no comprehensive documentation of what causes pipeline stalls. I was able to get some insight into this by configuring TIM2 with a prescaler of /1 and a reload value of 0xFFFF while using an ST-LINK/V2 to step through my code. You can then keep a watch on TIM2_CNTRL to see the cycles consumed (i.e. the aggregate cost of executing the previous instruction and decoding the current one).
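For reference, that measurement setup might look roughly like this in C (a sketch; stm8s.h and the TIM2_* register names are assumptions standing in for whatever your device header provides):
#include "stm8s.h"          // assumed device header exposing TIM2_* registers

static void start_cycle_counter(void)
{
    TIM2_PSCR = 0x00;       // prescaler /1: one count per CPU cycle
    TIM2_ARRH = 0xFF;       // auto-reload 0xFFFF so the counter free-runs
    TIM2_ARRL = 0xFF;
    TIM2_CR1 |= 0x01;       // CEN = 1: start counting
    // Single-step in the debugger and watch TIM2_CNTRH/TIM2_CNTRL to see
    // how many cycles the previous step consumed.
}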
Things to keep an eye on are obviously instructions spanning 32-bit boundaries. There were also cases where loading instructions from the next 32-bit word caused an unexpected additional cycle in a sequence of NOPs, suggesting that any fetch (even one not necessary for the current or next instruction) costs 1 cycle? I've seen CALLs to targets aligned to 32-bit boundaries take 4-7 cycles, suggesting that the CPU was still busy executing the previous instruction or stalling the call for an unknown reason. Modifying SP (push/pop or direct add/sub) also seems to cause stalls under certain conditions.
Any additional insight appreciated!

What happens to instruction pointers when address overrides are used to target a smaller address space?

What happens to the instruction pointer when address overrides are used to target a smaller address space, e.g. the default is a 32-bit address space but the override switches it to 16-bit?
So, let's say we're in x86-32 mode and the default is a 32-bit memory space for the current code segment we're in.
Further, the IP register contains the value 87654321h.
If I use 67h to override the default and make the memory space 16-bit for just that one instruction, how does the processor compute the offset into the current code segment?
Some bits of EIP would have to be ignored, otherwise you'd be outside the 16-bit memory space specified by the override.
So, does the processor just ignore the 8765 part of the IP register?
That is, does it use only the 16 least significant bits (4321h) and ignore the 16 most significant bits?
What about address overrides associated with access to data segments?
For example, we're in x86-32 mode, the default is 32 bit memory addressing and we use 67h prefix for this instruction: mov eax, [ebx].
Now, ebx contains a 32 bit number.
Does the 67h override change the above instruction to: mov eax, [bx]?
What about "constant pointers"? Example: mov eax, [87654321].
Would the 67h override change it to mov eax, [4321]?
Does the memory override affect the offset into the data segment also or just the code segment?
How do address overrides affect the stack pointer?
If the stack pointer contains a 32 bit number (again we'll use 87654321h) and I push or pop, what memory is referenced?
Pushing and popping indirectly accesses memory.
So, would only the 4321 part of the stack pointer be used, ignoring the most significant bits?
Also, what about the segment bases themselves?
Example: we're in x86-32 mode, default 32 bit memory space, but we use 67h override.
The CS register points to a descriptor in the GDT whose segment base is, again lol, 87654321h.
We're immediately outside of the 16-bit memory range without even adding an offset.
What does the processor do? Ignore the upper 16 bits of the base? The same question can be applied to the segment descriptors for the data and stack segments.
0x67 is the address-size prefix. It changes the interpretation of an addressing mode in the instruction.
It does not put the machine temporarily into 16-bit mode or truncate EIP to 16-bit, or affect any other addresses that don't explicitly come from an [addressing mode] in the instruction.
For push/pop, the instruction reference manual entry for push says:
The address size is used only when referencing a source operand in memory.
So in 32-bit mode, a16 push eax would still set esp-=4 and then store [esp] = eax. It would not truncate ESP to 16 bits. The prefix would have no effect, because the only memory operand is implicit not explicit.
push [ebx] is affected by the 67 prefix, though.
db 0x67
push dword [ebx]
would decode as push dword [bp+di], and load 32 bits from that 16-bit address (ignoring the high 16 bits of those registers). 16-bit addressing modes use a different encoding than 32/64-bit ones, with no optional SIB byte.
However, it would still update the full esp, and store to [esp].
(For the effective-address encoding details, see Intel's volume 2 PDF, Chapter 2: INSTRUCTION FORMAT, table 2-1 (16-bit) vs. table 2-2 (32-bit).)
In 64-bit mode, the address-size prefix would turn push [rbx] into push [ebx].
Since some forms of push can be affected by the address-size prefix, this might not fall into the category of meaningless prefixes, use of which is reserved and may produce unpredictable behaviour in future CPUs. (What happens when you use a memory override prefix but all the operands are registers?). OTOH, that may only apply to the push r/m32 opcode for push, not for the push r32 short forms that can't take a memory operand.
I think the way it's worded, Intel's manual really doesn't guarantee that even the push r/m32 longer encoding of push ebx wouldn't decode as something different in future CPUs with a 67 prefix.
For example, we're in x86-32 mode, the default is 32 bit memory addressing and we use 67h prefix for this instruction: mov eax, [ebx].
Now, ebx contains a 32 bit number.
Does the 67h override change the above instruction to: mov eax, [bx]?
What about "constant pointers"? Example: mov eax, [87654321].
Would the 67h override change it to mov eax, [4321]?
The address size override doesn't just change the size of the address, it actually changes the addressing scheme.
A 67 override on mov eax, [ebx] changes it to mov eax, [bp+di].
A 67 override on mov eax, [87654321] changes it to mov eax, [di] (followed by and [ebx+65], eax and some xchg instruction).
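To see that last example at the byte level (NASM syntax, raw bytes only so the two decodings can be compared; this matches the decode described above):
bits 32
; mov eax, [0x87654321] in its ModRM form, with a 67h prefix in front:
db 0x67, 0x8B, 0x05, 0x21, 0x43, 0x65, 0x87
; Without the 67h prefix, 8B 05 21 43 65 87 decodes as  mov eax, [0x87654321]
; With the prefix, the CPU instead decodes               mov eax, [di]
; and the leftover bytes 21 43 65 become                 and [ebx+0x65], eax
; with 87 starting an xchg instruction whose ModRM comes from the next byte.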

Serial Communication 8051

I am studying serial communication on the 8051 using the UART and interrupts. Today I came across this code, in which the author says he is constantly transferring the data arriving on Port 0. The way the transfer is done, I think, violates the rules of serial communication on the 8051.
org 00h
ljmp main
org 23h
ljmp serial_ISR
org 30h
main:
mov TMOD,#20h
mov TH1,#-03h
mov SCON,#50h
setb IE.7
setb IE.4
setb TR1
back:
mov A,P0
mov SBUF,A
sjmp back
serial_ISR:
jb TI,trans
mov R0,SBUF
clr RI
RETI
trans:
clr TI
RETI
The thing that is confusing me is that under the back label we are constantly writing to the SBUF register, which violates the rule that we should not write to SBUF until the previous data has been sent (which is signalled by the TI flag).
Is constantly writing data to the SBUF register in the above code valid? Will the UART send correct data?
Regards
You are definitely right; the code under the back label should be rewritten like this:
back:
jb TI,$
mov A,P0
mov SBUF,A
sjmp back
Coding the back label this way guarantees that you are not going to move any data to SBUF until it has finished sending the last byte.
There is one detail to take into account: remember that the serial port interrupt flags (for both receiving and transmitting) are not cleared automatically, so in the code above I am assuming that you clear the TI flag manually.
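For comparison, the classic polled-transmit idiom is sketched below (one way to honour that rule, shown with transmission handled purely by polling rather than by the serial ISR; the labels back_poll/wait are made up):
back_poll:
mov A,P0
mov SBUF,A ;start sending the byte
wait:
jnb TI,wait ;wait until the byte has been completely shifted out
clr TI ;clear the flag manually, as noted above
sjmp back_poll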
