Immediatea Addressing mode used in instructions containing memory locations - cpu-registers

Suppose we have:
MOV #NUM, R0
I understand that the hashtag represents an immediate addressing mode. However, what I don't understand what exactly gets stored in R0 in this case. Is it the actual address of NUM?

Related

In assembly, Is it always necessary to specify "DWORD/WORD/BYTE..." when using "mov/lea"? [duplicate]

I am just starting studying Assembly (x86, NASM) in university and I am really confused about how it works. Out of the many questions I have about it, this keeps bugging me.
When do I need to speficify the size of the operand? Is there a rule? For example:
segment data use32 class=data
a db 10
b dw 40
segment code use32 class=code
start:
mov AX, [b]
div BYTE [a]
Here we specified the size of the operand in the div opcode as BYTE. If I delete that BYTE part, I get an error, so we need to specify it.
segment data use32 class=data
a db 10
b dw 40
segment code use32 class=code
start:
mov AH, 2
mul AH
Here, we didn't need to specify the size of the operand 2. It just works.
So when do I have to specify the size? Is it as simple as: when I have a variable declared in memory, specify its size? Considering the examples given above, I am inclined to think so, but through my short experience with Assembly I found that it tends to defy my logic as to how things should work.
Also, after illuminating me about when we need to specify the size, can you please also tell me WHY we need to do this? When we need to do it, why do we need to do it? I mean, we already declared the variable, so the type of the variable should be visible to the program, shouldn't it? Why do we need to specify the size, else we get an error?
You don't need to specify an operand size if it can be inferred from something else you did specify. For example, mov only works on two operands of the same size, and AX is a word-sized register, so in mov AX, [b], it can infer that [b] must be word-sized. But you only specify one operand to div, so you have to tell it what size [a] is since it doesn't have any information to infer it from.

Dereference pointers in XMM register (gather)

If I have some pointer or pointer-like values packed into an SSE or AVX register, is there any particularly efficient way to dereference them, into another such register? ("Particularly efficient" meaning "more efficient than just using memory for the values".) Is there any way to dereference them all without writing an intermediate copy of the register out to memory?
Edit for clarification: that means, assuming 32-bit pointers and SSE, to index into four arbitrary memory areas at once with the four sections of an XMM register and return four results at once to another register. Or as close to "at once" as possible. (/edit)
Edit2: thanks to PaulR's answer I guess the terminology I'm looking for is "gather", and the question therefore is "what's the best way to implement gather for systems pre-AVX2?".
I assume there isn't an instruction for this since ...well, one doesn't appear to exist as far as I can tell and anyway it doesn't seem to be what SSE is designed for at all.
("Pointer-like value" meaning something like an integer index into an array pretending to be the heap; mechanically very different but conceptually the same thing. If, say, one wanted to use 32-bit or even 16-bit values regardless of the native pointer size, to fit more values in a register.)
Two possible reason I can think of why one might want to do this:
thought it might be interesting to explore using the SSE registers for general-purpose... stuff, perhaps to have four identical 'threads' processing potentially completely unrelated/non-contiguous data, slicing through the registers "vertically" rather than "horizontally" (i.e. instead of the way they were designed to be used).
to build something like romcc if for some reason (probably not a good one), one didn't want to write anything to memory, and therefore would need more register storage.
This might sound like an XY problem, but it isn't, it's just curiosity/stupidity. I'll go looking for nails once I have my hammer.
The question is not entirely clear, but if you want to dereference vector register elements then the only instructions which might help you here are AVX2's gathered loads, e.g. _mm256_i32gather_epi32 et al. See the AVX2 section of the Intel Intrinsics Guide.
SYNOPSIS
__m256i _mm256_i32gather_epi32 (int const* base_addr, __m256i vindex, const int scale)
#include "immintrin.h"
Instruction: vpgatherdd ymm, vm32x, ymm
CPUID Flag : AVX2
DESCRIPTION
Gather 32-bit integers from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.
OPERATION
FOR j := 0 to 7
i := j*32
dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale]
ENDFOR
dst[MAX:256] := 0
So if I understood this correctly, your title is misleading and you really want to:
index into the concatenation of all XMM registers
with an index held in a part of an XMM register
Right?
That's hard. And a little weird, but I'm OK with that.
Assuming crazy tricks are allowed, I propose self-modifying code: (not tested)
pextrb eax, xmm?, ? // question marks are the position of the pointer
mov edx, eax
shr eax, 1
and eax, 0x38
add eax, 0xC0 // C0 makes "hack" put its result in eax
mov [hack+4], al // xmm{al}
and edx, 15
mov [hack+5], dl // byte [dl] of xmm reg
call hack
pinsrb xmm?, eax, ? // put value back somewhere
...
hack:
db 66 0F 3A 14 00 00 // pextrb ?, ? ,?
ret
As far as I know, you can't do that with full ymm registers (yet?). With some more effort, you could extend it to xmm8-xmm15. It's easily adjustable to other "pointer" sizes and other element sizes.

hazards in a three-way superscalar pipeline

I am working my way though exercises relating to superscalar architecture. I need some help conceptualizing the answer to this question:
“If you ever get confused about what a register renamer has to do, go back to the assembly code you're executing, and ask yourself what has to happen for the right result to be obtained. For example, consider a three-way superscalar machine renaming these three instructions concurrently:
ADDI R1, R1, R1
ADDI R1, R1, R1
ADDI R1, R1, R1
If the value of R1 starts out as 5, what should its value be when this sequence has executed?”
I can look at that and see that, ok, the final value of R1 should be 40. How would a three-way superscalar machine reach this answer though? If I understand them correctly, in this three-way superscalar pipeline, these three instructions would be fetched in parallel. Meaning, you would have a hazard right from start, right? How should I conceptualize the answer to this problem?
EDIT 1: When decoding these instructions, the three-way superscalar machine would, by necessity, have to perform register renaming to get the following instruction set, correct:
ADDI R1, R2, R3
ADDI R4, R5, R6
ADDI R1, R2, R3
Simply put - you won't be able to perform these instructions together. However, the goal of this example doesn't seem to do with hazards (namely - detecting that these instruction are interdependent and must be performed serially with sufficient stalls), it's about renaming - it serves to show that a single logical register (R1) will have multiple physical "versions" in-flight simultaneously in the pipeline. The original one would have the value 5 (lets call it "p1"), but you'll also need to allocate one for the result of the first ADD ("p2"), to be used as source for the second, and again for the results of the second and third ADD instructions ("p3" and "p4").
Since this processor decodes and attempts to issue these 3 instructions simultaneously, you can see that you can't just have R1 as the source for all - that would prevent each of them from using the correct mid-calculation value, so you need to rename them. The important part is that p1..p4 as we've dubbed them, can be allocated simultaneously, and the dependencies would be known at the time of issue - long before each of them is populated with the result. This essentially decouples the front-end from the execution back-end, which is important for the performance flexibility in modern CPUs as you may have bottlenecks anywhere.

Z80 memory refresh register

Me again with another innocuous Z80 question :-) The way my emulator core is currently structured, I am incrementing the lower 7 bits of the memory refresh register every time an opcode byte is fetched from memory - this means for multi-byte instructions, such as those that begin DD or FD, I am incrementing the register twice - or in the instance of an instruction such as RLC (IX+d) three times (as it is laid out opcode1-opcode2-d-opcode3).
Is this correct? I am unsure - the Z80 manual is a little unclear on this, as it says that CPDR (a two byte instruction) increments it twice, however the 'Memory Refresh Register' section merely says it increments after each instruction fetch. I have noticed that J80 (an emulator I checked as I'm not sure about this) only increments after the first opcode byte of an instruction.
Which is correct? I guess it is not hugely important in any case, but it would be nice to know :-) Many thanks.
Regards,
Phil Potter
The Zilog timing diagrams hold the answer to your question.
A refresh occurs during T3 and T4 of all M1 (opcode fetch) cycles.
In the case of single-opcode instructions, that's one refresh per instruction. For single-prefix instructions (prefixes are read using M1 cycles) that's two refreshes per instruction.
For those weird DD-CB-disp-opcode and FD-CB-disp-opcode type instructions (weird because the displacement byte comes before the final opcode rather than after it), the number of refreshes is at least 3 (for the two prefixes and final opcode), but I'm not sure if the displacement byte is read as part of an M1 cycle (which would trigger another refresh) or a normal memory read cycle (no refresh). I'm inclined to believe the displacement byte is read in an M1 cycle for these instructions, but I'm not sure. I asked Sean Young about this; he wasn't sure either. Does anyone know for certain?
UPDATE:
I answered my own question re those weird DD-CB-disp-opcode and FD-CB-disp-opcode instructions. If you check Zilog's documentation for these type instruction, such as
RLC (IX+d), you'll note that the instruction requires 6 M-cycles and 23 T-cycles broken down as: (4,4,3,5,4,3).
We know the first two M-cycles are M1 cycles to fetch the DD and CB prefixes (4 T-cycles each). The next M-cycle reads the displacement byte d. But that M-cycle uses only 3 T-cycles, not 4, so it can't be an M1 cycle; instead it's a normal Memory Read cycle.
Here's the breakdown of the RLC (IX+d) instruction's six M-cycles:
M1 cycle to read the 0xDD prefix (4 T-cycles)
M1 cycle to read the 0xCB prefix (4 T-cycles)
Memory Read cycle to read the displacement byte (3 T-cycles)
M1 cycle to fetch the 0x06 opcode and load IX into the ALU (5 T-cycles)
Memory Read cycle to calculate and read from address IX+d (4 T-cycles)
Memory Write cycle to calculate RLC and write the result to address IX+d (3 T-cycles)
(The RLC calculation overlaps M-cycles 5 and 6.)
These type instructions are unique in that they're the only Z80 instructions that have non-contiguous M1 cycles (M-cycles 1, 2 and 4 above). They're also the slowest!
Paul
Sean Young's Z80 Undocumented Features has a different story. Once for unprefixed, twice for a single prefix, also twice for a double prefix (DDCB only), and once for no-op prefix.
Block instructions of course affect R every time they run (and they run BC times).
I've seen a couple of comments now that these weird DDCB and FDCB instructions only increment the R register twice.
It's always been my assumption (and the way I implemented my Z80 emulator) that the R register is implemented at the end of every M1 cycle.
To recap, these weird DDCB and FDCB instructions are four bytes long:
DD CB disp opcode
FD CB disp opcode
It's clear that the two prefix opcodes are read using M1 cycles, causing the R register to be incremented at the end of each of those cycles.
It's also clear that the displacement byte that follows the CB prefix is read by a normal Read cycle, so the R register is not incremented at the end of that cycle.
That leaves the final opcode. If it's read by an M1 cycle, then either the R register is incremented at the end of the cycle, resulting in a total of 3 increments, or the Z80 special cases this M1 cycle and doesn't increment the R register.
There's another possibility. What if the final opcode is read by a normal Read cycle, like the displacement byte that preceded it, and not by an M1 cycle? That of course would also cause the R register to be incremented only twice for these instructions, and wouldn't require the Z80 to make an exception of not incrementing the R register at the end of every M1 cycle.
This might also make better sense in terms of the Z80's internal state. Once it switches to normal Read cycles to read an instruction's additional bytes (in this case the displacement byte following the CB prefix), it never switches back to M1 cycles until it starts the next instruction.
Can anyone test this on real Z80 hardware, to confirm the value of R register following one of these DDCB or FDCB instructions?
All references I can find online say that R is incremented once per instruction irrespective of its length.

Assembler memory address representation

I'm trying to get into assembler and I often come across numbers in the following form:
org 7c00h
; initialize the stack:
mov ax, 07c0h
mov ss, ax
mov sp, 03feh ; top of the stack.
7c00h, 07c0h, 03feh - What is the name of this number notation? What do they mean? Why are they used over "normal" decimal numbers?
It's hexadecimal, the numeral system with 16 digits 0-9 and A-F. Memory addresses are given in hex, because it's shorter, easier to read, and the numbers that represent memory locations don't mean anything special to humans, so no sense to have long numbers. I would guess that somewhere in the past someone had to type in some addresses by hand as well, might as well have started there.
Worth noting also, 0:7C00 is the boot sector load address.
Further worth noting: 07C0:03FE is the same address as 0:7FFE due to the way segmented addressing works.
This guy's left himself a 510 byte stack (he made the very typical off-by-two error in setting up the boot sector's stack).
These are numbers in hexadecimal notation, i.e. in base 16, where A to F have the digit values 10 to 15.
One advantage is that there is a more direct conversion to binary numbers. With a little bit of practice it is easy to see which bits in the number are 1 and which are 0.
Another is is that many numbers used internally, such as memory addresses, are round numbers in hexadecimal, i.e. contain a lot of zeros.

Resources