ARMv8 Foundation Model: switches and leds - simulator

I am trying to boot my small ARMv7 kernel (which runs just fine using qemu vexpress model) in ARMv8 Foundation Model v2.1. The model boots at level EL3 / 64 bits, and I managed to go down to level EL1 / 32 bits, but I encounter some issues (in a few words, the timer doesn't tick and some kprintf are missing, but that's not the issue here).
To debug my UART issue, I wanted to use the led / switches provides by the model. I can read their value from software quite easily, but I can't write a new value to either of them. The kernel seems to hang. Here is a minimal asm code that writes to the switches register:
.global Start
Start:
# we are in EL3 / 64 bits mode
# create the 0x1C010000 + 0x4 address of switches
mov x0, #4
movk x0, #0x1c01, lsl #16
# value to write
mov w1, #0xaa
# actual writing
strb w1, [x0]
It seems I am stuck at the strb instruction. For the record, if I replace strb with ldrb, I can correctly read and display the value of this register (I played with the --switches flag to be sure it worked).
Any one knows what I am doing wrong here ?
EDIT: thanks to unixsmurf suggestions, I know now that I got an synchronous Data Abort Exception with no level change, and that the reason is "Synchronous External Abort". I don't know how to inspect further, I guess I'll try ARM's forum.
Best,
V.

The ARM community finally solved the problem. The complete discussion can be found here.

Related

MSP430 instruction cycles

Hi i am working on Tmote sky motes (MSP430 microprocessor) with contiki os. I want to know the number of instruction cycles used when I do a multiplication operation in my programming (software).
Thank you,
Avijit
The msp430 is a 16-bit system, so 32-bit values are not supported directly. A 32-bit operation is typically translated to assembly code as a sequence of 16-bit ops.
The execution times of 8-bit and 16-bit operations can be found in TI application report "The MSP430Hardware Multiplier":
Table 4. CPU Cycles Needed With Different Multiplication Modes
OPERATION: Unsigned Multiply (MPY)
SOFTWARE LOOP: 139...171
HARDWARE MPYer: 8
SPEED INCREASE: 17.4...21.4
OPERATION: Unsigned multiply-and-accumulate (MAC)
SOFTWARE LOOP: 137...169
HARDWARE MPYer: 8
SPEED INCREASE: 17.1...21.1
OPERATION: Signed Multiply (MPYS)
SOFTWARE LOOP: 145...179
HARDWARE MPYer: 8
SPEED INCREASE: 18.1...22.4
OPERATION: Signed multiply-and-accumulate (MAC)
SOFTWARE LOOP: 143...177
HARDWARE MPYer: 17
SPEED INCREASE: 8.4...10.4
The HW multiplier should be active with default compilation settings, but check the generated object file with msp430-objdump to make sure.
You can use naken_asm by Michael Kohn to disaemble an Intel hex or ELF file and it will calculate the cycle counts for each instruction. I've used it in the past and the cycle counter is OK for CPU (such as in your Tmote) but not fully supported in CPUX.
You can invoke it from the command line as simply as:
naken_util -disasm <infile>
where <infile> is the name of your hex or ELF file. The default processor is MSP430, but you'd need the assembly listing from your compiler in order to be able to match up the original code with the disassembled code which includes cycle counts.
Another alternative would be to use MSPDebug's tracer option which can track running software and provide an up-to-date instruction cycle count. However, I've never used it for that purpose so cannot provide an example.

hazards in a three-way superscalar pipeline

I am working my way though exercises relating to superscalar architecture. I need some help conceptualizing the answer to this question:
“If you ever get confused about what a register renamer has to do, go back to the assembly code you're executing, and ask yourself what has to happen for the right result to be obtained. For example, consider a three-way superscalar machine renaming these three instructions concurrently:
ADDI R1, R1, R1
ADDI R1, R1, R1
ADDI R1, R1, R1
If the value of R1 starts out as 5, what should its value be when this sequence has executed?”
I can look at that and see that, ok, the final value of R1 should be 40. How would a three-way superscalar machine reach this answer though? If I understand them correctly, in this three-way superscalar pipeline, these three instructions would be fetched in parallel. Meaning, you would have a hazard right from start, right? How should I conceptualize the answer to this problem?
EDIT 1: When decoding these instructions, the three-way superscalar machine would, by necessity, have to perform register renaming to get the following instruction set, correct:
ADDI R1, R2, R3
ADDI R4, R5, R6
ADDI R1, R2, R3
Simply put - you won't be able to perform these instructions together. However, the goal of this example doesn't seem to do with hazards (namely - detecting that these instruction are interdependent and must be performed serially with sufficient stalls), it's about renaming - it serves to show that a single logical register (R1) will have multiple physical "versions" in-flight simultaneously in the pipeline. The original one would have the value 5 (lets call it "p1"), but you'll also need to allocate one for the result of the first ADD ("p2"), to be used as source for the second, and again for the results of the second and third ADD instructions ("p3" and "p4").
Since this processor decodes and attempts to issue these 3 instructions simultaneously, you can see that you can't just have R1 as the source for all - that would prevent each of them from using the correct mid-calculation value, so you need to rename them. The important part is that p1..p4 as we've dubbed them, can be allocated simultaneously, and the dependencies would be known at the time of issue - long before each of them is populated with the result. This essentially decouples the front-end from the execution back-end, which is important for the performance flexibility in modern CPUs as you may have bottlenecks anywhere.

Does OpenCL always zero-initialize device memory?

I've noticed that often, global and constant device memory is initialized to 0. Is this a universal rule? I wasn't able to find anything in the standard.
No it doesn't. For instance I had this small kernel to test atomic add:
kernel void atomicAdd(volatile global int *result){
atomic_add(&result[0], 1);
}
Calling it with this host code (pyopencl + unittest):
def test_atomic_add(self):
NDRange = (4, 4)
result = np.zeros(1, dtype=np.int32)
out_buf = cl.Buffer(self.ctx, self.mf.WRITE_ONLY, size=result.nbytes)
self.prog.atomicAdd(self.queue, NDRange, NDRange, out_buf)
cl.enqueue_copy(self.queue, result, out_buf).wait()
self.assertEqual(result, 16)
was always returning the correct value when using my CPU. However on a ATI HD 5450 the returned value was always junk.
And If I well recall, on an NVIDIA the first run was returning the correct value, i.e. 16, but for the following run, the values were 32, 48, etc. It was reusing the same location with the old value still stored there.
When I corrected my host code with this line (copying the 0 value to the buffer):
out_buf = cl.Buffer(self.ctx, self.mf.WRITE_ONLY | self.mf.COPY_HOST_PTR, hostbuf=result)
Everything worked fine on any devices.
As far as I know there is no sentence in standard that states this.
Maybe some driver implementations will do this automatically, but you shoudn't rely on it.
I remember that once I had a case where a buffer was not initialized to 0, but I can't remember the settings of "OS + driver".
Probably what is going on is that the typical OS does not use even 1% of now a days devices memory. So when you start a OpenCL, there is a huge probability that you will fall into an empty zone.
It depends on the platform that you are developing. As #DarkZeros mentioned in the previous answers, the spec does not imply anything. Please see page 104 of OpenCL 2.1 Spec.
However, based on our experience in Mali GPUs, the driver initializes all elements of the newly allocated buffers to zero. This is for the first touch. Later on, as time goes by, and we release this buffer and its memory space is occupied by a new buffer, that memory space is not initialized with zero. "Again, the first touch sees the zero values. After that, you would see normal gibberish values."
Hope this helps after such long time!

Z80 memory refresh register

Me again with another innocuous Z80 question :-) The way my emulator core is currently structured, I am incrementing the lower 7 bits of the memory refresh register every time an opcode byte is fetched from memory - this means for multi-byte instructions, such as those that begin DD or FD, I am incrementing the register twice - or in the instance of an instruction such as RLC (IX+d) three times (as it is laid out opcode1-opcode2-d-opcode3).
Is this correct? I am unsure - the Z80 manual is a little unclear on this, as it says that CPDR (a two byte instruction) increments it twice, however the 'Memory Refresh Register' section merely says it increments after each instruction fetch. I have noticed that J80 (an emulator I checked as I'm not sure about this) only increments after the first opcode byte of an instruction.
Which is correct? I guess it is not hugely important in any case, but it would be nice to know :-) Many thanks.
Regards,
Phil Potter
The Zilog timing diagrams hold the answer to your question.
A refresh occurs during T3 and T4 of all M1 (opcode fetch) cycles.
In the case of single-opcode instructions, that's one refresh per instruction. For single-prefix instructions (prefixes are read using M1 cycles) that's two refreshes per instruction.
For those weird DD-CB-disp-opcode and FD-CB-disp-opcode type instructions (weird because the displacement byte comes before the final opcode rather than after it), the number of refreshes is at least 3 (for the two prefixes and final opcode), but I'm not sure if the displacement byte is read as part of an M1 cycle (which would trigger another refresh) or a normal memory read cycle (no refresh). I'm inclined to believe the displacement byte is read in an M1 cycle for these instructions, but I'm not sure. I asked Sean Young about this; he wasn't sure either. Does anyone know for certain?
UPDATE:
I answered my own question re those weird DD-CB-disp-opcode and FD-CB-disp-opcode instructions. If you check Zilog's documentation for these type instruction, such as
RLC (IX+d), you'll note that the instruction requires 6 M-cycles and 23 T-cycles broken down as: (4,4,3,5,4,3).
We know the first two M-cycles are M1 cycles to fetch the DD and CB prefixes (4 T-cycles each). The next M-cycle reads the displacement byte d. But that M-cycle uses only 3 T-cycles, not 4, so it can't be an M1 cycle; instead it's a normal Memory Read cycle.
Here's the breakdown of the RLC (IX+d) instruction's six M-cycles:
M1 cycle to read the 0xDD prefix (4 T-cycles)
M1 cycle to read the 0xCB prefix (4 T-cycles)
Memory Read cycle to read the displacement byte (3 T-cycles)
M1 cycle to fetch the 0x06 opcode and load IX into the ALU (5 T-cycles)
Memory Read cycle to calculate and read from address IX+d (4 T-cycles)
Memory Write cycle to calculate RLC and write the result to address IX+d (3 T-cycles)
(The RLC calculation overlaps M-cycles 5 and 6.)
These type instructions are unique in that they're the only Z80 instructions that have non-contiguous M1 cycles (M-cycles 1, 2 and 4 above). They're also the slowest!
Paul
Sean Young's Z80 Undocumented Features has a different story. Once for unprefixed, twice for a single prefix, also twice for a double prefix (DDCB only), and once for no-op prefix.
Block instructions of course affect R every time they run (and they run BC times).
I've seen a couple of comments now that these weird DDCB and FDCB instructions only increment the R register twice.
It's always been my assumption (and the way I implemented my Z80 emulator) that the R register is implemented at the end of every M1 cycle.
To recap, these weird DDCB and FDCB instructions are four bytes long:
DD CB disp opcode
FD CB disp opcode
It's clear that the two prefix opcodes are read using M1 cycles, causing the R register to be incremented at the end of each of those cycles.
It's also clear that the displacement byte that follows the CB prefix is read by a normal Read cycle, so the R register is not incremented at the end of that cycle.
That leaves the final opcode. If it's read by an M1 cycle, then either the R register is incremented at the end of the cycle, resulting in a total of 3 increments, or the Z80 special cases this M1 cycle and doesn't increment the R register.
There's another possibility. What if the final opcode is read by a normal Read cycle, like the displacement byte that preceded it, and not by an M1 cycle? That of course would also cause the R register to be incremented only twice for these instructions, and wouldn't require the Z80 to make an exception of not incrementing the R register at the end of every M1 cycle.
This might also make better sense in terms of the Z80's internal state. Once it switches to normal Read cycles to read an instruction's additional bytes (in this case the displacement byte following the CB prefix), it never switches back to M1 cycles until it starts the next instruction.
Can anyone test this on real Z80 hardware, to confirm the value of R register following one of these DDCB or FDCB instructions?
All references I can find online say that R is incremented once per instruction irrespective of its length.

Strange CUDA behavior in vector multiplication program

I'm having some trouble with a very basic CUDA program. I have a program that multiplies two vectors on the Host and on the Device and then compares them. This works without a problem. What's wrong is that I'm trying to test different number of threads and blocks for learning purposes. I have the following kernel:
__global__ void multiplyVectorsCUDA(float *a,float *b, float *c, int N){
int idx = threadIdx.x;
if (idx<N)
c[idx] = a[idx]*b[idx];
}
which I call like:
multiplyVectorsCUDA <<<nBlocks, nThreads>>> (vector_a_d,vector_b_d,vector_c_d,N);
For the moment I've fixed nBLocks to 1 so I only vary the vector size N and the number of threads nThreads. From what I understand, there will be a thread for each multiplication so N and nThreads should be equal.
The problem is the following
I first call the kernel with N=16 and nThreads<16 which doesn't work. (This is ok)
Then I call it with N=16 and nThreads=16 which works fine. (Again
works as expected)
But when I call it with N=16 and nThreads<16 it still works!
I don't understand why the last step doesn't fail like the first one. It only fails again if I restart my PC.
Has anyone run into something like this before or can explain this behavior?
Wait, so are you calling all three in a row? I don't know the rest of your code, but are you sure you're clearing out the graphics memory you alloced between each run? If not, that could explain why it doesn't work the first time but does the third time when you're passing the same values, and why it only works again after rebooting (rebooting clears all the memory alloced).
Don't know if its ok to answer my own question but I realized I had a bug in my code when comparing the host and device vectors (that part of the code wasn't posted). Sorry for the inconvenience. Could someone please close this post since it won't let me delete it?

Resources