MSP430 instruction cycles - msp430

Hi, I am working on Tmote Sky motes (MSP430 microprocessor) with Contiki OS. I want to know the number of instruction cycles used when I perform a multiplication operation in my program (in software).
Thank you,
Avijit

The msp430 is a 16-bit system, so 32-bit values are not supported directly. A 32-bit operation is typically translated to assembly code as a sequence of 16-bit ops.
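To make that concrete, here is a rough C sketch (illustrative only, not the compiler's actual output) of how a 32x32 -> 32-bit multiply can be decomposed into 16x16 -> 32-bit partial products, which is essentially what a 16-bit CPU has to do:
#include <stdint.h>

/* Illustrative decomposition of a 32-bit multiply into 16-bit multiplies.
   The high*high partial product would land above bit 31, so only three
   16x16 multiplies contribute to a 32-bit result. */
uint32_t mul32 (uint32_t a, uint32_t b)
{
  uint16_t a_lo = (uint16_t) a, a_hi = (uint16_t) (a >> 16);
  uint16_t b_lo = (uint16_t) b, b_hi = (uint16_t) (b >> 16);

  uint32_t result = (uint32_t) a_lo * b_lo;      /* low x low          */
  result += ((uint32_t) a_lo * b_hi) << 16;      /* low x high, shifted */
  result += ((uint32_t) a_hi * b_lo) << 16;      /* high x low, shifted */
  return result;
}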
The execution times of 8-bit and 16-bit operations can be found in the TI application report "The MSP430 Hardware Multiplier":
Table 4. CPU Cycles Needed With Different Multiplication Modes

Operation                                 Software loop   Hardware MPYer   Speed increase
Unsigned multiply (MPY)                   139...171       8                17.4...21.4
Unsigned multiply-and-accumulate (MAC)    137...169       8                17.1...21.1
Signed multiply (MPYS)                    145...179       8                18.1...22.4
Signed multiply-and-accumulate (MACS)     143...177       17               8.4...10.4
The HW multiplier should be active with default compilation settings, but check the generated object file with msp430-objdump to make sure.
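For example, with the mspgcc toolchain you can disassemble the generated image and check whether a multiplication turns into writes to the hardware multiplier registers or into a call to a software multiply routine (the file name below is just a placeholder):
msp430-objdump -d hello-world.sky > hello-world.lst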

You can use naken_asm by Michael Kohn to disassemble an Intel hex or ELF file, and it will calculate the cycle counts for each instruction. I've used it in the past; the cycle counter is fine for the original MSP430 CPU (such as the one in your Tmote) but not fully supported for the extended CPUX.
You can invoke it from the command line as simply as:
naken_util -disasm <infile>
where <infile> is the name of your hex or ELF file. The default processor is MSP430. You would, however, need the assembly listing from your compiler in order to match up the original code with the disassembled code, which includes the cycle counts.
Another alternative would be to use MSPDebug's tracer option which can track running software and provide an up-to-date instruction cycle count. However, I've never used it for that purpose so cannot provide an example.

Related

SIMD-8, SIMD-16 or SIMD-32 in OpenCL on GPGPU

I read a couple of questions on SO on this topic (SIMD mode), but I still need some clarification/confirmation of how things work.
Why use SIMD if we have GPGPU?
SIMD intrinsics - are they usable on gpus?
CPU SIMD vs GPU SIMD?
Are the following points correct if I compile the code in SIMD-8 mode?
1) Does it mean that 8 instructions from different work items are executed in parallel?
2) Does it mean that all work items are executing the same instruction?
3) If each work item's code contains only a vload16 load, then float16 operations, and then a vstore16 store, will SIMD-8 mode still work? That is, is it true that the GPU is still executing the same instruction (whether vload16, float16 operations, or vstore16) for all 8 work items?
How should I understand this concept?
In the past, many OpenCL vendors required the use of vector types to get SIMD execution. Nowadays, OpenCL vendors pack work items into SIMD lanes, so there is no need to use vector types. Whether vector types are preferred can be checked by querying CL_DEVICE_PREFERRED_VECTOR_WIDTH_<CHAR, SHORT, INT, LONG, FLOAT, DOUBLE>.
On Intel, if a vector type is used, the vectorizer first scalarizes it and then re-vectorizes it to make use of the wide instruction set. This is probably similar on other platforms.
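As a minimal sketch of that preferred-width query in host code (assuming device is a valid cl_device_id you obtained earlier, e.g. from clGetDeviceIDs; error handling omitted):
#include <stdio.h>
#include <CL/cl.h>

/* Print the device's preferred float vector width. A value of 1 suggests
   plain scalar code is fine; larger values suggest explicit vector types. */
void print_preferred_float_width (cl_device_id device)
{
  cl_uint width = 0;
  clGetDeviceInfo (device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT,
                   sizeof (width), &width, NULL);
  printf ("Preferred float vector width: %u\n", (unsigned) width);
}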

Optimal NEON vector structure for processing vectors of uint8_t type with Arm Cortex-A8 (32-bit)

I am doing some image processing on an embedded system (BeagleBone Black) using OpenCV and need to write some code to take advantage of NEON optimization. Specifically, I would like to write a NEON optimized thresholding function and then a NEON optimized erosion/dilation function.
This is my first time writing NEON code and I don't have experience writing assembly code, so I have been looking at examples and resources for the C-style NEON intrinsics. I believe that I can put some working code together, but I am not sure how I should structure the vectors. According to page 2 of the "ARM NEON support in the ARM compiler" white paper:
"These registers can hold "vectors" of items which are 8, 16, 32 or 64
bits. The traditional advice when optimizing or porting algorithms
written in C/C++ is to use the natural type of the machine for data
handling (in the case of ARM 32 bits). The unwanted bits can then be
discarded by casting and/or shifting before storing to memory."
What exactly does this mean? Do I need to restrict my NEON code to using uint32x4_t vectors rather than uint8x16_t? How would I go about loading the registers? Or does this mean that I need to take some special steps when using vst1q_u8 to store the data to memory?
I did find this example, which is untested but uses the uint8x16_t type. Does it adhere to the "32-bit" advice given above?
I would really appreciate it if someone could please elaborate on the above quotation and maybe provide a very simple working example.
The next sentence from the document you linked gives your answer.
The ability of NEON to specify the data width in the instruction and
hence use the whole register width for useful information means
keeping the natural type for the algorithm is both possible and
preferable.
Note, the document is distinguishing between the natural type of the machine (32-bit) and the natural type of the algorithm (in your case uint8_t).
The document is saying that in the past you would have written your code in such a way that it used 32-bit integers so that it could use the efficient machine instructions suited for 32-bit operations.
With Neon, this is not necessary. It is more useful to use the data type you actually want to use, as Neon can efficiently operate on those data types.
It will depend on your algorithm as to the optimal choice of register width (uint8x8_t or uint8x16_t).
To give a simple example of using the Neon intrinsics to add two sets of uint8_t:
#include <arm_neon.h>

/* Add two arrays of 16 uint8_t values element-wise and store the result in c. */
void
foo (uint8_t *a, uint8_t *b, uint8_t *c)
{
  uint8x16_t t1 = vld1q_u8 (a);       /* load 16 bytes from a    */
  uint8x16_t t2 = vld1q_u8 (b);       /* load 16 bytes from b    */
  uint8x16_t t3 = vaddq_u8 (t1, t2);  /* 16 parallel 8-bit adds  */
  vst1q_u8 (c, t3);                   /* store 16 bytes to c     */
}

How does a processor calculate bigger than its register value?

So far I have learned that a processor has registers; for a 32-bit processor they are 32 bits, and for a 64-bit processor they are 64 bits. So can someone explain what happens if I give the processor a value larger than its register size? How is the calculation performed?
It depends.
Assuming x86 for the sake of discussion, 64-bit integers can still be handled "natively" on a 32-bit architecture. In this case, the program often uses a pair of 32-bit registers to hold the 64-bit value. For example, the value 0xDEADBEEF2B84F00D might be stored in the EDX:EAX register pair:
eax = 0x2B84F00D
edx = 0xDEADBEEF
The CPU actually expects 64-bit numbers in this format in some cases (IDIV, for example).
Math operations are done in multiple instructions. For example, a 64-bit add on a 32-bit x86 CPU is done with an add of the lower DWORDs, and then an adc of the upper DWORDs, which takes into account the carry flag from the first addition.
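In C terms, the add/adc pair computes something like the following sketch (a hand-rolled 64-bit addition from 32-bit halves; on x86 the carry flag and the adc instruction handle the carry for you):
#include <stdint.h>

/* 64-bit addition using only 32-bit operations: add the low words,
   detect the carry out of the low word, and fold it into the sum of
   the high words. */
void add64 (uint32_t a_lo, uint32_t a_hi,
            uint32_t b_lo, uint32_t b_hi,
            uint32_t *sum_lo, uint32_t *sum_hi)
{
  *sum_lo = a_lo + b_lo;               /* like add                */
  uint32_t carry = (*sum_lo < a_lo);   /* did the low word wrap?  */
  *sum_hi = a_hi + b_hi + carry;       /* like adc                */
}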
For even bigger integers, an arbitrary-precision arithmetic (or "big int") library is used. Here, a dynamically-sized array of bytes is used to represent the integer, with additional information (like the number of bits used). GMP is a popular choice.
Mathematical operations on big integers are done iteratively, typically one native word at a time. For the gory details, I suggest you have a look through the source code of one of these open-source libraries.
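For example, multiplying two values far wider than any native register looks roughly like this with GMP (a minimal sketch; build with -lgmp, and the numbers are arbitrary):
#include <stdio.h>
#include <gmp.h>

int main (void)
{
  mpz_t a, b, product;

  /* Two integers much wider than a 32- or 64-bit register. */
  mpz_init_set_str (a, "340282366920938463463374607431768211297", 10);
  mpz_init_set_str (b, "18446744073709551629", 10);
  mpz_init (product);

  mpz_mul (product, a, b);        /* arbitrary-precision multiply */
  gmp_printf ("%Zd\n", product);

  mpz_clear (a);
  mpz_clear (b);
  mpz_clear (product);
  return 0;
}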
The key to all of this, is that numeric operations are carried out in manageable pieces, and combined to produce the final result.

Code profiling to improve performance: see CPU cycles inside mscorlib.dll?

I made a small test benchmark comparing .NET's System.Security.Cryptography AES implementation vs BouncyCastle.Org's AES.
Link to GitHub code: https://github.com/sidshetye/BouncyBench
I'm particularly interested in AES-GCM since it's a 'better' crypto algorithm and .NET is missing it. What I noticed was that while the AES implementations are very comparable between .NET and BouncyCastle, the GCM performance is quite poor (see extra background below for more). I suspect it's due to many buffer copies or something. To look deeper, I tried profiling the code (VS2012 => Analyze menu bar option => Launch performance wizard) and noticed that there was a LOT of CPU burn inside mscorlib.dll
Question: How can I figure out what's eating most of the CPU in such a case? Right now all I know is "some lines/calls in Init() burn 47% of CPU inside mscorlib.ni.dll" - but without knowing what specific lines, I don't know where to (try and) optimize. Any clues?
Extra background:
Based on the "The Galois/Counter Mode of Operation (GCM)" paper by David A. McGrew, I read "Multiplication in a binary field can use a variety of time-memory tradeoffs. It can be implemented with no key-dependent memory, in which case it will generally run several times slower than AES. Implementations that are willing to sacrifice modest amounts of memory can easily realize speeds greater than that of AES."
If you look at the results, the basic AES-CBC engine performances are very comparable. AES-GCM adds the GCM layer and reuses the AES engine beneath it in CTR mode (faster than CBC). However, on top of CTR mode, GCM also adds multiplication in the GF(2^128) field, so there could be other areas of slowdown. Anyway, that's why I tried profiling the code.
For the interested, here is my quick performance benchmark. It's run inside a Windows 8 VM and YMMV. The test is configurable, but currently it's set up to simulate the crypto overhead of encrypting many cells of a database (=> many but small plaintext inputs):
Creating initial random bytes ...
Benchmark test is : Encrypt=>Decrypt 10 bytes 100 times
Name                time (ms)  plain(bytes)  encrypted(bytes)  byte overhead
.NET ciphers
AES128                 1.5969            10                32          220 %
AES256                 1.4131            10                32          220 %
AES128-HMACSHA256      2.5834            10                64          540 %
AES256-HMACSHA256      2.6029            10                64          540 %
BouncyCastle Ciphers
AES128/CBC             1.3691            10                32          220 %
AES256/CBC             1.5798            10                32          220 %
AES128-GCM            26.5225            10                42          320 %
AES256-GCM            26.3741            10                42          320 %
R - Rerun tests
C - Change size(10) and iterations(100)
Q - Quit
This is a rather lame move from Microsoft, as they obviously broke a feature that worked well before Windows 8 but no longer does, as explained in this MSDN blog post:
On Windows 8 the profiler uses a different underlying technology than
what it does on previous versions of Windows, which is why the
behavior is different on Windows 8. With the new technology, the
profiler needs the symbol file (PDB) to know what function is
currently executing inside NGEN’d images.
(...)
It is however on our backlog to implement in the next version of Visual Studio.
The post gives directions to generate the PDB files yourself (thanks!).
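If I recall correctly, the directions boil down to generating symbols for the native image with ngen's createpdb command and pointing the profiler's symbol path at the output, along the lines of (the paths below are placeholders - see the post for the exact steps):
ngen.exe createpdb <path to mscorlib.ni.dll> <symbol output folder>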

Z80 memory refresh register

Me again with another innocuous Z80 question :-) The way my emulator core is currently structured, I am incrementing the lower 7 bits of the memory refresh register every time an opcode byte is fetched from memory - this means that for multi-byte instructions, such as those that begin with DD or FD, I am incrementing the register twice - or, in the case of an instruction such as RLC (IX+d), three times (as it is laid out opcode1-opcode2-d-opcode3).
Is this correct? I am unsure - the Z80 manual is a little unclear on this, as it says that CPDR (a two-byte instruction) increments it twice; however, the 'Memory Refresh Register' section merely says it increments after each instruction fetch. I have noticed that J80 (an emulator I checked, as I'm not sure about this) only increments after the first opcode byte of an instruction.
Which is correct? I guess it is not hugely important in any case, but it would be nice to know :-) Many thanks.
Regards,
Phil Potter
The Zilog timing diagrams hold the answer to your question.
A refresh occurs during T3 and T4 of all M1 (opcode fetch) cycles.
In the case of single-opcode instructions, that's one refresh per instruction. For single-prefix instructions (prefixes are read using M1 cycles) that's two refreshes per instruction.
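In emulator terms that usually boils down to a small helper like this sketch (names are illustrative), called once at the end of every M1 fetch, so that only the low 7 bits count up and bit 7 is preserved:
#include <stdint.h>

static uint8_t r_reg;   /* the Z80 R register */

/* Call once at the end of every M1 (opcode/prefix fetch) cycle. */
static void refresh_tick (void)
{
  r_reg = (uint8_t) ((r_reg & 0x80) | ((r_reg + 1) & 0x7f));
}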
For those weird DD-CB-disp-opcode and FD-CB-disp-opcode type instructions (weird because the displacement byte comes before the final opcode rather than after it), the number of refreshes is at least 3 (for the two prefixes and final opcode), but I'm not sure if the displacement byte is read as part of an M1 cycle (which would trigger another refresh) or a normal memory read cycle (no refresh). I'm inclined to believe the displacement byte is read in an M1 cycle for these instructions, but I'm not sure. I asked Sean Young about this; he wasn't sure either. Does anyone know for certain?
UPDATE:
I answered my own question re those weird DD-CB-disp-opcode and FD-CB-disp-opcode instructions. If you check Zilog's documentation for this type of instruction, such as RLC (IX+d), you'll note that the instruction requires 6 M-cycles and 23 T-cycles, broken down as (4,4,3,5,4,3).
We know the first two M-cycles are M1 cycles to fetch the DD and CB prefixes (4 T-cycles each). The next M-cycle reads the displacement byte d. But that M-cycle uses only 3 T-cycles, not 4, so it can't be an M1 cycle; instead it's a normal Memory Read cycle.
Here's the breakdown of the RLC (IX+d) instruction's six M-cycles:
M1 cycle to read the 0xDD prefix (4 T-cycles)
M1 cycle to read the 0xCB prefix (4 T-cycles)
Memory Read cycle to read the displacement byte (3 T-cycles)
M1 cycle to fetch the 0x06 opcode and load IX into the ALU (5 T-cycles)
Memory Read cycle to calculate and read from address IX+d (4 T-cycles)
Memory Write cycle to calculate RLC and write the result to address IX+d (3 T-cycles)
(The RLC calculation overlaps M-cycles 5 and 6.)
These type instructions are unique in that they're the only Z80 instructions that have non-contiguous M1 cycles (M-cycles 1, 2 and 4 above). They're also the slowest!
Paul
Sean Young's Z80 Undocumented Features tells a different story: R is incremented once for unprefixed instructions, twice for a single prefix, also twice for a double prefix (DDCB only), and once for a no-op prefix.
Block instructions of course affect R every time they run (and they run BC times).
I've seen a couple of comments now that these weird DDCB and FDCB instructions only increment the R register twice.
It's always been my assumption (and the way I implemented my Z80 emulator) that the R register is incremented at the end of every M1 cycle.
To recap, these weird DDCB and FDCB instructions are four bytes long:
DD CB disp opcode
FD CB disp opcode
It's clear that the two prefix opcodes are read using M1 cycles, causing the R register to be incremented at the end of each of those cycles.
It's also clear that the displacement byte that follows the CB prefix is read by a normal Read cycle, so the R register is not incremented at the end of that cycle.
That leaves the final opcode. If it's read by an M1 cycle, then either the R register is incremented at the end of the cycle, resulting in a total of 3 increments, or the Z80 special cases this M1 cycle and doesn't increment the R register.
There's another possibility. What if the final opcode is read by a normal Read cycle, like the displacement byte that preceded it, and not by an M1 cycle? That of course would also cause the R register to be incremented only twice for these instructions, and wouldn't require the Z80 to make an exception of not incrementing the R register at the end of every M1 cycle.
This might also make better sense in terms of the Z80's internal state. Once it switches to normal Read cycles to read an instruction's additional bytes (in this case the displacement byte following the CB prefix), it never switches back to M1 cycles until it starts the next instruction.
Can anyone test this on real Z80 hardware, to confirm the value of R register following one of these DDCB or FDCB instructions?
All references I can find online say that R is incremented once per instruction irrespective of its length.
