Why Motorola 68k's 32-bit general-purpose registers are divided into data registers and address registers? - motorola

The 68k registers are divided into two groups of eight. Eight data registers (D0 to D7) and eight address registers (A0 to A7). What is the purpose of this separation, would not be better if united?

The short answer is, this separation comes from the architecture limitations and design decisions made at the time.
The long answer:
The M68K implements quite a lot of addressing modes (especially when compared with the RISC-based processors), with many of its instructions supporting most (if not all) of them. This gives a large variety of addressing modes combinations within every instruction.
This also adds a complexity in terms of opcode execution. Take the following example:
move.l $10(pc), -$20(a0,d0.l)
The instruction is just to copy a long-word from one location to another, simple enough. But in order to actually perform the operation, the processor needs to figure out the actual (raw) memory addresses to work with for both source and destination operands. This process, in which operands addressing modes are decoded (resolved), is called the effective address calculation.
For this example:
In order to calculate the source effective address - $10(pc),
the processor loads the value of PC (program) counter register
and adds $10 to it.
In order to calculate the destination effective address -
-$20(a0,d0.l), the processor loads the value of A0 register, adds the value of D0 register to it, then subtracts
$20.
This is quite a lot of calculations of a single opcode, isn't it?
But the M68K is quite fast in performing these calculations. In order to calculate effective addresses quickly, it implements a dedicated Address Unit (AU).
As a general rule, operations on data registers are handled by the ALU (Arithmetic Logical Unit) and operations involving address calculations are handled by the AU (Address Unit).
The AU is well optimized for 32-bit address operations: it performs 32-bit subtraction/addition within one bus cycle (4 CPU ticks), which ALU doesn't (it takes 2 bus cycles for 32-bit operations).
However, the AU is limited to just load and basic addition/subtraction operations (as dictated by the addressing modes), and it's not connected to the CCR (Conditional Codes Register), which is why operations on address registers never update flags.
That said, the AU should've been there to optimize calculation of complex addressing modes, but it just couldn't replace the ALU completely (after all, there were only about 68K transistors in the M68K), hence there are two registers set (data and address registers) each having their own dedicated unit.

So this is just based on a quick lookup, but using 16 registers is obviously easier to program. The problem could be that you would then have to make instructions for each of the 16 registers. Which would double the number of opcodes needed. Using half for each purpose is not ideal but gives access to more registers in general.

Related

The elegant way to handle ADCs with DMA in a RTOS

I'm currently setting up an AZURE RTOS (ThreadX on STM32), with Ethernet, SPI and ADCs activated.
This STM32 has to pass-through configuration information from time to time, coming from my PC over the Ethernet-Port.
It has to pass these information via SPI to two other STM32, which makes the first STM32 the system-controller / system-interface. This will be a low-priority task, since the activation of the passed configuration will be started by sync-lines, running from the system-controller to the two other STMs.
While doing so, the system-controller has to read-in ADC values constantly and pass them via Ethernet / TCP to my computer.
I've used the ThreadX TCP server example, as given by STM, as a starting point.
From there I've managed to set up three servers on three ports, communicating sucessfully with a python script on my PC (as a first test).
Now come the two great questions:
1)
Since my input signal may contain frequencies up to 2.5 MHz, I want to digitize this signal with the full 5 MSPS (Nyquist), which ADC3 is capable of.
The smallest internally available data-type at full resolution is uint16_t, which makes the data rate work out to be R = 16 * 5 MSPS = 80 MBit/s (worst-case, I bet, there is optimization possible ... e.g. 8 bits resolution, which halves the data-rate ... but this resolution might not be enough ... or 16 bits, and FFT afterwards, which is also sufficient, since I'm mostly interested in energy per frequency band, but initially I wanted to do this on my computer, for best flexibility).
Even if the Ethernet-IF is capable of doing 100MBit/s, the TCP layer of NetXDuo, I bet, is not.
(There is also USB OTG on this board available, but since networked devices are in my opinion more versatile, I prefer using Ethernet ... nevertheless, USB might be a backup solution)
From my measurements, a data-stream transmitted to the uC via TC from within python, and mirrored back within a thread to my PC allows for relatively consistent 20 MBit/s.
... How do I push this speed to a better level?
(I think 20MBit/s is the back-and-forth data-rate, so one-way may be faster)
However. Second question:
2)
The ADC within the STM is capable of storing data via DMA to memory.
There are two callbacks available, one at half-full, one at full buffer state.
My problem is mostly about the way of reading out the DMA and/or triggering the conversion in the first place.
How do you do this the "right" way on a RTOS (such that you don't brake the RT in RTOS)?
I see some options here, what are the pros/cons you can think of?
a) Let the ADC run freely, calling the call-backs at the respective fill-levels, triggering a TCP-transmission whenever one of the call-backs is reached
-> may lead to glitches due to insufficient speed of the TCP layer in my opinion.
b) Let the ADC conversion be triggered by a thread, which is preempted and will later TCP-transmit the data, as soon as the memory-buffer is full
-> may lead to inconsistency in the converted values, since you get burst-style conversions, with gaps in between, while the buffer is read
c) Let a thread trigger each conversion individually
-> A no-go I think, since threads are not triggered that often, to get a decent sample-frequency
d) Let a free-running ADC trigger callbacks, let a thread do the FFT, transmit within another thread the data via TCP
-> May work, but is less flexible, since the data gets crunched within the uC.
--> Are there other ways you can think of / what do you think about the ways I named here?
--> What do you think about question 1)?
Have a nice day!

what are the advantages of implementing register in microcontroller architectures i.e. load store architecture

Major difference in RISC and CISC is that in RISC we must need to use registers to do any arithmetic or logic operation. But in case of CISC we can do such operation directly with memory locations. So what is the advantage of implementing register banking in micro controller architectures? Question is not for the advantage of RISC but the question is for what is need of register in RISC architecture. As in other architecture CISC operation can be done directly with meomery location we don't need to take it in register and then again move into the memory location. Below is the example:
CISC: MUL A,B
RISC:
LDA R0,A
LDA R1,B
MUL R0,R1
STR A,R0
So in above example what is the advantage of using R0 and R1 ie. registers. what is the advantage of load store architecture?
Register banking is something else, I assume you are simply asking about using a register directly or not. Well the memory access takes an eternity, even if cached. Several to hundreds of clock cycles for each of the operands where in RISC if you are assuming a pure register based scheme which not all are, the lines are getting fuzzy. With CISC if microcoded it is going to registers anyway, then the operation is happening, if not microcoded then it still gets latched into internal temporary storage (registers) and then the operation can begin. With risc you have a couple-three extra, simpler, instructions the latching to registers takes the same amount of time as it does in CISC. Now if the algorithm never uses that result or does not use it for a while, it might be a win for CISC (if not microcoded) but if the value is an intermediate value in an algorithm then a clear win for RISC. Even if everything is cached it is a half a dozen to dozen clock cycles to get each parameter and write it back, any cache misses and it is an eternity. Same for RISC but with more registers, and significantly faster access to those registers, zero or one clock for each value and to store back, for some percentage if not the whole algorithm.
As with any benchmarking it is trivial to show a RISC winning case and to show a CISC winning case.
The major difference between RISC and CISC is CISC are complicated time consuming instructions where RISC they are much simpler, you arrange the tasks you need to do and have tighter control over those tasks, you dont have a lot of waste per step. One could argue caches were created to deal with the inefficiencies of CISC or at least one popular one. Both benefit sure, but one relies on the other doesnt as much. Trivial to show CISC winning code and trivial to show RISC winning code. Same goes for VLIW, and others.
RISC designs are simpler, smaller, pipes can be shorter, compiler has more control over the performance, etc. So with microcontrollers you can have a very nice processor core with a 3 stage pipeline that is really low power and still quite efficient. The 6502, z80, 8051, etc have really died off for the most part, you still do see a lot of 8051s if you are looking, the desktop/laptop you might be reading this with probably has one 8051, but that is due to royalties and not because of its size or performance, you probably have several to dozens of ARM cores for every x86, within the same box or certainly around the house. A CISC is going to be relatively massive and inefficient, it might be possible to get the power consumption down to RISC levels, that may just be a matter of design and not CISC vs RISC, but the RISC implementations are doing a much better job at watts per mhz than the CISC implementations.
Using registers can simplify the operand fetching logic of functional unis. With CISC functional units should be able to fetch data from memory. With RISC, all the functional units will operate on registers as it is guaranteed that the data will be there, so less complicated.
Also, think of a case where you have multiple MUL operations some uses data at location A, some use B, shown below.
'MUL A, B'
'MUL C, B'
When you perform the operation in CISC, you will be reading B, twice. But in RISC, you load it to a register once, and can use multiple times. So less memory (cache) accesses.
Also think of number of bits needed to represent that MUL in CISC. As A, B, C can be memory locations, they could be anywhere within your address spaces. On the other hand with registers in RISC, bits needed to represent your operands are less, hence less complicated instruction set.
As from above responses, we can conclude that the using registers instead of direct memory location gives the benefit in efficiency in terms of clock cycle and so the power consumption. They also give the benefit in term of complexity of instructions.

Best way to shuffle 64-bit portions of two __m128i's

I have two __m128is, a and b, that I want to shuffle so that the upper 64 bits of a fall in the lower 64 bits of dst and the lower 64 bits of b fall in the upper 64 of dst. i.e.
dst[ 0:63] = a[64:127]
dst[64:127] = b[0:63]
Equivalent to:
__m128i dst = _mm_unpacklo_epi64(_mm_srli_si128i(a, 8), b);
or
__m128i dst = _mm_castpd_si128(mm_shuffle_pd(_mm_castsi128_pd(a),_mm_castsi128_pd(b),1));
Is there a better way to do this than the first method? The second one is just one instruction, but the switch to the floating point SIMD execution is more costly than the extra instruction from the first.
Latency isn't always the worst thing ever. If it's not part of a loop-carried dep-chain, then just use the single instruction.
Also, there might not be any! Agner Fog's microarch doc says he found no extra latency in some cases when using the "wrong" type of shuffle or boolean, on Sandybridge. Blends still have the extra latency. On Haswell, he says there are no extra delays at all for mixing types of shuffle. (pg 140, Data Bypass Delays.)
So go ahead and use shufps, unless you care a lot about your code being fast on Nehalem. (Previous designs (merom/conroe, and Penryn) didn't have extra bypass delays for using the wrong move or shuffle.)
For AMD, shufps runs in the ivec domain, same as integer shuffles, so it's fine to use it. Like Intel, FP blends run in the FP domain, and thus have no bypass delay for FP data.
If you include multiple asm versions depending on which instruction sets are supported, without going completely nuts about having the optimal version of everything for every CPU like x264 does, you might use wrong-type ops in your version for AVX CPUs, but use multiple instructions in your non-AVX version. Nehalem has large penalties (2 cycle bypass delays for each domain transition), while Sandybridge is 0 or 1 cycle. SnB is the first generation with AVX.
Pre-Nehalem (no SSE4.2) is so old that it's probably not worth tuning a version specifically for it, even though it doesn't have any penalties for "wrong type" shuffles. Nehalem is right on the cusp of being kinda slow, so software running on those systems will have the hardest time operating in real-time, or not feeling slow. Thus, being bad on Nehalem would add to a bad user experience since their system is already not the fastest.

Computer with no registers

I wasn't able to find an answer to my question anywhere on the web, so I thought stackoverflow would be my best bet! My question simply is, is it possible to establish a computer with no registers? I know registers are temp. data holders and provice the fastest way possible to access data, but what are the consequences to the inexistence of registers in a computer, besides making data transmission a lot slower?
No. You can have a model of computation that doesn't involve registers. In fact, most theoretical models don't.
But as for a CPU, which is an electrical circuit, any kind of persistent state is implemented by a flip-flop, a.k.a. a register. There is no way to feed data into the circuits that perform processing without connecting a register to their inputs.
As for practical models of computation, i.e. instruction set architectures, you can avoid the terminology of calling anything a "register" but you inevitably need to define some means of data storage upon which operations act. Even if you don't, programmers will end up calling such storage locations as registers. Some old machines used the first page of RAM as primary scratch space, therefore the first 256 bytes were dubbed "registers," even if they were DRAM in the electronic sense. (Memory fails me; this could have been before DRAM. There is no difference between SRAM and what is physically called a register.)

Why are 32-bit registers divided into 4 parts?

I'm learning about registers. It looks like 32-bit registers are divided up so that they can be accessed as 8-bit registers. This looks very inefficient. Performance would be improved if they didn't do this. So why do they do it?
Also, it costs extra money to design them like this. Why not make the CPU cheaper by not doing it?
Because if you're only dealing with 8bit values, it'd be inefficient to have issue all the bitmasks to limit those 32/64bit register to just the 8bits you're working on.
So, x86 registers have
AH/AL = high/low 8bits of a 16bit register
AX = whole 16bit register
EAX = whole 32bit register
It's far more efficient, in terms of instruction size to have
mov ah, 0xXX (2 bytes)
rather than forcing
mov ax, 0x00XX (3 bytes)
mov eax, 0x000000XX (7 bytes)
As for "designing the cpu to make it cheaper" - it's for backwards compatibility. All modern x86 processors are actually internally a RISC design, with a major chunk of silicon dedicated to taking the x86 instructions coming in and converting them into the CPU's own internal micro-ops (which is basically a RISC instruction set).
The Intel 8080, which was the first "mainstream" microprocessor, had seven main 8-bit registers (A, B, C, D, E, H, and L). Because memory addresses were 16 bits, instructions that needed to use a non-constant memory operand would use a pair of registers (most commonly H and L, but sometimes B and C, or D and E) to form the address. Because the registers in the aforementioned pairs were often used together to represent 16-bit values, there were a few instructions which could operate upon the register pairs as 16-bit quantities. An instruction to add BC to HL would perform the addition by adding C to L, and then by adding B to H (plus a carry if needed). I'm not familiar enough with the 4004 or 8008 (the two predecessors of the 8080) to know if either of them did anything similar in its architecture.
When Intel produced the 8088, they included a full 16-bit arithmetic unit, but they wanted code which was written for the 8080 to be easily convertible to their new architecture. On the 8080, a lot of code had been written to "manually" form addresses out of the 8-bit parts, since doing so was often much faster than using the 16-bit instructions to do the math. For example, if one needed to access some specified table of 256 entries with an index stored in A, one could have done something like (Zilog notation show, but the 8080 had the same instructions):
ld hl,(baseOfTable) ; 16-bit address
ld c,a
ld b,#0
add hl,bc
ld a,(hl)
but if one could make certain the table was aligned on a 256-byte boundary, one could simplify the code considerably:
ld l,a
ld a,(tableBaseMSB) ; Just load the MSB--assume the LSB is zero
ld h,a
ld a,(hl)
With the 8088 instruction set, it wouldn't terribly often be useful for code written "from scratch" to access the upper and lower parts of registers separately, but there was a lot of code written for the 8080 which used such techniques, and Intel wanted to make it easy for people to convert such code for use on the 8088. Allowing registers to be built from 8-bit pieces was helpful in that regard.
Incidentally, there was another advantage to Intel's architecture: since it included four 16-bit only registers and four registers which could be used as either one 16-bit or two 8-bit parts, that made it possible for code to hold 12 values in registers if eight of them were 255 or less, or eleven values if six of them were 256 or less, etc. When using architectures with more registers, eking out an extra register here and there isn't quite so important, but on the 8088 it was often very helpful.
The ability to address portions of the registers has no effect on their performance when used as 32-bit registers. In that case, this capability just isn't used.
CPUs, regardless of their native bit size, need to manipulate 8-bit values very, very often. Strings of text, for example, are frequently manipulated as consecutive 8-bit values. International character sets are often manipulated as sets of consecutive 16-bit values. So being able to operate rapidly on 8-bit and 16-bit values is of tremendous importance.
If you're asking as a practical matter for x86 CPUs, it's too late. The very first PC CPUs didn't even have 32-bit registers, and compatibility has been retained all the way through.
Backwards compatibility. Processor manufacturers did not wanted to break compatibility with old software. This is the main reason why x86_64 processors still support 16bit software(virtual mode). If you look closely you'll see that majority of the features in x86 architecture are shaped by compatibility concerns. I'm no hating.

Resources