I have done a calculation using SSE to improve the performance of my code, of which I include a minimal working example. I have included comments and the compilation line to make it as clear as possible, please ask if you need any clarification.
I am trying to sum N bits, bit[0], ..., bit[N-1], and write the result in binary in a vector result[0], ..., result[bits_N-1], where bits_N is the number of bits needed to write N in binary. This sum is performed bit-by-bit: each bit[i] is an unsigned long long int, and into its j-th bit is stored either 0 or 1. As a result, I make 64 sums, each of N bit, in parallel.
In lines 80-105 I make this sum by using 64-bit arithmetic.
In lines 107-134 I do it by using SSE: I store the first half of the sum bit[0], ...., bit[N/2-1] in the first 64 bits of _m128i objects BIT[0], ..., BIT[N/2-1], respectively. Similarly, I store bit[N/2], ...., bit[N-1] in the last 64 bits of BIT[0], ..., BIT[N/2-1], respectively, and sum all the BITs. So far everything works fine, and the 128-bit sum takes the same time as the 64-bit one. However, to collect the final result I need to sum the two halves to each other, see lines 125-132. This takes a long time, and makes me lose the gain obtained with SSE.
I am running this on an Intel(R) i7-4980HQ CPU # 2.80GHz with gcc 7.2.0.
Do you know a way around this?
The low part can be trivially saved with movq instruction or either _mm_storel_epi64 (__m128i* mem_addr, __m128i a); intrinsic storing to memory, or _mm_cvtsi128_si64 storing to register.
There is also a _mm_storeh_pd counterpart, which requires cast to pd and may cause a stall due to mixing floating points and integers.
The top part can be of course moved to low part with _mm_shuffle_epi(src, 0x4e) and then saved with movq.
Currently I have an AT89C2051 microcontroller hooked up to an ISD soundchip through a multiplexer-demultiplexer setup. I have other things too but my focus is making sound execute as fast as possible. Currently the speed of the chip is 3.6Mhz since another microcontroller is driving this microcontroller.
Based on documentation and experimentation, The sound chip requires 7 bytes to be sent to it in order for me to make it play sound between any two ranges of memory. The part that takes the time is transmitting the seven bytes.
This is the code I have so far that works:
FLUSH bit P3.7 ;Low=enable data reception
ENXMIT bit P3.5 ;High=Enable data transmission
GLOBALCLK bit P3.1 ;TXD: clock (connects to soundcard clock)
GLOBALDAT bit P3.0 ;RXD: I/O data line (connects to MISO and MOSI)
C_SND2 = address of soundcard 2
C_SND = address of soundcard 1
O_SND:
setb FLUSH ;disable reception
clr ENXMIT ;disable transmission
mov R7,A ;Parameter in: Accumulator = # bytes to transfer out.
mov A,#C_SND2 ;A=address of soundcard 2
mov R6,#C_SND ;R6=address of soundcard 1
jnb SS,nc1 ;Parameter in: SS = soundcard to use.
xch A,R6 ;Switch A + R6 if other soundcard is wanted.
nc1:
;NOTE: soundcard Slave select lines are connected together through an inverter.
mov P1,R6 ;Enable wrong soundcard (to disable the correct one)
mov R0,#BUFOUT ;Set data space pointer
mov P1,A ;Now enable only the correct soundcard
setb ENXMIT ;Enable data transmission
tx2:
mov A,#R0 ;Load a byte from our data space
;This fragment executes 8x but I only showed it one time here.
;I avoided loops. DJNZ requires two clock cycles (7uS) to process command.
clr FLUSH ;Enable data input **
setb GLOBALDAT ;Set data to high impedance so input can be captured **
clr GLOBALCLK ;Lower clock line to accept bit input **
mov C,GLOBALDAT ;Get incoming bit
setb FLUSH ;Disable data input
rrc A ;store incoming bit and load next output bit
mov GLOBALDAT,C ;set data line to bit
setb GLOBALCLK ;raise clock so soundcard accepts bit
;end of repeating fragment
mov #R0,A ;save what soundcard sent us to our data space
inc R0 ;increment pointer
djnz R7,tx2 ;Keep going until all bytes are processed
clr ENXMIT ;Disable further transmissions
setb GLOBALDAT ;Set data line to high
mov P1,R6 ;reset the SS line to tell soundcard we're done.
;Save audio statuses to RAM
mov AUDSTATL,BUFOUT
mov AUDSTAT,BUFOUT+2
ret
As you can see, the data line (RXD) from the microcontroller is shared across every data line in the system through multiplexers/demultiplexers. This means that I need to make the line only unidirectional (not bi-directional) by enabling reception and transmitting nothing when I want to receive data.
I called the receive enable "FLUSH" because it also flushes other output lines which are out of the scope of this question.
Now what I want to try to do is make this code fragment execute much faster.
So I'm looking at these lines:
clr FLUSH ;Enable data input **
setb GLOBALDAT ;Set data to high impedance so input can be captured **
clr GLOBALCLK ;Lower clock line to accept bit input **
and thought instead of consecutive clear and setb statements on individual pins on the same port, I could use ANL or ORL but then if I did it direct on the port, the result might not update correctly due to the behaviour of the 8051.
Is there any other way I can modify the repetitive code to make the thing run faster?
I already did save at least 380 microseconds (6.5 microseconds per removal of DJNZ multiplied by the usage of it 8 times for a byte + 1 to load counter variable for DJNZ + other commands in loop then multiplied by bytes to process command (7 bytes))
But I want to save more than that.
Any ideas?
Except that I don't plan to remove the outer loop because doing that will increase the need for rom space substantially more and I don't have too much free rom space left.
It is possible to use two different port pins for FLUSH and ENXMIT. By doing so, you can go for ANL or ORL on the port directly.
If I have some pointer or pointer-like values packed into an SSE or AVX register, is there any particularly efficient way to dereference them, into another such register? ("Particularly efficient" meaning "more efficient than just using memory for the values".) Is there any way to dereference them all without writing an intermediate copy of the register out to memory?
Edit for clarification: that means, assuming 32-bit pointers and SSE, to index into four arbitrary memory areas at once with the four sections of an XMM register and return four results at once to another register. Or as close to "at once" as possible. (/edit)
Edit2: thanks to PaulR's answer I guess the terminology I'm looking for is "gather", and the question therefore is "what's the best way to implement gather for systems pre-AVX2?".
I assume there isn't an instruction for this since ...well, one doesn't appear to exist as far as I can tell and anyway it doesn't seem to be what SSE is designed for at all.
("Pointer-like value" meaning something like an integer index into an array pretending to be the heap; mechanically very different but conceptually the same thing. If, say, one wanted to use 32-bit or even 16-bit values regardless of the native pointer size, to fit more values in a register.)
Two possible reason I can think of why one might want to do this:
thought it might be interesting to explore using the SSE registers for general-purpose... stuff, perhaps to have four identical 'threads' processing potentially completely unrelated/non-contiguous data, slicing through the registers "vertically" rather than "horizontally" (i.e. instead of the way they were designed to be used).
to build something like romcc if for some reason (probably not a good one), one didn't want to write anything to memory, and therefore would need more register storage.
This might sound like an XY problem, but it isn't, it's just curiosity/stupidity. I'll go looking for nails once I have my hammer.
The question is not entirely clear, but if you want to dereference vector register elements then the only instructions which might help you here are AVX2's gathered loads, e.g. _mm256_i32gather_epi32 et al. See the AVX2 section of the Intel Intrinsics Guide.
SYNOPSIS
__m256i _mm256_i32gather_epi32 (int const* base_addr, __m256i vindex, const int scale)
#include "immintrin.h"
Instruction: vpgatherdd ymm, vm32x, ymm
CPUID Flag : AVX2
DESCRIPTION
Gather 32-bit integers from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.
OPERATION
FOR j := 0 to 7
i := j*32
dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale]
ENDFOR
dst[MAX:256] := 0
So if I understood this correctly, your title is misleading and you really want to:
index into the concatenation of all XMM registers
with an index held in a part of an XMM register
Right?
That's hard. And a little weird, but I'm OK with that.
Assuming crazy tricks are allowed, I propose self-modifying code: (not tested)
pextrb eax, xmm?, ? // question marks are the position of the pointer
mov edx, eax
shr eax, 1
and eax, 0x38
add eax, 0xC0 // C0 makes "hack" put its result in eax
mov [hack+4], al // xmm{al}
and edx, 15
mov [hack+5], dl // byte [dl] of xmm reg
call hack
pinsrb xmm?, eax, ? // put value back somewhere
...
hack:
db 66 0F 3A 14 00 00 // pextrb ?, ? ,?
ret
As far as I know, you can't do that with full ymm registers (yet?). With some more effort, you could extend it to xmm8-xmm15. It's easily adjustable to other "pointer" sizes and other element sizes.
I am really having hard time understanding the difference. Some say they are same, while others say there is a slight difference. What's the difference, exactly? I would like it if you explained with some analogy.
Bits per second is straightforward. It is exactly what it sounds like. If I have 1000 bits and am sending them at 1000 bps, it will take exactly one second to transmit them.
Baud is symbols per second. If these symbols — the indivisible elements of your data encoding — are not bits, the baud rate will be lower than the bit rate by the factor of bits per symbol. That is, if there are 4 bits per symbol, the baud rate will be ¼ that of the bit rate.
This confusion arose because the early analog telephone modems weren't very complicated, so bps was equal to baud. That is, each symbol encoded one bit. Later, to make modems faster, communications engineers invented increasingly clever ways to send more bits per symbol.¹
Analogy
System 1, bits: Imagine a communication system with a telescope on the near side of a valley and a guy on the far side holding up one hand or the other. Call his left hand "0" and his right hand "1," and you have a system for communicating one binary digit — one bit — at a time.
System 2, baud: Now imagine that the guy on the far side of the valley is holding up playing cards instead of his bare hands. He is using a subset of the cards, ace through 8 in each suit, for a total of 32 cards. Each card — each symbol — encodes 5 bits: 00000 through 11111 in binary.²
Analysis
The System 2 guy can convey 5 bits of information per card in the same time it takes the System 1 guy to convey one bit by revealing one of his bare hands.
You see how the analogy seems to break down: finding a particular card in a deck and showing it takes longer than simply deciding to show your left or right hand. But, that just provides an opportunity to extend the analogy profitably.
A communications system with many bits per symbol faces a similar difficulty, because the encoding schemes required to send multiple bits per symbol are much more complicated than those that send only one bit at a time. To extend the analogy, then, the guy showing playing cards could have several people behind him sharing the work of finding the next card in the deck, handing him cards as fast as he can show them. The helpers are analogous to the more powerful processors required to produce the many-bits-per-baud encoding schemes.
That is to say, by using more processing power, System 2 can send data 5 times faster than the more primitive System 1.
Historical Vignette
What shall we do with our 5-bit code? It seems natural to an English speaker to use 26 of the 32 available code points for the English alphabet. We can use the remaining 6 code points for a space character and a small set of control codes and symbols.
Or, we could just use Baudot code, a 5-bit code invented by Émile Baudot, after whom the unit "baud" was coined.³
Footnotes and Digressions:
For example, the V.34 standard defined a 3,429 baud mode at 8.4 bits per symbol to achieve 28.8 kbit/sec throughput.
That standard only talks about the POTS side of the modem. The RS-232 side remains a 1 bit per symbol system, so you could also correctly call it a 28.8k baud modem. Confusing, but technically correct.
I've purposely kept things simple here.
One thing you might think about is whether the absence of a playing card conveys information. If it does, that implies the existence of some clock or latch signal, so that you can tell the information-carrying absence of a card from the gap between the display of two cards.
Also, what do you do with the cards left over in a poker deck, 9 through King, and the Jokers? One idea would be to use them as special flags to carry metadata. For example, you'll need a way to indicate a short trailing block. If you need to send 128 bits of information, you're going to need to show 26 cards. The first 25 cards convey 5×25=125 bits, with the 26th card conveying the trailing 3 bits. You need some way to signal that the last two bits in the symbol should be disregarded.
This is why the early analog telephone modems were specified in terms of baud instead of bps: communications engineers had been using that terminology since the telegraph days. They weren't trying to confuse bps and baud; it was simply a fact, in their minds, that these modems were transmitting one bit per symbol.
I don't understand why everyone is making this complicated (answers).
I'll just leave this here.
So above would be:
Signal Unit: 4 bits
Baud Rate [Signal Units per second]: 1000 Bd (baud)
Bit Rate [Baud Rate * Signal Unit]: 4000 bps (bits per second)
Bit rate and Baud rate, these two terms are often used in data
communication. Bit rate is simply the number of bits (i.e., 0’s and
1’s) transmitted per unit time. While Baud rate is the number of
signal units transmitted per unit time that is needed to represent
those bits.
Bit rate:-
Bit rate is nothing but number of bits transmitted per second.For example if Bit rate is 1000bps then 1000 bits are i.e. 0s or 1s transmitted per second.
Baud rate:-
It means number of time signal changes its state.When the signal is binary then baud rate and bit rate are same.
According to What’s The Difference Between Bit Rate And Baud Rate?:
Bit Rate
The speed of the data is expressed in bits per second (bits/s or bps).
The data rate R is a function of the duration of the bit or bit time
(TB) (Fig. 1, again):
R = 1/TB
Rate is also called channel capacity C. If the bit time is 10 ns, the
data rate equals:
R = 1/10 x 10–9 = 100 million bits/s
This is usually expressed as 100 Mbits/s.
Baud Rate
The term “baud” originates from the French engineer Emile Baudot, who
invented the 5-bit teletype code. Baud rate refers to the number of
signal or symbol changes that occur per second. A symbol is one of
several voltage, frequency, or phase changes.
NRZ binary has two symbols, one for each bit 0 or 1, that represent
voltage levels. In this case, the baud or symbol rate is the same as
the bit rate. However, it’s possible to have more than two symbols per
transmission interval, whereby each symbol represents multiple bits.
With more than two symbols, data is transmitted using modulation
techniques.
When the transmission medium can’t handle the baseband data,
modulation enters the picture. Of course, this is true of wireless.
Baseband binary signals can’t be transmitted directly; rather, the
data is modulated on to a radio carrier for transmission. Some cable
connections even use modulation to increase the data rate, which is
referred to as “broadband transmission.”
By using multiple symbols, multiple bits can be transmitted per
symbol. For example, if the symbol rate is 4800 baud and each symbol
represents two bits, that translates into an overall bit rate of 9600
bits/s. Normally the number of symbols is some power of two. If N is
the number of bits per symbol, then the number of required symbols is
S = 2^N. Thus, the gross bit rate is:
R = baud rate x log2S = baud rate x 3.32 log10S
If the baud rate is 4800 and there are two bits per symbol, the number
of symbols is 2^2 = 4. The bit rate is:
R = 4800 x 3.32 log(4) = 4800 x 2 = 9600 bits/s
If there’s only one bit per symbol, as is the case with binary NRZ,
the bit and baud rates remain the same.
First something I think necessary to know:
It is symbol that is transferred on a physical channel. Not bit. Symbol is the physical signals that is transferred over the physical medium to convey the data bits. A symbol can be one of several voltage, frequency, or phase changes. Symbol is decided by the physical nature of the medium. While bit is a logical concept.
If you want to transfer data bits, you must do it by sending symbols over the medium. Baud rate describes how fast symbols change over a medium. I.e. it describes the rate of physical state changes over the medium.
If we use only 2 symbols to transfer binary data, which means one symbol for 0 and another symbol for 1, that will lead to baud rate = bit rate. And this is how it works in the old days.
If we are lucky enough to find a way to encode more bits into a symbol, we can achieve higher bit rate with the same baud rate. And this is when the baud rate < bit rate. This doesn't mean the transfer speed is slowed down. It actually means the transfer efficiency/speed is increased.
And the communicating parties have to agree on how bits are represented by each physical symbol. This is where the modulation protocols come in.
But the ability of sending multiple bits per symbol doesn't come free. The transmitter and receiver will be complex depending on the modulation methods. And more processing power is required.
Finally, I'd like to make an analogy:
Suppose I stand on the roof of my house and you stand on your roof. There's a rope between you and me. I want to send some apples to you through a basket down the rope.
The basket is the symbol. The apple is the data bits.
If the basket is small (a physical limitation of the symbol), I may only send one apple per basket. This is when baud/basket rate = bit/apple rate.
If the basket is big, I can send more apples per basket. This is when baud rate < bit rate. I can send all the apples with less baskets. But it takes me more effort (processing power) to put more apples into the basket than put just one apple. If the basket rate remains the same, the more apples I put in one basket, the less time it takes.
Here are some related threads:
How can I be sure that a multi-bit-per-symbol encoding schema exists?
What is difference between the terms bit rate,baud rate and data rate?
bit rate : no of bits(0 or 1 for binary signal) transmitted per second.
baud rate : no of symbols per second.
A symbol consists of 'n' number of bits.
Baud rate = (bit rate)/n
So baud rate is always less than or equal to bit rate.It is equal when signal is binary.
Baud rate is mostly used in telecommunication and electronics, representing symbol per second or pulses per second, whereas bit rate is simply bit per second. To be simple, the major difference is that symbol may contain more than 1 bit, say n bits, which makes baud rate n times smaller than bit rate.
Suppose a situation where we need to represent a serial-communication signal, we will use 8-bit as one symbol to represent the info. If the symbol rate is 4800 baud, then that translates into an overall bit rate of 38400 bits/s. This could also be true for wireless communication area where you will need multiple bits for purpose of modulation to achieve broadband transmission, instead of simple baseline transmission.
Hope this helps.
Bit per second is what is means - rate of data transmission of ones and zeros per second are used.This is called bit per second(bit/s. However, it should not be confused with bytes per second, abbreviated as bytes/s, Bps, or B/s.
Raw throughput values are normally given in bits per second, but many software applications report transfer rates in bytes per second.
So, the standard unit for bit throughput is the bit per second, which is commonly abbreviated bit/s, bps, or b/s.
Baud is a unit of measure of changes , or transitions , that occurs in a signal in each second.
For example if the signal changes from one value to a zero value(or vice versa) one hundred times per second, that is a rate of 100 baud.
The other one measures data(the throughput of channel), and the other ones measures transitions(called signalling rates).
For example if you look at modern modems they use advanced modulation techniques that encoded more than one bit of data into each transition.
Thanks.
The bit rate is a measure of the number of bits that are transmitted per unit of time.
The baud rate, which is also known as symbol rate, measures the number of symbols that are transmitted per unit of time.
A symbol typically consists of a fixed number of bits depending on what the symbol is defined as(for example 8bit or 9bit data). The baud rate is measured in symbols per second.
Take an example, where an ascii character 'R' is transmitted over a serial channel every one second.
The binary equivalent is 01010010.
So in this case, the baud rate is 1(one symbol transmitted per second) and the bit rate is 8 (eight bits are transmitted per second).
This topic is confusing because there are 3 terms in use when people think there are just 2, namely:
"bit rate": units are bits per second
"baud": units are symbols per second
"Baud rate": units are bits per second
"Baud rate" is really a marketing term rather than an engineering term. "Baud rate" was used by modem manufactures in a similar way to megapixels is used for digital cameras. So the higher the "Baud rate" the better the modem was perceived to be.
The engineering unit "baud" is already a rate (symbols per second) which distinguishes it from the "Baud rate" term. However, you can see from the answers that people are confusing these 2 terms together such as baud/sec which is wrong.
From an engineering point of view, I recommend people use the term "bit rate" for "RS-232" and consign to history the term "Baud rate". Use the term "baud" for modulation schemes but avoid it for "RS-232".
In other words, "bit rate" and "Baud rate" are the same thing which means how many bits are transmitted along a wire in one second. Note that bits per second (bps) is the low-level line rate and not the information data rate because asynchronous "RS-232" has start and stop bits that frame the 8 data bits of information so bps includes all bits transmitted.
Bit rate is a measure of the number of data bits (that's 0's and 1's) transmitted in one second. A figure of 2400 bits per second means 2400 zeros or ones can be transmitted in one second, hence the abbreviation 'bps'.
Baud rate by definition means the number of times a signal in a communications channel changes state. For example, a 2400 baud rate means that the channel can change states up to 2400 times per second. When I say 'change state' I mean that it can change from 0 to 1 up to 2400 times per second. If you think about this, it's pretty much similar to the bit rate, which in the above example was 2400 bps.
Whether you can transmit 2400 zeros or ones in one second (bit rate), or change the state of a digital signal up to 2400 times per second (baud rate), it the same thing.
Serial Data Speed:
Data rate (bps) = 1/Tb
Tb is the time duration of 1 bit
If the bit duration is 2ms then data rate is 1/2x10-3 , which is about 500 bps.
Baud rate:
Baud rate is defined as no. of signalling elements(symbols) in a given unit of time (say 1 sec) or it means number of time signal changes its state.When the signal is binary then baud rate and bit rate are same.
Bit rate:- Bit rate is nothing but number of bits transmitted per second.For example if Bit rate is 1000 bps then 1000 bits are i.e. 0s or 1s transmitted per second.
There are few other terms similar to this (i.e serial speed, bit rate, baud rate, USB transfer rate),and i guess(?) the values that are printed on serial monitor relates to serial speed, baud rate and USB transfer rate. Bit rate isn't an another term, please correct me if i am wrong, because serial monitor prints some values at an interval of time and value is definitely a set of bits. so if one value is printed i can say no of bits present in the respective value which gets printed on serial monitor per unit time will be the bit rate.
Replies here are misleading. Saying true, but no one tell that for UART a symbol is not a single character but a single bit and this way the question was tagged.
For example 115200/8n1 is 11520 bytes per second as a single ASCII character is a 1 start bit plus 8 data bit plus 1 stop bit.
As correctly pointed out in the other replies, the bitrate is the amount of logical (or "abstract high level") information transferred in a given time, while baud rate is the number of symbols (more or less "signal changes") in the physical line in a given time.
While it is easy to understand that if a transmitted symbol carries 4 bits of information, then the bitrate is four time the baud rate, things get blurred in case of, for example, a RS-232 serial line.
The classic serial line works on bytes (well, "frames"), not bits. There is no way to transmit fewer that 8 bits (i.e. a byte), because the serial line defines a "frame" (I assume frames with 8 data bits, no parity, 1 start bit and 1 stop bit); and this is usually OK, because devices (computers) work probably on bytes, not single bits.
Given that, when a device sends a byte, i.e. 8 bits, the physical lines transmits 10 symbols, because to the original data composed of 8 bits, 2 more are added (start and stop bits, they are needed for synchronization). Some confusion can arise because the symbols transmitted on the physical line are also called "bits", but they are really symbols (MARK and SPACE, actually).
So in that classic RS-232 (in case of "8N1" frame) the bitrate is actually 8/10 of the baudrate. If we add the parity bit, the ratio lowers further and becomes 8/11.
The number of bits or symbols per second translates directly to the duration of them (bits or symbols). What does it mean for an engineer designing a system? It means that if he is designing a line filter to protect the line or reduce the noise, he should take the duration (or frequency) of the symbols transmitted on that line. For a baudrate of 1000 baud, he knows that the frequency of the signal is 1 KHz, and that a symbol has a duration of 1 ms. Fine. But if he has to calculate how much time is needed to transfer a file from a device to another, say a file of 1000 bytes, he must consider the bitrate, not the baudrate! Because the devices, at higher level, do not even see the start and stop bits, they are only a burden which slow down the communication (but they are useful for error checking).
To take it to the extreme case, imagine that a serial frame is just a bit long. For every bit transmitted by a device, three symbols would travel in the physical line. And if a parity were added, then four symbol would travel: the bitrate would be 1/4 of the baudrate. And if we add a second stop bit, the bitrate goes down to 1/5 of the baudrate!
Me again with another innocuous Z80 question :-) The way my emulator core is currently structured, I am incrementing the lower 7 bits of the memory refresh register every time an opcode byte is fetched from memory - this means for multi-byte instructions, such as those that begin DD or FD, I am incrementing the register twice - or in the instance of an instruction such as RLC (IX+d) three times (as it is laid out opcode1-opcode2-d-opcode3).
Is this correct? I am unsure - the Z80 manual is a little unclear on this, as it says that CPDR (a two byte instruction) increments it twice, however the 'Memory Refresh Register' section merely says it increments after each instruction fetch. I have noticed that J80 (an emulator I checked as I'm not sure about this) only increments after the first opcode byte of an instruction.
Which is correct? I guess it is not hugely important in any case, but it would be nice to know :-) Many thanks.
Regards,
Phil Potter
The Zilog timing diagrams hold the answer to your question.
A refresh occurs during T3 and T4 of all M1 (opcode fetch) cycles.
In the case of single-opcode instructions, that's one refresh per instruction. For single-prefix instructions (prefixes are read using M1 cycles) that's two refreshes per instruction.
For those weird DD-CB-disp-opcode and FD-CB-disp-opcode type instructions (weird because the displacement byte comes before the final opcode rather than after it), the number of refreshes is at least 3 (for the two prefixes and final opcode), but I'm not sure if the displacement byte is read as part of an M1 cycle (which would trigger another refresh) or a normal memory read cycle (no refresh). I'm inclined to believe the displacement byte is read in an M1 cycle for these instructions, but I'm not sure. I asked Sean Young about this; he wasn't sure either. Does anyone know for certain?
UPDATE:
I answered my own question re those weird DD-CB-disp-opcode and FD-CB-disp-opcode instructions. If you check Zilog's documentation for these type instruction, such as
RLC (IX+d), you'll note that the instruction requires 6 M-cycles and 23 T-cycles broken down as: (4,4,3,5,4,3).
We know the first two M-cycles are M1 cycles to fetch the DD and CB prefixes (4 T-cycles each). The next M-cycle reads the displacement byte d. But that M-cycle uses only 3 T-cycles, not 4, so it can't be an M1 cycle; instead it's a normal Memory Read cycle.
Here's the breakdown of the RLC (IX+d) instruction's six M-cycles:
M1 cycle to read the 0xDD prefix (4 T-cycles)
M1 cycle to read the 0xCB prefix (4 T-cycles)
Memory Read cycle to read the displacement byte (3 T-cycles)
M1 cycle to fetch the 0x06 opcode and load IX into the ALU (5 T-cycles)
Memory Read cycle to calculate and read from address IX+d (4 T-cycles)
Memory Write cycle to calculate RLC and write the result to address IX+d (3 T-cycles)
(The RLC calculation overlaps M-cycles 5 and 6.)
These type instructions are unique in that they're the only Z80 instructions that have non-contiguous M1 cycles (M-cycles 1, 2 and 4 above). They're also the slowest!
Paul
Sean Young's Z80 Undocumented Features has a different story. Once for unprefixed, twice for a single prefix, also twice for a double prefix (DDCB only), and once for no-op prefix.
Block instructions of course affect R every time they run (and they run BC times).
I've seen a couple of comments now that these weird DDCB and FDCB instructions only increment the R register twice.
It's always been my assumption (and the way I implemented my Z80 emulator) that the R register is implemented at the end of every M1 cycle.
To recap, these weird DDCB and FDCB instructions are four bytes long:
DD CB disp opcode
FD CB disp opcode
It's clear that the two prefix opcodes are read using M1 cycles, causing the R register to be incremented at the end of each of those cycles.
It's also clear that the displacement byte that follows the CB prefix is read by a normal Read cycle, so the R register is not incremented at the end of that cycle.
That leaves the final opcode. If it's read by an M1 cycle, then either the R register is incremented at the end of the cycle, resulting in a total of 3 increments, or the Z80 special cases this M1 cycle and doesn't increment the R register.
There's another possibility. What if the final opcode is read by a normal Read cycle, like the displacement byte that preceded it, and not by an M1 cycle? That of course would also cause the R register to be incremented only twice for these instructions, and wouldn't require the Z80 to make an exception of not incrementing the R register at the end of every M1 cycle.
This might also make better sense in terms of the Z80's internal state. Once it switches to normal Read cycles to read an instruction's additional bytes (in this case the displacement byte following the CB prefix), it never switches back to M1 cycles until it starts the next instruction.
Can anyone test this on real Z80 hardware, to confirm the value of R register following one of these DDCB or FDCB instructions?
All references I can find online say that R is incremented once per instruction irrespective of its length.