Intel CPU theoretical GEMM performance does not match real test performance [duplicate]

I'm confused on how many flops per cycle per core can be done with Sandy-Bridge and Haswell.
As I understand it, it should be 4 FLOPs per cycle per core with SSE and 8 FLOPs per cycle per core with AVX/AVX2.
This seems to be verified here:
How do I achieve the theoretical maximum of 4 FLOPs per cycle?
and here:
Sandy-Bridge CPU specification.
However, the link below seems to indicate that Sandy Bridge can do 16 FLOPs per cycle per core and Haswell 32 FLOPs per cycle per core:
http://www.extremetech.com/computing/136219-intels-haswell-is-an-unprecedented-threat-to-nvidia-amd.
Can someone explain this to me?
Edit:
I understand now why I was confused. I thought the term FLOP only referred to single-precision floating point (SP). I see now that the tests at How do I achieve the theoretical maximum of 4 FLOPs per cycle? are actually done in double precision (DP), so they achieve 4 DP FLOPs/cycle for SSE and 8 DP FLOPs/cycle for AVX. It would be interesting to redo these tests in SP.

Here are theoretical max FLOP counts (per core) for a number of recent processor microarchitectures, and an explanation of how to achieve them.
In general, to calculate this, look up the throughput of the FMA instruction(s), e.g. on https://agner.org/optimize/ or any other microbenchmark result, and multiply:
(FMAs per clock) * (vector elements / instruction) * 2 (FLOPs / FMA).
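As a quick worked example (using the Haswell numbers listed below): 2 FMAs/clock * 4 doubles/vector * 2 FLOPs/FMA = 16 DP FLOPs/cycle, and 2 * 8 floats * 2 = 32 SP FLOPs/cycle.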
Note that achieving this in real code requires very careful tuning (like loop unrolling), and near-zero cache misses, and no bottlenecks on anything else. Modern CPUs have such high FMA throughput that there isn't much room for other instructions to store the results, or to feed them with input. e.g. 2 SIMD loads per clock is also the limit for most x86 CPUs, so a dot product will bottleneck on 2 loads per 1 FMA. A carefully-tuned dense matrix multiply can come close to achieving these numbers, though.
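For example, a plain AVX2/FMA dot-product inner loop (an illustrative sketch, not taken from any particular library) already needs two loads for every FMA, so it hits the 2-loads-per-clock limit well before the 2-FMAs-per-clock limit:

    #include <immintrin.h>

    /* Illustrative AVX2+FMA3 dot product; assumes n is a multiple of 4.
     * Each FMA consumes two freshly loaded vectors, so the 2 loads/clock
     * limit caps this at 1 FMA/clock even with perfect scheduling.
     * (Real tuned code would also unroll with several independent
     * accumulators to hide FMA latency, as discussed further below.) */
    double dot(const double *a, const double *b, long n)
    {
        __m256d acc = _mm256_setzero_pd();
        for (long i = 0; i < n; i += 4) {
            __m256d va = _mm256_loadu_pd(a + i);   /* load #1 */
            __m256d vb = _mm256_loadu_pd(b + i);   /* load #2 */
            acc = _mm256_fmadd_pd(va, vb, acc);    /* 1 FMA per 2 loads */
        }
        /* horizontal sum of the 4 lanes */
        __m128d lo = _mm256_castpd256_pd128(acc);
        __m128d hi = _mm256_extractf128_pd(acc, 1);
        lo = _mm_add_pd(lo, hi);
        lo = _mm_add_sd(lo, _mm_unpackhi_pd(lo, lo));
        return _mm_cvtsd_f64(lo);
    }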
If your workload includes any ADD/SUB or MUL that can't be contracted into FMAs, the theoretical max numbers aren't an appropriate goal for your workload. Haswell/Broadwell have 2-per-clock SIMD FP multiply (on the FMA units), but only 1 per clock SIMD FP add (on a separate vector FP add unit with lower latency). Skylake dropped the separate SIMD FP adder, running add/mul/fma the same at 4c latency, 2-per-clock throughput, for any vector width.
Intel
Note that Celeron/Pentium versions of recent microarchitectures don't support AVX or FMA instructions, only SSE4.2.
Intel Core 2 and Nehalem (SSE/SSE2):
4 DP FLOPs/cycle: 2-wide SSE2 addition + 2-wide SSE2 multiplication
8 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication
Intel Sandy Bridge/Ivy Bridge (AVX1):
8 DP FLOPs/cycle: 4-wide AVX addition + 4-wide AVX multiplication
16 SP FLOPs/cycle: 8-wide AVX addition + 8-wide AVX multiplication
Intel Haswell/Broadwell/Skylake/Kaby Lake/Coffee/... (AVX+FMA3):
16 DP FLOPs/cycle: two 4-wide FMA (fused multiply-add) instructions
32 SP FLOPs/cycle: two 8-wide FMA (fused multiply-add) instructions
(Using 256-bit vector instructions can reduce max turbo clock speed on some CPUs.)
Intel Skylake-X/Skylake-EP/Cascade Lake/etc (AVX512F) with 1 FMA unit: some Xeon Bronze/Silver
16 DP FLOPs/cycle: one 8-wide FMA (fused multiply-add) instruction
32 SP FLOPs/cycle: one 16-wide FMA (fused multiply-add) instruction
Same computation throughput as with narrower 256-bit instructions, but speedups can still be possible with AVX512 for wider loads/stores, a few vector operations that don't run on the FMA units like bitwise operations, and wider shuffles.
(Having 512-bit vector instructions in flight shuts down the vector ALU on port 1. Also reduces the max turbo clock speed, so "cycles" isn't a constant in your performance calculations.)
Intel Skylake-X/Skylake-EP/Cascade Lake/etc (AVX512F) with 2 FMA units: Xeon Gold/Platinum, and i7/i9 high-end desktop (HEDT) chips.
32 DP FLOPs/cycle: two 8-wide FMA (fused multiply-add) instructions
64 SP FLOPs/cycle: two 16-wide FMA (fused multiply-add) instructions
(Having 512-bit vector instructions in flight shuts down the vector ALU on port 1. Also reduces the max turbo clock speed.)
Future: Intel Cooper Lake (successor to Cascade Lake) is expected to introduce bfloat16 ("Brain Float"), a 16-bit floating-point format for neural-network workloads, with support for actual SIMD computation on it, unlike the current F16C extension that only supports load/store with conversion to float32. This should double the FLOP/cycle throughput vs. single-precision on the same hardware.
Current Intel chips only have actual computation directly on standard float16 in the iGPU.
AMD
AMD K10:
4 DP FLOPs/cycle: 2-wide SSE2 addition + 2-wide SSE2 multiplication
8 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication
AMD Bulldozer/Piledriver/Steamroller/Excavator, per module (two cores):
8 DP FLOPs/cycle: 4-wide FMA
16 SP FLOPs/cycle: 8-wide FMA
AMD Ryzen:
8 DP FLOPs/cycle: 4-wide FMA
16 SP FLOPs/cycle: 8-wide FMA
x86 low power
Intel Atom (Bonnell/45nm, Saltwell/32nm, Silvermont/22nm):
1.5 DP FLOPs/cycle: scalar SSE2 addition + scalar SSE2 multiplication every other cycle
6 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication every other cycle
AMD Bobcat:
1.5 DP FLOPs/cycle: scalar SSE2 addition + scalar SSE2 multiplication every other cycle
4 SP FLOPs/cycle: 4-wide SSE addition every other cycle + 4-wide SSE multiplication every other cycle
AMD Jaguar:
3 DP FLOPs/cycle: 4-wide AVX addition every other cycle + 4-wide AVX multiplication every four cycles
8 SP FLOPs/cycle: 8-wide AVX addition every other cycle + 8-wide AVX multiplication every other cycle
ARM
ARM Cortex-A9:
1.5 DP FLOPs/cycle: scalar addition + scalar multiplication every other cycle
4 SP FLOPs/cycle: 4-wide NEON addition every other cycle + 4-wide NEON multiplication every other cycle
ARM Cortex-A15:
2 DP FLOPs/cycle: scalar FMA or scalar multiply-add
8 SP FLOPs/cycle: 4-wide NEONv2 FMA or 4-wide NEON multiply-add
Qualcomm Krait:
2 DP FLOPs/cycle: scalar FMA or scalar multiply-add
8 SP FLOPs/cycle: 4-wide NEONv2 FMA or 4-wide NEON multiply-add
IBM POWER
IBM PowerPC A2 (Blue Gene/Q), per core:
8 DP FLOPs/cycle: 4-wide QPX FMA every cycle
SP elements are extended to DP and processed on the same units
IBM PowerPC A2 (Blue Gene/Q), per thread:
4 DP FLOPs/cycle: 4-wide QPX FMA every other cycle
SP elements are extended to DP and processed on the same units
Intel MIC / Xeon Phi
Intel Xeon Phi (Knights Corner), per core:
16 DP FLOPs/cycle: 8-wide FMA every cycle
32 SP FLOPs/cycle: 16-wide FMA every cycle
Intel Xeon Phi (Knights Corner), per thread:
8 DP FLOPs/cycle: 8-wide FMA every other cycle
16 SP FLOPs/cycle: 16-wide FMA every other cycle
Intel Xeon Phi (Knights Landing), per core:
32 DP FLOPs/cycle: two 8-wide FMA every cycle
64 SP FLOPs/cycle: two 16-wide FMA every cycle
The reason why there are per-thread and per-core figures for IBM Blue Gene/Q and Intel Xeon Phi (Knights Corner) is that these cores have a higher instruction issue rate when running more than one thread per core.

The throughput for Haswell is lower for addition than for multiplication and FMA. There are two multiplication/FMA units, but only one f.p. add unit. If your code contains mainly additions then you have to replace the additions by FMA instructions with a multiplier of 1.0 to get the maximum throughput.
The latency of FMA instructions on Haswell is 5 and the throughput is 2 per clock. This means that you must keep 10 parallel operations going to get the maximum throughput. If, for example, you want to add a very long list of f.p. numbers, you would have to split it in ten parts and use ten accumulator registers.
This is possible indeed, but who would make such a weird optimization for one specific processor?
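For reference, the multiple-accumulator idea above looks roughly like this in AVX2 intrinsics (an illustrative sketch: 4 accumulators instead of 10 for brevity, additions written as FMAs with a multiplier of 1.0 as suggested; names are made up):

    #include <immintrin.h>

    /* Sum a long array with several independent accumulators so back-to-back
     * operations don't wait on each other's latency. Assumes AVX2+FMA3 and
     * n a multiple of 16; remainder handling omitted. */
    double sum_array(const double *x, long n)
    {
        const __m256d one = _mm256_set1_pd(1.0);   /* multiplier of 1.0 */
        __m256d acc0 = _mm256_setzero_pd(), acc1 = _mm256_setzero_pd();
        __m256d acc2 = _mm256_setzero_pd(), acc3 = _mm256_setzero_pd();

        for (long i = 0; i < n; i += 16) {
            /* additions written as FMAs so they can use either FMA unit */
            acc0 = _mm256_fmadd_pd(_mm256_loadu_pd(x + i +  0), one, acc0);
            acc1 = _mm256_fmadd_pd(_mm256_loadu_pd(x + i +  4), one, acc1);
            acc2 = _mm256_fmadd_pd(_mm256_loadu_pd(x + i +  8), one, acc2);
            acc3 = _mm256_fmadd_pd(_mm256_loadu_pd(x + i + 12), one, acc3);
        }
        __m256d acc = _mm256_add_pd(_mm256_add_pd(acc0, acc1),
                                    _mm256_add_pd(acc2, acc3));
        double tmp[4];
        _mm256_storeu_pd(tmp, acc);
        return tmp[0] + tmp[1] + tmp[2] + tmp[3];
    }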

Related

Weak scaling of mpi program for matrix-vector multiplication

I have written some mpi code that solves systems of equations using the conjugate gradient method. In this method matrix-vector multiplication takes up most of the time. As a parallelization strategy, I do the multiplication in blocks of rows and then I
gather the results in the root process. The remaining steps are performed by the root process which broadcasts the results whenever a matrix-vector multiplication needs to be performed.
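In outline, the multiply-and-gather step is something like the sketch below (a simplified illustration with made-up names, not the actual code; it assumes a dense n x n matrix with n divisible by the number of ranks):

    #include <mpi.h>

    /* Each rank multiplies its contiguous block of rows by the full vector,
     * then the root gathers the partial results. */
    void block_row_matvec(const double *local_rows, /* n/p rows, each length n */
                          const double *x,          /* full vector, length n    */
                          double *y,                /* result on root, length n */
                          int n, MPI_Comm comm)
    {
        int p;
        MPI_Comm_size(comm, &p);
        int rows = n / p;

        double local_y[rows];                       /* C99 VLA, for brevity */
        for (int i = 0; i < rows; i++) {
            double s = 0.0;
            for (int j = 0; j < n; j++)
                s += local_rows[(long)i * n + j] * x[j];
            local_y[i] = s;
        }

        /* Gather the row-block results on the root (rank 0); the root then
         * performs the remaining CG steps and broadcasts the updated vector
         * (e.g. MPI_Bcast(x, n, MPI_DOUBLE, 0, comm)) before the next
         * multiplication. */
        MPI_Gather(local_y, rows, MPI_DOUBLE, y, rows, MPI_DOUBLE, 0, comm);
    }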
The strong scaling curve representing the speedup is fine
But the weak scaling curve representing the efficiency is quite bad
In theory, the blue curve should be close to the red one.
Is this intrinsic to the parallelization strategy or am I doing something wrong?
Details
The measurements are in seconds. The experiments are performed on a cluster where each node has
2 Skylake processors running at 2.3 GHz, with 18 cores each, 192 GB of DDR3 RAM and an 800 GB NVMe local drive. Amdahl's prediction is computed with the formula (0.0163 + 0.9837 / p)^-1. Gustafson's prediction is computed with the formula 0.9873 + 0.0163/p, where p is the number of processors. The experimental values are in both cases obtained by dividing the time spent by a single computation unit by the time spent by p computation units.
For weak scaling, I start with a load per processor of W = 1768^2 matrix entries. Then the load with p processors will be M^2 = p*W matrix entries. Thus, we set the matrix's side to M = 1768*sqrt(p) for p processes. This gives matrix sides of 1768, 3536, 5000, 7071 and 10000 for 1, 2, 4, 8, 16, 32 processors respectively. I also fix the number of iterations to 500 so that the measurements are not affected by variability in the data.
I think your Gustafson formula is wrong. It should be:
S_p = F_s + p F_p
You have a division that should be a multiplication. See for instance https://theartofhpc.com/istc/parallel.html#Gustafson'slaw
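As an illustrative check with the fractions from the question (F_s = 0.0163, F_p = 0.9837): at p = 32 the corrected formula predicts a scaled speedup of 0.0163 + 32 * 0.9837 ≈ 31.5, whereas the formula as written gives 0.9873 + 0.0163/32 ≈ 0.99, i.e. essentially no predicted speedup.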

How do you get rid of Hz when calculating MIPS?

I'm learning computer structure.
I have a question about MIPS, one of the ways to calculate CPU execution time.
The MIPS formula is as follows: MIPS = clock rate / (CPI * 10^6).
If the clock rate is 4 GHz and the CPI is 1,
I think the MIPS value is 4,000 Hz,
because it's (4 * 10^9 Hz) / (1 * 10^6).
I don't know if it's right to leave the result in units of Hz.
Hz is 1/s. MIPS is actually "mega instructions / s". To be clear, the "Per" is the slash for division: Mega Instructions / Second.
4 GHz is 4G cycles/s. Divide that by a CPI of 1 cycle/instruction and the cycles cancel, leaving 4G instructions/s.
It's not "4000 Hz MIPS", because MIPS already means "per second". Writing it that way would amount to 4000 million instructions * (1/s) * (1/s).
You drop the Hz because "per second" is already part of the unit name you are labeling the number with.
For any quantity, it's important to know what units it's in. As well as a scale factor (like a parsec is many times longer than an angstrom), units have dimensions, and this is fundamental (at least for physical quantities like time; it can get less obvious when you're counting abstract things).
Those example units are both units of length so they have the same dimensions; it's physically meaningful to add or subtract two lengths, or if we divide them then length cancels out and we have a pure ratio. (Obviously we have to take care of the scale factors, because 1 parsec / 1 angstrom isn't 1, it's 3.0856776e+26.) That is in fact why we can say a parsec is longer than an angstrom, but we can't say it's longer than a second. (It's longer than a light-second, but that's not the only possible speed that can relate time and distance.)
1 m/s is not the same thing as 1 kg, or as dimensionless 1.
Time (seconds) is a dimension, and we can treat instructions-executed as another dimension. (I'll call it I since there isn't a standard SI unit for it, AFAIK, and one could argue it's not a real physical dimension. That doesn't stop this kind of dimensional analysis from being useful, though.)
(An example of a standard count-based unit is the mole in chemistry, a count of particles. It's an SI base unit.)
Counts of clock cycles can be treated as another dimension, in which case clock frequency is cycles / sec rather than just s^-1. (Seconds, s, are the base SI unit of time.) If we want to make sure we're correctly cancelling it out on both sides, that's a useful approach, especially when we have quantities like cycles/instruction (CPI). Thus cycle time is s/c, seconds per cycle.
Hz has dimensions of s^-1, so we shouldn't use Hz for a something-per-second rate unless that something is dimensionless. (Clock frequencies normally are given in Hz, because "cycles" aren't a real unit in physics. That's something we're introducing to make sure everything cancels properly.)
MIPS has dimensions of instructions / time (I / s), so the factors that contribute to it must cancel out any cycle counts. And we're not calling it Hz because we're considering "instructions" as a real unit, thus 4000 MIPS not 4000 MHz. (And MIPS is itself a unit, so it's definitely not "4000 Hz MIPS"; if it made sense to combine units that way, that would be dimensions of I/s^2, which would be an acceleration, not a speed.)
From your list of formulas, leaving out the factor of 10^6 (that's the M in MIPS, just a metric prefix in front of Instructions Per Second, I/s):
instructions / total time obviously works without needing any cancelling.
I / (c * s / c) = I / s after cancelling cycles in the denominator
(I * c/s) / (I * c/I) cancel the Instructions in the denominator:
(I * c/s) / c cancel the cycles:
(I * 1/s) / 1 = I/s
(c/s) / (c/I) cancel cycles:
(1/s) / (1/I) apply 1/(1/I) = I reciprocal of reciprocal
(1/s) * I = I / s
All of these have dimensions of Instructions / Seconds, i.e. I/s or IPS. With a scale factor of 10^6, that's MIPS.
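Plugging the question's numbers into that, with explicit units (an illustrative check):
    (4 * 10^9 cycles/s) / (1 cycle/instruction) = 4 * 10^9 instructions/s
    (4 * 10^9 I/s) / 10^6 = 4000 MIPS, not 4000 MHz.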
BTW, this is called "dimensional analysis", and in physics (and other sciences) it's a handy tool to see if a formula is sane, because both sides must have the same dimensions.
e.g. if you're trying to remember how the position (or distance travelled) of an accelerating object works, d = 1/2 * a * t^2 works because acceleration is distance / time / time (e.g. m/s^2), and time-squared cancels out the s^-2, leaving just distance. If you mis-remembered something like 1/2 * a^2 * t, you can immediately see that's wrong, because you'd have dimensions of (m/s^2)^2 * s = m^2/s^3, which is not a unit of distance.
(The factor of 1/2 is not something you can check with dimensional analysis; you only get those constant factors like 1/2, pi, e, or whatever from doing the full math, e.g. taking the derivative or integral, or making geometric arguments about linear plots of velocity vs. time.)

What is the clock rate of my chip processor?

My "My PC" displays that my processor is "i7-8550U CPU # 1.80 GHz 1.99 GHz". I wonder which is my intel chip clock rate, 1.8 or 1.99?
1.99GHz is probably the current frequency, while 1.80GHz is the "sticker frequency", the rated sustained frequency under all workload / temperature conditions.
https://en.wikipedia.org/wiki/Intel_Turbo_Boost.
It's an i7, so it does support Turbo.

Fast CRC using PCLMULQDQ - final reduction of 128 bits

I've been trying to implement the algorithm for CRC32 calculation as described here:
http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-crc-computation-generic-polynomials-pclmulqdq-paper.pdf; and I'm confused about Step 3, the reduction from 128 bits to 64 bits. Hopefully someone can clarify the steps for me:
Multiply the upper 64 bits of the remaining 128 bits with the constant K5, result is 96 bits
Multiply the upper 64 bits of the 96 bits with the constant K6, result is 64 bits
Do these results need to be XORed with the lower 64 bits of the starting 128 bits, following the pattern of the previous folds? Figure 8 in the paper doesn't specify, and I am confused by the alignment of the data in the figure.
It appears that figure 8 shows the final 128 bits (working remainder xor last 128 bits of buffer data) followed by 32 bits of appended zeros, since crc32 = (msg(x) • x^32) % p(x). So you see a total of 160 bits as 64|32|32|32.
My assumption is that the upper 64 bits are multiplied by K5 producing a 96 bit product. That product is then xor'ed to the lower 96 bits of the 160 bit entity (remember the lower 32 bits start off as 32 bits of appended zeros).
Then the upper 32 bits (not 64) of the lower 96 bits are multiplied by K6 producing a 64 bit product which is xor'ed to the lower 64 bits of the 160 bit entity.
Then the Barrett algorithm is used to produce a 32 bit CRC from the lower 64 bits of the 160 bit entity (where the lower 32 bits were originally appended zeros).
To explain the Barrett algorithm, consider the 64 bits as a dividend, and the CRC polynomial as a divisor. Then remainder = dividend - (⌊ dividend / divisor ⌋ · divisor). Rather than actually divide, pclmulqdq is used, and ⌊ dividend / divisor ⌋ = (dividend · ⌊ 2^64 / divisor ⌋) >> 64.
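To make the Barrett step concrete, here is a small scalar sketch (an illustration, not the paper's code) that emulates the carry-less multiplies in plain C for a non-reflected CRC-32. In real code each of the two multiplies would be a single PCLMULQDQ with precomputed constants, and a reflected CRC needs the bit-reversed variants.

    #include <stdint.h>
    #include <stdio.h>

    typedef unsigned __int128 u128;       /* GCC/Clang extension, for brevity */

    /* Carry-less (GF(2)) multiply of two 64-bit polynomials -> 128-bit
     * product; this is what one PCLMULQDQ computes in hardware. */
    static u128 clmul64(uint64_t a, uint64_t b)
    {
        u128 r = 0;
        for (int i = 0; i < 64; i++)
            if ((b >> i) & 1)
                r ^= (u128)a << i;
        return r;
    }

    /* mu = floor(x^64 / P) by plain polynomial long division;
     * P is the 33-bit CRC polynomial with bit 32 set. */
    static uint64_t mu_from_poly(uint64_t P)
    {
        u128 rem = (u128)1 << 64;
        uint64_t q = 0;
        for (int i = 64; i >= 32; i--)
            if ((rem >> i) & 1) {
                q |= 1ull << (i - 32);
                rem ^= (u128)P << (i - 32);
            }
        return q;
    }

    /* Barrett step on the low 64 bits of the 160-bit entity:
     *   q   = (dividend * mu) >> 64   (carry-less multiply)
     *   crc = dividend xor (q * P)    (only the low 32 bits survive) */
    static uint32_t barrett_crc32(uint64_t dividend, uint64_t P, uint64_t mu)
    {
        uint64_t q = (uint64_t)(clmul64(dividend, mu) >> 64);
        return (uint32_t)(dividend ^ (uint64_t)clmul64(q, P));
    }

    int main(void)
    {
        const uint64_t P  = 0x104C11DB7ULL;        /* CRC-32 polynomial, bit 32 set */
        const uint64_t mu = mu_from_poly(P);       /* floor(x^64 / P) */
        uint64_t folded   = 0x123456789ABCDEF0ULL; /* stand-in for the folded 64 bits */
        printf("mu = 0x%llx  crc = 0x%08x\n",
               (unsigned long long)mu, (unsigned)barrett_crc32(folded, P, mu));
        return 0;
    }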

Loss of precision: Parallel 2D FFT using 1D FFTW and MPI calls

I am trying to match the result of a 2D FFT done with the already implemented calls in FFTW against my own version of a 2D FFT built from 1D FFTW calls and MPI communication.
So, to summarize, I've followed the theory:
1 - FFT in y dimension
2 - transpose the matrix
3 - MPI_Alltoall communication
4 - FFT in x dimension
5 - transpose back
6 - MPI_Alltoall communication
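Steps 2-3 (and 5-6) correspond roughly to the sketch below (a simplified illustration with made-up names, not the actual code; it assumes an N x N complex matrix, N divisible by the number of ranks, and contiguous row blocks per rank):

    #include <stdlib.h>
    #include <complex.h>
    #include <mpi.h>

    /* Redistribute 'block' (N/p local rows of length N) so that afterwards
     * each rank holds N/p rows of the transposed matrix. */
    void distributed_transpose(double complex *block, int N, MPI_Comm comm)
    {
        int p, r;
        MPI_Comm_size(comm, &p);
        MPI_Comm_rank(comm, &r);
        int rpr = N / p;                                 /* rows per rank */

        double complex *sendbuf = malloc((size_t)rpr * N * sizeof *sendbuf);
        double complex *recvbuf = malloc((size_t)rpr * N * sizeof *recvbuf);

        /* Pack: chunk s is the rpr x rpr sub-block destined for rank s. */
        for (int s = 0; s < p; s++)
            for (int i = 0; i < rpr; i++)
                for (int j = 0; j < rpr; j++)
                    sendbuf[((size_t)s*rpr + i)*rpr + j] =
                        block[(size_t)i*N + s*rpr + j];

        MPI_Alltoall(sendbuf, rpr*rpr, MPI_C_DOUBLE_COMPLEX,
                     recvbuf, rpr*rpr, MPI_C_DOUBLE_COMPLEX, comm);

        /* Unpack: chunk s came from rank s; its element (i,j) is
         * A[s*rpr+i][r*rpr+j], which becomes row j, column s*rpr+i of the
         * local block of the transposed matrix. */
        for (int s = 0; s < p; s++)
            for (int i = 0; i < rpr; i++)
                for (int j = 0; j < rpr; j++)
                    block[(size_t)j*N + s*rpr + i] =
                        recvbuf[((size_t)s*rpr + i)*rpr + j];

        free(sendbuf);
        free(recvbuf);
    }

(The per-rank 1-D FFTs before and after this step can be done with a batched plan such as fftw_plan_many_dft over the local rows.)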
I've tried with a small number of processors (8-12) and it seems to work fine. Correctness has been checked using the RMS difference between the 2D FFTW call and my own result. However, as I increase the number of cores and the size of the matrix, it seems that I am losing precision, i.e. the RMS check fails because the error is larger than expected (I set the tolerance to 1.0e-10):
Given a 512x512 matrix and an RMS tolerance of 1.0e-6:
"ERROR: Position 1 0 expected im-part 10936907150.600960 and got 10936907150.600958"
Given a 2048x2048 matrix and an RMS tolerance of 1.0e-6:
"ERROR:Position 1 0 expected real part -4294967296.000107 and got -4294967295.999999"
Why would I lose precision if all I am using are double types?
