My "My PC" displays that my processor is "i7-8550U CPU # 1.80 GHz 1.99 GHz". I wonder which is my intel chip clock rate, 1.8 or 1.99?
1.99 GHz is probably the current frequency, while 1.80 GHz is the "sticker frequency": the rated sustained frequency under all workload and temperature conditions.
See https://en.wikipedia.org/wiki/Intel_Turbo_Boost for details. It's an i7, so it does support Turbo Boost.
I'm confused about how many FLOPs per cycle per core can be done with Sandy Bridge and Haswell.
As I understand it, it should be 4 FLOPs per cycle per core with SSE and 8 FLOPs per cycle per core with AVX/AVX2.
This seems to be verified here: "How do I achieve the theoretical maximum of 4 FLOPs per cycle?", and here: "Sandy-Bridge CPU specification".
However, the link below seems to indicate that Sandy Bridge can do 16 FLOPs per cycle per core and Haswell 32 FLOPs per cycle per core:
http://www.extremetech.com/computing/136219-intels-haswell-is-an-unprecedented-threat-to-nvidia-amd
Can someone explain this to me?
Edit:
I understand now why I was confused. I thought the term FLOP referred only to single-precision floating point (SP). I see now that the tests at "How do I achieve the theoretical maximum of 4 FLOPs per cycle?" are actually in double-precision floating point (DP), so they achieve 4 DP FLOPs/cycle with SSE and 8 DP FLOPs/cycle with AVX. It would be interesting to redo these tests in SP.
Here are theoretical max FLOP counts (per core) for a number of recent processor microarchitectures, along with an explanation of how to achieve them.
In general, to calculate this, look up the throughput of the FMA instruction(s), e.g. on https://agner.org/optimize/ or any other microbenchmark result, and multiply:
(FMAs per clock) * (vector elements / instruction) * 2 (FLOPs / FMA)
Note that achieving this in real code requires very careful tuning (like loop unrolling), near-zero cache misses, and no bottlenecks on anything else. Modern CPUs have such high FMA throughput that there isn't much room for other instructions to store the results or to feed them with input. For example, 2 SIMD loads per clock is also the limit for most x86 CPUs, so a dot product will bottleneck at 2 loads per 1 FMA. A carefully tuned dense matrix multiply can come close to achieving these numbers, though.
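As a worked instance of that formula (the 3.0 GHz clock is just an assumed example figure, not a spec): a Haswell core runs 2 FMAs per clock on 256-bit vectors, i.e. 4 doubles per instruction, so 2 * 4 * 2 = 16 DP FLOPs/cycle; at an assumed sustained 3.0 GHz that is 16 * 3.0e9 = 48 double-precision GFLOPS per core, and twice that in single precision.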
If your workload includes any ADD/SUB or MUL that can't be contracted into FMAs, the theoretical max numbers aren't an appropriate goal. Haswell/Broadwell have 2-per-clock SIMD FP multiply (on the FMA units), but only 1-per-clock SIMD FP add (on a separate vector FP add unit with lower latency). Skylake dropped the separate SIMD FP adder, running add/mul/FMA the same at 4-cycle latency and 2-per-clock throughput, for any vector width.
Intel
Note that Celeron/Pentium versions of recent microarchitectures don't support AVX or FMA instructions, only SSE4.2.
Intel Core 2 and Nehalem (SSE/SSE2):
4 DP FLOPs/cycle: 2-wide SSE2 addition + 2-wide SSE2 multiplication
8 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication
Intel Sandy Bridge/Ivy Bridge (AVX1):
8 DP FLOPs/cycle: 4-wide AVX addition + 4-wide AVX multiplication
16 SP FLOPs/cycle: 8-wide AVX addition + 8-wide AVX multiplication
Intel Haswell/Broadwell/Skylake/Kaby Lake/Coffee/... (AVX+FMA3):
16 DP FLOPs/cycle: two 4-wide FMA (fused multiply-add) instructions
32 SP FLOPs/cycle: two 8-wide FMA (fused multiply-add) instructions
(Using 256-bit vector instructions can reduce max turbo clock speed on some CPUs.)
Intel Skylake-X/Skylake-EP/Cascade Lake/etc (AVX512F) with 1 FMA unit: some Xeon Bronze/Silver
16 DP FLOPs/cycle: one 8-wide FMA (fused multiply-add) instruction
32 SP FLOPs/cycle: one 16-wide FMA (fused multiply-add) instruction
Same computation throughput as with narrower 256-bit instructions, but speedups can still be possible with AVX512 for wider loads/stores, a few vector operations that don't run on the FMA units like bitwise operations, and wider shuffles.
(Having 512-bit vector instructions in flight shuts down the vector ALU on port 1. Also reduces the max turbo clock speed, so "cycles" isn't a constant in your performance calculations.)
Intel Skylake-X/Skylake-EP/Cascade Lake/etc (AVX512F) with 2 FMA units: Xeon Gold/Platinum, and i7/i9 high-end desktop (HEDT) chips.
32 DP FLOPs/cycle: two 8-wide FMA (fused multiply-add) instructions
64 SP FLOPs/cycle: two 16-wide FMA (fused multiply-add) instructions
(Having 512-bit vector instructions in flight shuts down the vector ALU on port 1. Also reduces the max turbo clock speed.)
Future: Intel Cooper Lake (the successor to Cascade Lake) is expected to introduce bfloat16 (Brain Float), a 16-bit floating-point format for neural-network workloads, with support for actual SIMD computation on it, unlike the current F16C extension, which only supports load/store with conversion to/from float32. This should double the FLOP/cycle throughput vs. single precision on the same hardware.
Current Intel chips only have actual computation directly on standard float16 in the iGPU.
AMD
AMD K10:
4 DP FLOPs/cycle: 2-wide SSE2 addition + 2-wide SSE2 multiplication
8 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication
AMD Bulldozer/Piledriver/Steamroller/Excavator, per module (two cores):
8 DP FLOPs/cycle: 4-wide FMA
16 SP FLOPs/cycle: 8-wide FMA
AMD Ryzen:
8 DP FLOPs/cycle: 4-wide FMA
16 SP FLOPs/cycle: 8-wide FMA
x86 low power
Intel Atom (Bonnell/45nm, Saltwell/32nm, Silvermont/22nm):
1.5 DP FLOPs/cycle: scalar SSE2 addition + scalar SSE2 multiplication every other cycle
6 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication every other cycle
AMD Bobcat:
1.5 DP FLOPs/cycle: scalar SSE2 addition + scalar SSE2 multiplication every other cycle
4 SP FLOPs/cycle: 4-wide SSE addition every other cycle + 4-wide SSE multiplication every other cycle
AMD Jaguar:
3 DP FLOPs/cycle: 4-wide AVX addition every other cycle + 4-wide AVX multiplication in four cycles
8 SP FLOPs/cycle: 8-wide AVX addition every other cycle + 8-wide AVX multiplication every other cycle
ARM
ARM Cortex-A9:
1.5 DP FLOPs/cycle: scalar addition + scalar multiplication every other cycle
4 SP FLOPs/cycle: 4-wide NEON addition every other cycle + 4-wide NEON multiplication every other cycle
ARM Cortex-A15:
2 DP FLOPs/cycle: scalar FMA or scalar multiply-add
8 SP FLOPs/cycle: 4-wide NEONv2 FMA or 4-wide NEON multiply-add
Qualcomm Krait:
2 DP FLOPs/cycle: scalar FMA or scalar multiply-add
8 SP FLOPs/cycle: 4-wide NEONv2 FMA or 4-wide NEON multiply-add
IBM POWER
IBM PowerPC A2 (Blue Gene/Q), per core:
8 DP FLOPs/cycle: 4-wide QPX FMA every cycle
SP elements are extended to DP and processed on the same units
IBM PowerPC A2 (Blue Gene/Q), per thread:
4 DP FLOPs/cycle: 4-wide QPX FMA every other cycle
SP elements are extended to DP and processed on the same units
Intel MIC / Xeon Phi
Intel Xeon Phi (Knights Corner), per core:
16 DP FLOPs/cycle: 8-wide FMA every cycle
32 SP FLOPs/cycle: 16-wide FMA every cycle
Intel Xeon Phi (Knights Corner), per thread:
8 DP FLOPs/cycle: 8-wide FMA every other cycle
16 SP FLOPs/cycle: 16-wide FMA every other cycle
Intel Xeon Phi (Knights Landing), per core:
32 DP FLOPs/cycle: two 8-wide FMA every cycle
64 SP FLOPs/cycle: two 16-wide FMA every cycle
The reason there are both per-thread and per-core figures for IBM Blue Gene/Q and Intel Xeon Phi (Knights Corner) is that these cores have a higher instruction issue rate when running more than one thread per core.
The throughput on Haswell is lower for addition than for multiplication and FMA. There are two multiplication/FMA units, but only one f.p. add unit. If your code contains mainly additions, then you have to replace the additions with FMA instructions that use a multiplier of 1.0 to get the maximum throughput.
The latency of FMA instructions on Haswell is 5 cycles and the throughput is 2 per clock. This means that you must keep 10 parallel operations in flight to get the maximum throughput. If, for example, you want to add a very long list of f.p. numbers, you would have to split it into ten parts and use ten accumulator registers, as in the sketch below.
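Below is a minimal sketch of that idea in C with AVX2/FMA intrinsics. It is not code from the answer above; the function name, the choice of ten 256-bit accumulators, and the assumption that the length is a multiple of 40 are just for illustration.

#include <immintrin.h>
#include <stddef.h>

/* Sum a long array of doubles, expressing each add as an FMA with a 1.0
   multiplier and spreading the work over ten independent accumulators so the
   5-cycle FMA latency (at 2 FMAs/clock) stays hidden. Assumes AVX2+FMA
   (e.g. compile with -mavx2 -mfma) and n being a multiple of 40. */
double sum_fma_ten_accumulators(const double *x, size_t n) {
    const __m256d one = _mm256_set1_pd(1.0);
    __m256d acc[10];
    for (int k = 0; k < 10; k++) acc[k] = _mm256_setzero_pd();

    for (size_t i = 0; i < n; i += 40)          /* 10 accumulators * 4 doubles each */
        for (int k = 0; k < 10; k++)
            /* acc[k] += 1.0 * x[i + 4k .. i + 4k + 3]: an addition done on the FMA units */
            acc[k] = _mm256_fmadd_pd(_mm256_loadu_pd(x + i + 4 * k), one, acc[k]);

    for (int k = 1; k < 10; k++) acc[0] = _mm256_add_pd(acc[0], acc[k]);
    double tmp[4];
    _mm256_storeu_pd(tmp, acc[0]);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}

Whether the compiler actually keeps all ten accumulators in registers depends on the compiler and optimization level, so it is worth checking the generated assembly.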
This is indeed possible, but who would make such a weird optimization for one specific processor?
I have a dataset with 30k rows and 12 columns. I tried to apply SVM and RandomForest to my training data (20k rows and 11 columns), but it's taking a long time to get the result.
I have a MacBook with a 1.1 GHz Dual-Core Intel Core M processor and 8 GB of 1600 MHz DDR3 memory.
Apple seems to have upgraded the output of powermetrics on M1 CPUs to include reports of consumed power. Output looks roughly like this:
sudo powermetrics | grep -i power
....
E-Cluster Power: 230 mW
P0-Cluster Power: 3475 mW
P1-Cluster Power: 268 mW
ANE Power: 0 mW
DRAM Power: 1037 mW
CPU Power: 3973 mW
GPU Power: 125 mW
Package Power: 7348 mW
GPU Power: 125 mW
Are any of these reported powers actually measured, e.g. measured off of one of the voltage regulators? Or is it a case of a table lookup, i.e. the OS just returns a value from a pre-characterized table of estimated power based on CPU/GPU core workloads?
What is included in Package Power? I would have expected the sum ANE+CPU+GPU+DRAM to be close to the total Package Power. Or is that difference caused by the power for all the glue logic surrounding the CPU/GPU/ANE and all the I/Os on the M1?
What is the significance of changing the duty cycle in the I²C protocol? The feature is available in most advanced microcontrollers.
The duty cycle is significant because the different I²C modes have slightly different duty cycles.
Check the I²C Specification v5 Table 10, pg. 48.
Mode           | t_HIGH (min) | t_LOW (min) | ratio
---------------+--------------+-------------+-------
Standard-mode  | 4.00 µs      | 4.7 µs      | 0.85
Fast-mode      | 0.60 µs      | 1.3 µs      | 0.46
Fast-mode Plus | 0.26 µs      | 0.5 µs      | 0.52
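(For instance, the Standard-mode ratio in the last column is simply the minimum t_HIGH divided by the minimum t_LOW: 4.00 µs / 4.7 µs ≈ 0.85.)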
Your controller would need to decide on one ratio in order to be within the I²C specification.
So for instance, if the controller is using the standard mode timing ratio, this would prevent you from achieving fast mode timings with maximum clock frequency.
These are the ratios defined in the standard for the minimum t_HIGH:t_LOW. However, notice that the 100 kHz period is 10 µs, while t_HIGH + t_LOW from the table adds up to less than 10 µs. Thus, the ratio of the actual values can vary as long as the t_HIGH and t_LOW minimum timings are met.
The point of these ratios is to illustrate that the I²C timing constraints differ between I²C modes. They aren't mandatory ratios that controllers need to keep.
For example, 4 µs high and 6 µs low would be a 0.67 ratio, yet the Standard-mode timings would still be met.
STM32F4 example:
The STM32F4xx series only supports 100 kHz and 400 kHz communication speeds (RM0090, rev 5, pg. 818, Section 27.2).
I don't know where your ratios come from, but the reference manual (RM0090, rev 5, pg. 849, Section 27.6.8) states a 1:1 ratio for Standard-mode and a 1:2 or 9:16 ratio for Fast-mode.
So, for instance, to achieve the highest Standard-mode clock frequency of 100 kHz, t_HIGH and t_LOW each need to be programmed for 5 µs, because the ratio is 1:1.
For Fast-mode with, for example, a 1:2 ratio, you would need to program t_HIGH to 3.33 µs and t_LOW to 6.66 µs for 100 kHz. Yet that would not meet the timing requirements for Standard-mode.
So you cannot use an STM32F4 programmed for Fast-mode while keeping Standard-mode timings at the highest Standard-mode frequency.
And vice versa: you cannot keep the Standard-mode 1:1 ratio and program a 400 kHz Fast-mode clock, because at a 2.5 µs period t_LOW would be 1.25 µs, below the 1.3 µs Fast-mode minimum.
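For completeness, re-deriving the Fast-mode case with the 1:2 ratio (plain arithmetic from the table above, not a value from the reference manual): at 400 kHz the period is 2.5 µs, so a 1:2 split gives t_HIGH ≈ 0.83 µs and t_LOW ≈ 1.67 µs, which satisfies the Fast-mode minimums of 0.60 µs and 1.3 µs.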
I found the calculation below at http://www.gridsouth.com/services/colocation/basics/bandwidth
I have no idea how they came up with the final number of 1.395 Mbps. Can you please help me with the formula used in the example below?
If your network provider bills you on average usage, let's say they sample your Mbps usage 100 times in one month (typically it would be more like every 5 minutes), and of those samples your network usage was measured as follows: 20 times: 0.1 Mbps, 30 times: 1.5 Mbps, 30 times: 1.8 Mbps, 15 times: 1.9 Mbps, and 5 times: 2 Mbps. If you average all these samples, you would be billed at whatever the price is in your contract for 1.395 Mbps of bandwidth.
(20*0.1 + 30*1.5 + ... + 5*2)/100 = 1.395
Looks like the standard definition for a (weighted) average...
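A minimal C sketch of that calculation (the sample counts and rates are taken from the quoted example; everything else is just illustration):

#include <stdio.h>

int main(void) {
    /* (number of samples, measured Mbps) pairs from the billing example */
    const double counts[] = {20, 30, 30, 15, 5};
    const double mbps[]   = {0.1, 1.5, 1.8, 1.9, 2.0};
    double weighted_sum = 0.0, total_samples = 0.0;

    for (int i = 0; i < 5; i++) {
        weighted_sum  += counts[i] * mbps[i];   /* weight each rate by how often it was sampled */
        total_samples += counts[i];
    }
    printf("billed rate = %.3f Mbps\n", weighted_sum / total_samples);   /* prints 1.395 */
    return 0;
}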