How does powermetrics work? - apple-m1

Apple seems to have upgraded the output of powermetrics on M1 CPUs to include reports of consumed power. Output looks roughly like this:
sudo powermetrics | grep -i power
....
E-Cluster Power: 230 mW
P0-Cluster Power: 3475 mW
P1-Cluster Power: 268 mW
ANE Power: 0 mW
DRAM Power: 1037 mW
CPU Power: 3973 mW
GPU Power: 125 mW
Package Power: 7348 mW
Are any of these reported powers actually measured, e.g. taken off of the voltage regulators? Or is it a case of a table lookup, i.e. the OS has a pre-characterized table of estimated power for given CPU/GPU core workloads and just returns a value from it?
What is included in Package Power? I would have expected the sum ANE + CPU + GPU + DRAM to be close to the total Package Power. Or is the difference caused by the power for all the glue logic surrounding the CPU/GPU/ANE and all the I/Os on the M1?
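As a quick sanity check on the sample output above, here is the arithmetic behind that question (a minimal sketch in R, using only the numbers quoted above):

# Component powers from the sample output, in mW
# Note: the cluster rails already sum to the CPU figure:
# 230 + 3475 + 268 = 3973 mW
components <- c(ANE = 0, CPU = 3973, GPU = 125, DRAM = 1037)
package <- 7348
sum(components)            # 5135 mW
package - sum(components)  # 2213 mW not covered by the listed rails

So roughly 2.2 W of the package figure is unaccounted for by the listed rails, which is the gap the question asks about.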

Related

Weak scaling of mpi program for matrix-vector multiplication

I have written some MPI code that solves systems of equations using the conjugate gradient method. In this method, matrix-vector multiplication takes up most of the time. As a parallelization strategy, I do the multiplication in blocks of rows and then gather the results in the root process. The remaining steps are performed by the root process, which broadcasts the results whenever a matrix-vector multiplication needs to be performed.
The strong scaling curve, representing the speedup, is fine, but the weak scaling curve, representing the efficiency, is quite bad. In theory, the blue curve should be close to the red one.
Is this intrinsic to the parallelization strategy or am I doing something wrong?
Details
The measurements are in seconds. The experiments are performed on a cluster where each node has 2 Skylake processors running at 2.3 GHz with 18 cores each, 192 GB of DDR3 RAM and an 800 GB NVMe local drive. Amdahl's prediction is computed with the formula (0.0163 + 0.9837/p)^-1. Gustafson's prediction is computed with the formula 0.9837 + 0.0163/p, where p is the number of processors. The experimental values are in both cases obtained by dividing the time spent by a single computation unit by the time spent by p computation units.
For weak scaling, I start with a load per processor of W = 1768^2 matrix entries. Then the load with p processors will be M^2 = pW matrix entries. Thus, we set the matrix's side to M = 1768 \sqrt{p} for p processes. This gives sides of 1768, 2500, 3536, 5000, 7071 and 10000 for 1, 2, 4, 8, 16 and 32 processors respectively. I also fix the number of iterations to 500 so that the measurements are not affected by the variability in the data.
I think your Gustafson formula is wrong. It should be:
S_p = F_s + p F_p
You have a division that should be a multiplication. See for instance https://theartofhpc.com/istc/parallel.html#Gustafson'slaw
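To make the difference concrete, here is a minimal R sketch that computes both predictions from the serial fraction F_s = 0.0163 quoted in the question:

# Serial and parallel fractions from the question
Fs <- 0.0163
Fp <- 1 - Fs
p  <- c(1, 2, 4, 8, 16, 32)

amdahl    <- 1 / (Fs + Fp / p)  # strong scaling: fixed total work
gustafson <- Fs + p * Fp        # weak scaling: work grows with p

data.frame(p, amdahl, gustafson)

If you want efficiency rather than speedup, divide by p: gustafson / p = Fp + Fs / p, which stays close to 1 for all p.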

What is the clock rate of my chip processor?

My "My PC" displays that my processor is "i7-8550U CPU # 1.80 GHz 1.99 GHz". I wonder which is my intel chip clock rate, 1.8 or 1.99?
1.99 GHz is probably the current operating frequency, while 1.80 GHz is the "sticker frequency": the rated sustained frequency under all workload/temperature conditions.
https://en.wikipedia.org/wiki/Intel_Turbo_Boost.
It's an i7, so it does support Turbo.

How can I substitute TRNG probabilities in R?

I have a high-performance hardware true random number generator. When I run its output through ENT, I get the following report:
Entropy = 8.000000 bits per byte.
Optimum compression would reduce the size of this 1073741824 byte file
by 0 percent.
Chi square distribution for 1073741824 samples is 247.87, and randomly
would exceed this value 61.38 percent of the times.
Arithmetic mean value of data bytes is 127.4957 (127.5 = random).
Monte Carlo value for Pi is 3.141666379 (error 0.00 percent).
Serial correlation coefficient is 0.000056 (totally uncorrelated = 0.0).
In R I have been taking the random bytes and using them in sample() as a probability vector.
This takes the form of sample(x, size, replace=FALSE, prob=myprobvector), which works fine until I test the entropy of the result and find out it is not much better than the Mersenne-Twister, which gives 99.9% entropy efficiency. In my case I am using a base of 75 choices, so log2(75) = 6.22881869. The entropy of the MT approach is 6.223578221 and with the probability vector it is 6.223578468. I believe it is still using MT and just using the probability vector to add weights and move things around. How can I get it to use just the values off the hardware? (Assume they are being read like a file.)
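One way to bypass R's internal RNG entirely is to map the hardware bytes to indices yourself. A minimal sketch, assuming the TRNG output is available as a raw byte stream in a file (the path trng.bin is hypothetical):

# sample()'s prob argument only reweights draws that still come from
# R's own RNG (Mersenne-Twister by default). To use the hardware
# values themselves, read raw bytes and map them to indices directly.
n_choices <- 75
con <- file("trng.bin", "rb")  # hypothetical path to the TRNG byte stream

# Rejection sampling avoids modulo bias: a byte ranges over 0..255,
# and 256 is not a multiple of 75, so keep only bytes below the
# largest multiple of n_choices that fits (3 * 75 = 225).
limit <- (256 %/% n_choices) * n_choices
draw_index <- function() {
  repeat {
    b <- as.integer(readBin(con, "raw", n = 1))
    if (b < limit) return(b %% n_choices + 1)
  }
}

x <- seq_len(n_choices)               # your 75 choices
picks <- replicate(10, draw_index())  # 10 draws with replacement
x[picks]
close(con)

For sampling without replacement, as in your sample(x, size, replace=FALSE, ...) call, you can drive a Fisher-Yates shuffle with the same idea, drawing each swap index from a shrinking range instead of from the full 0..255.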

I2C duty cycle significance

What is the significance of changing the duty cycle in the I²C protocol? The feature is available in most of the advanced microcontrollers.
The duty cycle is significant, because different I²C modes have slightly different duty cycles.
Check the I²C Specification v5 Table 10, pg. 48.
Mode           | t_HIGH (min) | t_LOW (min) | ratio
---------------+--------------+-------------+------
Standard-mode  | 4.00 us      | 4.7 us      | 0.85
Fast-mode      | 0.60 us      | 1.3 us      | 0.46
Fast-mode Plus | 0.26 us      | 0.5 us      | 0.52
Your controller would need to decide on one ratio in order to be within the I²C specification.
So for instance, if the controller is using the standard mode timing ratio, this would prevent you from achieving fast mode timings with maximum clock frequency.
These are the ratios as defined in the standard for minimal t_HIGH:t_LOW. However, notice that the 100 kHz period is 10 us, but t_HIGH + t_LOW from the table is less than 10 us. Thus, the ratio of the actual values can vary as long as the t_HIGH and t_LOW minimum timings are met.
The point of these ratios is to illustrate that I²C timing constraints are different between I²C modes. They aren't mandatory ratios that controllers need to keep.
For example, 4 us high, 6 us low would be a 0.67 ratio, yet Standard-mode timings would be met.
STM32F4 example:
The STM32F4xx series only supports 100 kHz and 400 kHz communication speeds (RM0090, rev 5, pg. 818, Section 27.2).
I don't know where your ratios come from, but the reference manual states (RM0090, rev 5, pg. 849, Section 27.6.8) a 1:1 ratio for standard mode, and 1:2 or 9:16 ratio for fast mode.
So for instance, to achieve the highest standard mode clock frequency of 100 kHz, t_HIGH and t_LOW need to be programmed for 5 us, because the ratio is 1:1.
For Fast-mode, for example with a 1:2 ratio, you would need to program t_HIGH to 3.33 us and t_LOW to 6.66 us for 100 kHz. Yet that would not meet timing requirements for Standard-mode.
So you cannot use STM32F4 programmed for Fast-mode while keeping Standard-mode timings at highest Standard-mode frequency.
And vice versa: you cannot stay in Standard-mode and program 400 kHz, because with the default 1:1 ratio the 2.5 us period would give t_LOW = 1.25 us < 1.3 us, which is out of spec.
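As a quick numeric check of those constraints, here is a small sketch in R that computes t_HIGH and t_LOW from a target clock and a high:low ratio (the ratios are the ones discussed above):

# High/low times for a target I²C clock, given a t_HIGH:t_LOW ratio
i2c_timing <- function(f_scl, high, low) {
  period <- 1 / f_scl
  c(t_high_us = 1e6 * period * high / (high + low),
    t_low_us  = 1e6 * period * low  / (high + low))
}

i2c_timing(100e3, 1, 1)  # Standard-mode, 1:1 -> 5.00 / 5.00 us, within spec
i2c_timing(100e3, 1, 2)  # Fast-mode ratio at 100 kHz -> 3.33 / 6.67 us
i2c_timing(400e3, 1, 1)  # 1:1 at 400 kHz -> t_LOW = 1.25 us < 1.3 us minimum
i2c_timing(400e3, 1, 2)  # Fast-mode, 1:2 -> 0.83 / 1.67 us, meets 0.6 / 1.3 us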

How do I code a cost optimization function in R?

I'm busy with a small project where a large number of samples has been taken from a manufacturing process (2700 samples of 11 items each). Upper and Lower Specification Limits have been set, and items under the LSL are said to cost $3 to fix, while items above the USL are said to cost $5 to fix. The data follows a uniform distribution.
How would I go about deciding where to centre the process (given that the distribution would stay the same around the centre line) to minimize total cost? I know how to do it iteratively, but I'd like a more direct way to solve this problem.
EDIT: Here is an example of the data I'm working with.
One sample would be, for instance
45.62565379
47.06496942
46.39000538
46.44387364
45.81911053
45.25935862
48.75357907
46.50918593
46.87072887
46.60195194
48.09000017
There are 2701 more samples like the one above (albeit with different values) making up my population. The population mean is 47.66 and population standard deviation is 1.425. The UCL is 48.98 and the LCL is 46.34. The USL has been set to 50 and the LSL to 45.
Currently the process is centred on the population mean, but the proportion of samples with means above 50 is larger than the proportion with means below 45, which makes the process more expensive, as it costs $5 to fix a batch above the USL and only $3 to fix one below the LSL. How do I decide where to centre the process, given that its distribution around the centre line will remain the same, so as to minimize cost?
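Here is a minimal sketch of the non-iterative approach, assuming for illustration that the relevant quantity is normal with the quoted standard deviation; the same optimize() pattern works with whatever distribution fits your data (e.g. the uniform spread you describe) by swapping the pnorm calls for the matching CDF:

# Expected cost per batch as a function of the process centre mu:
# $3 for everything below the LSL, $5 for everything above the USL
lsl <- 45; usl <- 50; sdev <- 1.425  # quoted population sd
expected_cost <- function(mu) {
  3 * pnorm(lsl, mean = mu, sd = sdev) +
  5 * (1 - pnorm(usl, mean = mu, sd = sdev))
}

# One-dimensional minimization over the plausible range of centres
opt <- optimize(expected_cost, interval = c(lsl, usl))
opt$minimum    # optimal centre, slightly below the midpoint 47.5
opt$objective  # expected cost per batch at that centre

Because the $5 repair above the USL is more expensive than the $3 repair below the LSL, the optimum sits a bit below the midpoint (45 + 50) / 2 = 47.5, shifting probability mass away from the costly upper tail.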
