How to find overall CPU usage in a multi-tenant environment? - cpu-usage

I have a single server/multiple workers architecture for running distributed queries. When a query finishes, it reports back to the server the total time it took to complete, and I maintain a running counter of the total CPU time (totalCpuTime) across ALL queries. With this counter I want to expose the overall CPU usage of the cluster on the server.
I was thinking of polling totalCpuTime every 3 minutes (the sampling rate). Let's say I have two data points d1 and d2 which are sampling_rate = 1800s apart. I computed cpuUsage = (d2 - d1)/total_number_of_workers * sampling_rate, but this gives me a number greater than 100 and I don't know how to make sense of it.
Any ideas or other approaches?
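For what it's worth, here is a minimal Java sketch (variable names are mine, not from the original system) of how two cumulative CPU-time samples are usually turned into a percentage: divide the consumed CPU-seconds by the CPU-seconds available over the interval (workers x interval), rather than multiplying by the interval.
// Illustrative only: assumes each worker contributes one core.
public class ClusterCpuUsage {
    static double cpuUsagePercent(double d1, double d2, int workers, double intervalSeconds) {
        // CPU-seconds consumed in the interval, divided by the CPU-seconds available in it.
        // Values above 100 usually mean multi-core workers or a wrong interval/worker count.
        return (d2 - d1) / (workers * intervalSeconds) * 100.0;
    }
    public static void main(String[] args) {
        // 900 CPU-seconds consumed by 4 workers over a 1800 s window -> 12.5 (sample numbers)
        System.out.println(cpuUsagePercent(1000.0, 1900.0, 4, 1800.0));
    }
}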

Related

Monitoring CPU Utilization using Prometheus

I am trying to monitor the CPU utilization of the machine on which Prometheus is installed and running. I have the metric process_cpu_seconds_total. I can compute irate or rate of this metric, but I am not sure how to come up with a percentage value for CPU utilization. Is there any way I can use this process_cpu_seconds_total metric to find the CPU utilization of the machine where Prometheus runs?
A late answer for others' benefit too:
If you just want to monitor the percentage of CPU that the Prometheus process uses, you can use process_cpu_seconds_total, e.g. something like:
avg by (instance) (irate(process_cpu_seconds_total{job="prometheus"}[1m]))
However, if you want a general monitor of the machine's CPU, as I suspect you might, you should set up the Node Exporter and then use a query similar to the above with the metric node_cpu_seconds_total, e.g.:
avg by (instance,mode) (irate(node_cpu_seconds_total{mode!='idle'}[1m]))
The rate or irate values are equivalent to a utilization fraction (out of 1), since they measure how many CPU-seconds were used per second, but they usually need to be aggregated across the cores/CPUs of the machine.
Brian Brazil's post on Prometheus CPU monitoring is very relevant and useful: https://www.robustperception.io/understanding-machine-cpu-usage
One way to do this is to leverage proper cgroup resource reporting. Cgroups divide a CPU core's time into 1024 shares, so by knowing how many shares the process consumes you can always find its percentage of CPU utilization.
Now, in your case, you have the rate of change of CPU seconds, which is how much CPU time the process used in the last time unit (assuming 1 s from now on). Then it depends on how many cores you have: 1 CPU over 1 time unit provides 1 CPU-second. So if your rate of change is 3 and you have 4 cores:
3/4 = 75% CPU utilization.
It is only a rough estimate, as your process_cpu_seconds_total reading is probably not very accurate due to delay, latency, etc.
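Expressed as code, a trivial sketch that only restates the arithmetic above:
static double cpuUtilization(double cpuSecondsPerSecond, int cores) {
    // e.g. a rate of 3 CPU-seconds/second on a 4-core machine -> 0.75 (75%)
    return cpuSecondsPerSecond / cores;
}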

Peak read capacity units for a DynamoDB table

I need to find out the peak read capacity units consumed in the last 20 seconds on one of my DynamoDB tables. I need to find this programmatically in Java and set an auto-scaling action based on the usage.
Could you please share a sample Java program to find the peak read capacity units consumed in the last 20 seconds for a particular DynamoDB table?
Note: there are unusual spikes in the requests against the database, hence the need for dynamic auto-scaling.
I've tried this:
DescribeTableResult result = DYNAMODB_CLIENT.describeTable(recomtableName);
// DescribeTable returns the configured throughput, not what was consumed
Long readCapacityUnits = result.getTable()
        .getProvisionedThroughput().getReadCapacityUnits();
but this gives the provisioned capacity, whereas I need the capacity consumed in the last 20 seconds.
You could use the CloudWatch API getMetricStatistics method to get a reading for the capacity metric you require. A hint for the kinds of parameters you need to set can be found here.
For that you have to use CloudWatch, for example with the AWS SDK for Java (v1):
import java.util.Arrays;
import com.amazonaws.services.cloudwatch.AmazonCloudWatch;
import com.amazonaws.services.cloudwatch.AmazonCloudWatchClientBuilder;
import com.amazonaws.services.cloudwatch.model.Dimension;
import com.amazonaws.services.cloudwatch.model.GetMetricStatisticsRequest;
import com.amazonaws.services.cloudwatch.model.GetMetricStatisticsResult;
AmazonCloudWatch client = AmazonCloudWatchClientBuilder.defaultClient();
GetMetricStatisticsRequest metricStatisticsRequest = new GetMetricStatisticsRequest();
metricStatisticsRequest.setStartTime(startDate);
metricStatisticsRequest.setEndTime(endDate);
metricStatisticsRequest.setNamespace("AWS/DynamoDB");
// ConsumedWriteCapacityUnits here; use ConsumedReadCapacityUnits for reads
metricStatisticsRequest.setMetricName("ConsumedWriteCapacityUnits");
metricStatisticsRequest.setPeriod(60); // period in seconds
metricStatisticsRequest.setStatistics(Arrays.asList(
        "SampleCount", "Average", "Sum", "Minimum", "Maximum"));
// Scope the metric to the table in question
Dimension dimension = new Dimension();
dimension.setName("TableName");
dimension.setValue(dynamoTableHelperService.campaignPkToTableName(campaignPk)); // resolves the table name
metricStatisticsRequest.setDimensions(Arrays.asList(dimension));
GetMetricStatisticsResult result = client.getMetricStatistics(metricStatisticsRequest);
But I bet you'd get results older than 5 minutes.
Actually, the current off-the-shelf auto-scaling uses CloudWatch. This has a drawback that is unacceptable for some applications.
When a load spike hits your table, the table does not have enough capacity to respond. Reserving some headroom is not enough, and the table starts throttling. If records are kept in memory while waiting for the table to respond, they can simply blow up the memory. CloudWatch, on the other hand, reacts after some delay, often when the spike is already gone; based on our tests it was at least 5 minutes, and it raised capacity gradually when what was needed was to go straight up to the max.
Long story short: we created a custom solution with our own speedometers. It counts whatever it has to count and changes the table's capacity accordingly. There is still a delay, because
the app itself takes a bit of time to decide what to do, and
the DynamoDB table takes ~30 seconds to get updated with the new capacity.
On top of that we also have a throttling detector, so if a read/write request gets throttled we immediately raise capacity accordingly. Sometimes the capacity level looks all right, yet requests still throttle because of a hot-key issue.
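The answer does not include code, but a rough sketch of such a throttling reaction with the AWS SDK for Java v1 might look like this (the class and method names are made up for illustration; the UpdateTable call itself is the standard SDK API):
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.ProvisionedThroughput;
import com.amazonaws.services.dynamodbv2.model.ProvisionedThroughputExceededException;
import com.amazonaws.services.dynamodbv2.model.UpdateTableRequest;
public class ThrottlingDetector {
    private final AmazonDynamoDB dynamo = AmazonDynamoDBClientBuilder.defaultClient();
    // Raise the table's provisioned throughput; takes roughly 30 s to become active.
    public void raiseCapacity(String tableName, long readUnits, long writeUnits) {
        dynamo.updateTable(new UpdateTableRequest()
                .withTableName(tableName)
                .withProvisionedThroughput(new ProvisionedThroughput(readUnits, writeUnits)));
    }
    // Wrap a read/write call and bump capacity as soon as it gets throttled.
    public void runWithBump(Runnable dynamoCall, String tableName, long readUnits, long writeUnits) {
        try {
            dynamoCall.run();
        } catch (ProvisionedThroughputExceededException e) {
            raiseCapacity(tableName, readUnits, writeUnits);
            throw e; // let the caller retry once the new capacity is live
        }
    }
}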

graphite: how to get per-second metrics from batch metrics?

I'm trying to measure an online mini-batch processing system with a per-second metric (total queries per second). For every batch, a metric (e.g. "stats.gauges.<host>.query.count") is sent to Graphite. Batches are processed on several different hosts in parallel, and a batch of data takes about 5 seconds to process.
I've tried:
simply summing the series: sumSeries(stats.gauges.*.query.count); the resulting metric is many times greater than the actual value;
scaling to 1 second: scaleToSeconds(sumSeries(stats.gauges.*.query.count), 1); the resulting metric is much less than the actual value;
integral then derivative: nonNegativeDerivative(sumSeries(integral(stats.gauges.*.query.count))); same as the first case;
sending gauges with the delta=True param, then derivative; the result is about 20% greater than the actual value.
So, how do I get per-second metrics from batch metrics? What is the best practice?
You should use the carbon-aggregator service to add several metrics together as they come in. There is an example that fits your case at http://graphite.readthedocs.io/en/latest/config-carbon.html#aggregation-rules-conf
As your batch takes 5 seconds to process, the frequency should be 5 so that all the metrics are buffered. After five seconds the aggregator will sum them up and write the result to Graphite.
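A rough sketch of such an aggregation-rules.conf entry (the output metric name is just an illustration; check the exact pattern syntax against the linked docs):
stats.gauges.all.query.count (5) = sum stats.gauges.*.query.count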

Intel MPI distributed memory: building a wall out of M*N blocks using q<M processors

Imagine I have M independent jobs, each with N steps. The jobs are independent of each other, but the steps of each job must be serial. In other words, J(i,j) can be started only after J(i,j-1) is finished (i indicates the job index and j indicates the step). This is isomorphic to building a wall with a width of M and a height of N blocks.
Each block of work should be executed only once. The time it takes to do one block of work on one CPU (all of the same order) differs between blocks and is not known in advance.
The simple way of doing this with MPI is to assign blocks of work to processors and wait until all of them finish their blocks before making the next assignment. This way we can ensure that the precedence constraints are enforced, but there will be a lot of waiting time.
Is there a more efficient way of doing this? I mean, when a processor finishes its block, it could decide, using some kind of environment variable or shared memory, which block it should do next, without waiting for the other processors to finish their blocks and making a collective decision through communication.
You have M jobs with N steps each. You also have a set of worker processes of size W, somewhere between 2 and M.
If W is close to M, the best you can do is simply assign them 1:1. If one worker finishes early that's fine.
If W is much smaller than M, and N is also fairly large, here is an idea:
Estimate some average or typical time for one step to complete. Call this T. You can adjust this estimate as you go in case you have a very poor estimator at the outset.
Divide your M jobs evenly in number among the workers, and start them. Tell the workers to run as many steps of their assigned jobs as possible before a timeout, say T*N/K. Overrunning the timeout slightly to finish the current job is allowed to ensure forward progress.
Have the workers communicate to each other which steps they completed.
Repeat, dividing the jobs evenly again, taking into account how complete each one is (e.g. two 50%-complete jobs count the same as one 0%-complete job); a small sketch of this re-division step is given below.
The idea is to give all the workers enough time to complete roughly 1/K of the total work each time. If no job takes much more than K*T, this will be quite efficient.
It's up to you to find a reasonable K. Maybe try 10.
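Not part of the answer, but to make the re-division step concrete, here is a small Java sketch of one way to re-split jobs by remaining work: greedy largest-remaining-job-first onto the currently least-loaded worker.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
public class Redivide {
    // remainingSteps[i] = steps of job i not yet completed; returns, per worker, the assigned job indices
    static List<List<Integer>> redivide(int[] remainingSteps, int workers) {
        Integer[] jobs = new Integer[remainingSteps.length];
        for (int i = 0; i < jobs.length; i++) jobs[i] = i;
        Arrays.sort(jobs, (a, b) -> remainingSteps[b] - remainingSteps[a]); // biggest remaining work first
        List<List<Integer>> assignment = new ArrayList<>();
        PriorityQueue<long[]> load = new PriorityQueue<>(Comparator.comparingLong(x -> x[0]));
        for (int w = 0; w < workers; w++) {
            assignment.add(new ArrayList<>());
            load.add(new long[]{0, w}); // {remaining work assigned so far, worker index}
        }
        for (int job : jobs) {
            if (remainingSteps[job] == 0) continue;   // job already finished
            long[] least = load.poll();               // least-loaded worker so far
            assignment.get((int) least[1]).add(job);
            least[0] += remainingSteps[job];
            load.add(least);
        }
        return assignment;
    }
}
A 50%-complete job simply contributes half as many remaining steps, which matches the counting rule above.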
Here's an idea, IDK if it's good:
Maintain one shared variable: n = the progress of the farthest-behind task. i.e. the lowest step-number that any of the M tasks has completed. It starts out at 0, because all tasks start at the first step. It stays at 0 until all tasks have completed at least 1 step each.
When a processor finishes a step of a job, check the progress of the job it's currently working on against n. If n < current_job_step - 4, switch tasks, because the one we're working on is too far ahead of the farthest-behind one (a small sketch of this check is given below).
I picked 4 to give a balance between too much switching vs. having too much serial work in only a couple tasks. Adjust as necessary, and maybe make it adaptive as you near the end.
Switching tasks without having two threads both grab the same work unit is non-trivial unless you have a scheduler thread that makes all the decisions. If this is on a single shared-memory machine, you could use locking to protect a priority queue.
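As a shared-memory Java sketch of that rule (illustrative only; an MPI version would keep n on a scheduler rank or in a one-sided communication window):
import java.util.concurrent.atomic.AtomicIntegerArray;
public class ProgressGate {
    private final AtomicIntegerArray completedSteps; // completedSteps[i] = steps finished for job i
    public ProgressGate(int jobs) {
        this.completedSteps = new AtomicIntegerArray(jobs);
    }
    // Called by a worker after finishing one step of `job`; true means "switch to the farthest-behind job".
    public boolean finishStepAndCheck(int job, int maxLead) {
        int done = completedSteps.incrementAndGet(job);
        int n = Integer.MAX_VALUE; // progress of the farthest-behind job
        for (int i = 0; i < completedSteps.length(); i++) {
            n = Math.min(n, completedSteps.get(i));
        }
        return done - n > maxLead; // maxLead = 4 matches the number suggested above
    }
}
Picking which job to switch to without two workers grabbing the same one still needs the locking or scheduler thread mentioned above.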

CPU memory access time

Does the average data and instruction access time of the CPU depend on the execution time of an instruction?
For example, if the miss ratio is 0.1, 50% of instructions need memory access, the L1 access time is 3 clock cycles, the miss penalty is 20 cycles and instructions execute in 1 cycle, what is the average memory access time?
I assume you're talking about a CISC architecture where compute instructions can have memory references. If you have a sequence of ADDs that access memory, then memory requests will come more often than with a sequence of the same number of DIVs, because the DIVs take longer. This won't affect the time of the memory access -- only locality of reference will affect the average memory access time.
If you're talking about a RISC arch, then we have separate memory access instructions. If memory instructions have a miss rate of 10%, then the average access latency will be the L1 access time (3 cycles for hit or miss) plus the L1 miss penalty times the miss rate (0.1 * 20), totaling an average access time of 5 cycles.
If half of your instructions are memory instructions, then that would factor into clocks per instruction (CPI), which would depend on miss rate and also dependency stalls. CPI will also be affected by the extent to which memory access time can overlap computation, which would be the case in an out-of-order processor.
I can't answer your question much better because you're not being very specific. To do well in a computer architecture class, you will have to learn how to compute average access times and CPI.
Well, I'll go ahead and answer your question, but then, please read my comments below to put things into a modern perspective:
Time = Cycles * (1/Clock_Speed) [ unit check: seconds = clocks * seconds/clocks ]
So, to get the exact time you'll need to know the clock speed of your machine, for now, my answer will be in terms of Cycles
Avg_mem_access_time_in_cycles = cache_hit_time + miss_rate*miss_penalty
= 3 + 0.1*20
= 5 cycles
Remember, here I'm assuming your miss rate of 0.1 means 10% of cache accesses miss the cache. If you mean 10% of all instructions, then the per-access miss rate is 0.1/0.5 = 0.2, because only 50% of instrs are memory ops.
Now, if you want the average CPI (cycles per instr)
CPI = mem_instr% * Avg_mem_access_time + non_mem_instr% * Avg_instr_execution_time
= 0.5*5 + 0.5*1 = 3 cycles per instruction
Finally, if you want the average instr execution time, you need to multiply 3 by the reciprocal of the frequency (clock speed) of your machine.
Comments:
Comp. Arch classes basically teach you a very simplified model of what the hardware is doing. Current architectures are much, much more complex, and such a model (i.e. the equations above) is very unrealistic. For one thing, the access time to the various levels of cache can be variable (depending on where the responding cache physically sits on the multi- or many-core CPU); access time to memory (which is typically hundreds of cycles) is also variable depending on contention for resources (e.g. bandwidth), etc. Finally, in modern CPUs instructions typically execute in parallel (ILP), depending on the width of the processor pipeline. This means that adding up instruction execution latencies is basically wrong (unless your processor is a single-issue processor that only executes one instruction at a time and blocks other instructions on miss events such as cache misses and branch mispredicts). However, for educational purposes and for "average" results, the equations are okay.
One more thing, if you have a multi-level cache hierarchy, then the miss_penalty of level 1 cache will be as follows:
L1$ miss penalty = L2 access time + L2_miss_rate*L2_miss_penalty
If you have an L3 cache, you expand L2_miss_penalty the same way, and so on.
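To tie the formulas together, here is a small Java sketch; only the 3-cycle L1 hit time, 0.1 miss rate, 20-cycle penalty and 50/50 instruction mix come from the question, while the L2 figures are made up for illustration.
public class Amat {
    static double amat(double hitTime, double missRate, double missPenalty) {
        return hitTime + missRate * missPenalty;
    }
    public static void main(String[] args) {
        double l1Only = amat(3, 0.1, 20);                              // 3 + 0.1*20 = 5 cycles
        // Two-level hierarchy: L1 miss penalty = L2 access time + L2_miss_rate * L2_miss_penalty
        double l2Access = 10, l2MissRate = 0.05, l2MissPenalty = 100;  // hypothetical L2 figures
        double l1MissPenalty = amat(l2Access, l2MissRate, l2MissPenalty); // 15 cycles
        double twoLevel = amat(3, 0.1, l1MissPenalty);                 // 3 + 0.1*15 = 4.5 cycles
        double cpi = 0.5 * l1Only + 0.5 * 1;                           // 0.5*5 + 0.5*1 = 3 cycles/instr
        System.out.printf("AMAT(L1 only)=%.1f  AMAT(L1+L2)=%.1f  CPI=%.1f%n", l1Only, twoLevel, cpi);
    }
}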
