CPU memory access time

Does the average data and instruction access time of the CPU depend on the execution time of an instruction?
For example, if the miss ratio is 0.1, 50% of instructions need memory access, the L1 access time is 3 clock cycles, the miss penalty is 20 cycles, and instructions execute in 1 cycle, what is the average memory access time?

I assume you're talking about a CISC architecture where compute instructions can have memory references. If you have a sequence of ADDs that access memory, then memory requests will come more often than in a sequence of the same number of DIVs, because the DIVs take longer. This won't affect the time of the memory access -- only locality of reference will affect the average memory access time.
If you're talking about a RISC arch, then we have separate memory access instructions. If memory instructions have a miss rate of 10%, then the average access latency will be the L1 access time (3 cycles for hit or miss) plus the L1 miss penalty times the miss rate (0.1 * 20), totaling an average access time of 5 cycles.
If half of your instructions are memory instructions, then that would factor into clocks per instruction (CPI), which would depend on miss rate and also dependency stalls. CPI will also be affected by the extent to which memory access time can overlap computation, which would be the case in an out-of-order processor.
I can't answer your question much better because you're not being very specific. To do well in a computer architecture class, you will have to learn how to compute average access times and CPI.

Well, I'll go ahead and answer your question, but then, please read my comments below to put things into a modern perspective:
Time = Cycles * (1/Clock_Speed) [ unit check: seconds = clocks * seconds/clocks ]
So, to get the exact time you'll need to know the clock speed of your machine; for now, my answer will be in terms of cycles.
Avg_mem_access_time_in_cycles = cache_hit_time + miss_rate*miss_penalty
= 3 + 0.1*20
= 5 cycles
Remember, here I'm assuming your miss rate of 0.1 means 10% of cache accesses miss the cache. If you mean that 10% of all instructions miss, then the per-access miss rate you plug in is actually 0.1 / 0.5 = 0.2 (because only 50% of instrs are memory ops).
Now, if you want the average CPI (cycles per instr)
CPI = mem_instr% * Avg_mem_access_time + other_instr% * Avg_instr_execution_time
= 0.5*5 + 0.5*1 = 3 cycles per instruction
Finally, if you want the average instruction execution time, you need to multiply the 3 cycles per instruction by the clock period, i.e. the reciprocal of the frequency (clock speed) of your machine.
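If you want to plug the numbers in yourself, the two formulas above boil down to a few lines of Python (the 1 GHz clock at the end is a made-up value, only there to show the unit conversion):
    l1_hit_time = 3         # L1 hit time, cycles
    miss_rate = 0.1         # misses per cache access
    miss_penalty = 20       # cycles to service an L1 miss
    amat = l1_hit_time + miss_rate * miss_penalty               # 3 + 0.1*20 = 5 cycles
    mem_frac = 0.5          # fraction of instructions that access memory
    other_instr_time = 1    # cycles for a non-memory instruction
    cpi = mem_frac * amat + (1 - mem_frac) * other_instr_time   # 0.5*5 + 0.5*1 = 3
    clock_hz = 1e9          # assumed 1 GHz clock, for illustration only
    print(amat, cpi, cpi / clock_hz)   # 5 cycles, 3 cycles/instr, 3e-09 s/instr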
Comments:
Comp. Arch classes basically teach you a very simplified model of what the hardware is doing. Current architectures are much, much more complex, and such a model (i.e. the equations above) is very unrealistic. For one thing, access time to the various levels of cache can be variable (depending on where the responding cache physically sits on a multi- or many-core CPU); access time to memory (which is typically 100s of cycles) is also variable, depending on contention for resources (e.g. bandwidth), etc. Finally, in modern CPUs instructions typically execute in parallel (ILP), depending on the width of the processor pipeline. This means that adding up instruction execution latencies is basically wrong (unless your processor is a single-issue processor that only executes one instruction at a time and blocks other instructions on miss events such as cache misses and branch mispredicts...). However, for educational purposes and for "average" results, the equations are okay.
One more thing, if you have a multi-level cache hierarchy, then the miss_penalty of level 1 cache will be as follows:
L1$ miss penalty = L2 access time + L2_miss_rate*L2_miss_penalty
If you have an L3 cache, you expand L2_miss_penalty the same way, and so on.
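As a made-up numeric example (the L2 hit time of 10 cycles, L2 miss rate of 20% of L2 accesses, and 100-cycle penalty to memory below are assumptions, not numbers from the question):
    l1_hit_time, l1_miss_rate = 3, 0.1       # from the question
    l2_hit_time, l2_miss_rate = 10, 0.2      # assumed L2 numbers
    l2_miss_penalty = 100                    # assumed time to go to memory, cycles
    l1_miss_penalty = l2_hit_time + l2_miss_rate * l2_miss_penalty   # 10 + 0.2*100 = 30 cycles
    amat = l1_hit_time + l1_miss_rate * l1_miss_penalty              # 3 + 0.1*30 = 6 cycles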

Related

Completely simultaneous execution of two instructions (RISCV)

The question comes from a RISCV implementation, but I think it may also apply to many other architectures.
From a code with two completely independent instructions in sequence (generic ISA notation):
REG1 = REG2 + REG3
REG4 = REG5 + REG6
In a pipelined implementation, assuming there are no other hazards (simultaneous r/w access to the registers is possible and there are two independent adders), is it a violation of the ISA if the two instructions are executed completely in parallel?
In other words, at the same clock edge, can the 3 registers (REG1, REG4 and PC) be updated at once (PC+8 for the RISCV-32 example)?
No, clearly there's no problem, since real CPUs do this all the time. (e.g. Intel since Haswell can run 4 independent add instructions per clock: https://www.realworldtech.com/haswell-cpu/4/ https://uops.info/ https://agner.org/optimize/).
It only has to maintain the illusion of having run instructions one at a time, following the ISA's sequential execution model. The same concept as the C "as-if" rule applies.
If the ISA doesn't guarantee anything about timing (for example, that you can delay N clock cycles with N nop or other instructions), nothing stops a specific implementation from doing as much work as possible in a clock cycle. (Some microcontrollers do have specific timing guarantees or specifications, so code can delay for N cycles with delay loops. Or at least specific implementations of some ISAs have such guarantees.)
It's 100% normal for modern CPUs to average more than 1 instruction per clock, despite stalling sometimes on cache misses and branch mispredicts, so that clearly means fetching, decoding, and executing multiple instructions per clock cycle in other cycles. See also Modern Microprocessors: A 90-Minute Guide! for some basics of superscalar in-order and out-of-order pipelines.

Why wouldn't a small Firebase Functions app just use a single Function to handle logic?

...aside from the benefit in separate performance monitoring and logging.
For logging, I am confident I can get granularity through manually adding the name of the "routine" to each call. This is how it is now with several discrete Functions for different parts of the system:
There are multiple automatic logs: start and finish of the routine, for example. It would be more challenging to find out how expensive certain routines are, but it would not be impossible.
The reason I want the entire logic of the application handled by a single handler function is to reduce cold starts: one function means only one container that has to be kept persistently alive when there are very few users of the app.
If a month is ~2.6m seconds and we assume the system uses 1 GB RAM and 1 GHz CPU frequency at all times, that's:
2600000 * 0.0000025 + 2600000 * 0.000001042 = USD$9.21 a month
...for one minimum instance.
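That estimate in code form, using the per-GB-second and per-GHz-second rates quoted above (taken from this question, not from a current price sheet):
    seconds_per_month = 2_600_000       # ~30 days
    gb_second_rate = 0.0000025          # USD per GB-second, as quoted above
    ghz_second_rate = 0.000001042       # USD per GHz-second, as quoted above
    cost = seconds_per_month * 1 * gb_second_rate + seconds_per_month * 1 * ghz_second_rate
    print(round(cost, 2))               # ~9.21 USD/month for one 1 GB / 1 GHz instance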
I should also state that all of my functions have the bare minimum amount of global scope code; it just sets up Firebase assets (RTDB and Firestore).
From a billing, performance (based on user wait time), and user/developer experience perspective, is there any reason why it would be smart to keep all my functions discrete?
I'd also accept an answer saying "one single function for all logic is reasonable" as long as there's a reason for it.
Thanks!
If you have a very small app with ~5 endpoints and very low traffic, sure, you could do something like this. But here is why you might not want to:
billing and performance
The important thing to realize is that each instance of your function handles only one request at a time, so under load new instances are created and there could be 10s of them running simultaneously.
If you would like to have just 1 instance handling all the traffic, you should explore GCP Cloud Run, where you have 1 container handling multiple requests and scaling only when that's not sufficient.
Imagine you have several endpoints and each of them has different performance requirements.
One might need only 128 MB of RAM.
Another might need 1 GB of RAM.
(FYI: You can control the CPU MHz of the function via the RAM settings too - which can speed up execution in some cases)
If you had only 1 function with 1 GB of RAM, every request would allocate that function, and in some cases most of the memory would go to waste.
But if you split it into multiple functions, some requests will need far fewer resources, which can save you money once you get to a bigger number of executions per month (tens of thousands and up).
Let's imagine a function with a 3-second execution time and 10k executions/month:
128MB would cost you $0.0693
1024MB would cost you $0.495
As you can see, with a small app the difference can be negligible, but at scale it matters. (*The cost can vary based on the datacenter.)
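For what it's worth, those two figures can be reproduced with a small cost function, assuming the gen-1 pairing of roughly 0.2 GHz of CPU with 128 MB and 1.4 GHz with 1024 MB, plus the standard per-GB-second / per-GHz-second rates (both are assumptions; real pricing varies by region and product generation):
    def monthly_cost(mem_gb, cpu_ghz, seconds_per_call=3, calls_per_month=10_000,
                     gb_second_rate=0.0000025, ghz_second_rate=0.0000100):
        # assumed unit prices; check the current pricing page for real numbers
        gb_seconds = mem_gb * seconds_per_call * calls_per_month
        ghz_seconds = cpu_ghz * seconds_per_call * calls_per_month
        return gb_seconds * gb_second_rate + ghz_seconds * ghz_second_rate
    print(monthly_cost(0.125, 0.2))   # ~0.069 USD for the 128 MB tier
    print(monthly_cost(1.0, 1.4))     # ~0.495 USD for the 1024 MB tier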
As for the logging, I don't think it matters. Usually in bigger systems there can be messages traveling through several functions, so you have to deal with that anyway.
As for the cold start, you just need good UI to accommodate it. At first I was worried about it in our apps, but later on you get used to the fact that some action can take ~2 s to execute (cold start). And you should have a "loading" state in the UI regardless, because you don't know whether the call will take ~100 ms or 3 s due to a bad connection.

Does pipelining affect the clock time or the cycles per instruction (CPI)?

My book mentions: "Depending on what you consider as the baseline, the reduction can be viewed as decreasing the number of clock cycles per instruction (CPI), as decreasing the clock cycle time, or as a combination. If the starting point is a processor that takes multiple clock cycles per instruction, then pipelining is usually viewed as reducing the CPI."
What I fail to understand is whether pipelining affects the CPI or the clock period. With pipelining, the clock period is taken as the max stage delay + latch delay, so pipelining does affect the clock time. It also affects the CPI, because the CPI becomes 1 with pipelining. Am I missing some concept?
Executing an instruction requires a set of operations. For the sake of simplicity assume there are 5:
fetch, instruction decode, execute, memory access, write back.
This can be implemented with several schemes.
A/ Mono cycle processor
The scheme is the following:
The processor fetches an instruction, directs it to a decoder that controls a bank of multiplexers that will configure a large combinatorial datapath that will implement the instruction.
In this model, every instruction requires one cycle, and, assuming all the 5 "stages" require an equal time t, the period will be 5t.
Hence CPI=1, T=5
Actually, this was more or less the underlying model of the earliest computers in the late 40's. Other than those, no real processor has been built like that, but it is theoretically quite doable.
B/ Multi cycle processor
Compared to the previous model, you introduce registers on the datapath. The processor first fetches the instruction and sends it to the inputs of an automaton that will sequentially apply the computation "stages".
In that case, instructions require 5 cycles (maybe slightly less, as some instructions may be simpler and, for instance, skip the memory access). The period is 1t (or maybe slightly more to take into account the register traversal time).
CPI=5, T=1
The first "true" computers were implemented like that and this was the main architectural model up to the early 80's. Nowadays several microcontrollers or, for instance, the simpler version of NIOS, are still relying on this scheme.
C/ pipeline processor
You add extra registers between the stages in order to keep track of the instruction and of all the partial results. In that case, the execution of every stage can be independent and you can execute several instructions simultaneously in different stages.
CPI becomes 1, as you can start a new instruction at every clock cycle (probably a bit more because of the hazards, but that is another story).
And T=1.
So CPI=1, T=1
(the CPI reflects the throughput increase but the execution time of a single instruction is not reduced)
So pipelining can be seen either as reducing the cycle time w.r.t. scheme A, or as reducing the CPI w.r.t. scheme B. And you can also imagine an intermediate scheme (say 3 stages, with a period of 2t) where pipelining reduces both.
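As a rough sanity check of the three schemes, here is the total time to push n instructions through 5 stages of delay t each, ignoring hazards and latch delays (an idealized sketch, not a real model):
    def mono_cycle(n, t=1.0):   return n * (5 * t)        # CPI = 1, period = 5t
    def multi_cycle(n, t=1.0):  return (5 * n) * t        # CPI = 5, period = t
    def pipelined(n, t=1.0):    return (5 + n - 1) * t    # fill the pipe once, then one per cycle
    for n in (1, 10, 1000):
        print(n, mono_cycle(n), multi_cycle(n), pipelined(n))
    # For large n the pipelined version approaches n*t, i.e. about 5x faster than A or B.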

Intel MPI distributed memory: building a wall out of M*N blocks using q<M processors

Imagine I have M independent jobs, and each job has N steps. The jobs are independent from each other, but the steps of each job must be serial. In other words, J(i,j) can be started only after J(i,j-1) is finished (i indicates the job index and j indicates the step). This is isomorphic to building a wall with a width of M and a height of N blocks.
Each block should be executed only once. The time it takes to do one block of work on one CPU (roughly the same order for all blocks) differs from block to block and is not known in advance.
The simple way of doing this with MPI is to assign blocks of work to processors and wait until all of them finish their blocks before the next assignment. This way we can ensure that the priorities are enforced, but there will be a lot of waiting time.
Is there a more efficient way of doing this? I mean that when a processor finishes its block, it could decide, using some kind of environment variables or shared memory, which block it should do next, without waiting for the other processors to finish their blocks and making a collective decision via communication.
You have M jobs with N steps each. You also have a set of worker processes of size W, somewhere between 2 and M.
If W is close to M, the best you can do is simply assign them 1:1. If one worker finishes early that's fine.
If W is much smaller than M, and N is also fairly large, here is an idea:
Estimate some average or typical time for one step to complete. Call this T. You can adjust this estimate as you go in case you have a very poor estimator at the outset.
Divide your M jobs evenly in number among the workers, and start them. Tell the workers to run as many steps of their assigned jobs as possible before a timeout, say T*N/K. Overrunning the timeout slightly to finish the current job is allowed to ensure forward progress.
Have the workers communicate to each other which steps they completed.
Repeat, dividing the jobs evenly again taking into account how complete each one is (e.g. two 50% complete jobs count the same as one 0% complete job).
The idea is to give all the workers enough time to complete roughly 1/K of the total work each time. If no job takes much more than K*T, this will be quite efficient.
It's up to you to find a reasonable K. Maybe try 10.
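A pure-Python toy of that loop (no MPI calls; the random step times and all constants are just illustrative stand-ins):
    import random
    M, N, W, K = 20, 50, 4, 10           # jobs, steps per job, workers, chunking factor (made up)
    T = 1.0                              # current estimate of the average step time
    progress = [0] * M                   # steps completed per job
    def run_round(assigned_jobs, budget):
        # Worker loop: run steps of the assigned jobs until the time budget is used up.
        spent = 0.0
        for j in assigned_jobs:
            while progress[j] < N and spent < budget:
                spent += random.uniform(0.5 * T, 1.5 * T)   # stand-in for the real step time
                progress[j] += 1
    while any(p < N for p in progress):
        # hand out the unfinished jobs evenly, least-complete first
        todo = sorted((j for j in range(M) if progress[j] < N), key=lambda j: progress[j])
        for w in range(W):
            run_round(todo[w::W], budget=T * N / K)   # with MPI these rounds run concurrently
        # (here the real workers would exchange which steps they completed)
    print(progress)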
Here's an idea, IDK if it's good:
Maintain one shared variable: n = the progress of the farthest-behind task. i.e. the lowest step-number that any of the M tasks has completed. It starts out at 0, because all tasks start at the first step. It stays at 0 until all tasks have completed at least 1 step each.
When a processor finishes a step of a job, check the progress of the step it's currently working on against n. If n < current_job_step - 4, switch tasks because the one we're working on is too far ahead of the farthest-behind one.
I picked 4 to give a balance between too much switching vs. having too much serial work in only a couple tasks. Adjust as necessary, and maybe make it adaptive as you near the end.
Switching tasks without having two threads both grab the same work unit is non-trivial unless you have a scheduler thread that makes all the decisions. If this is on a single shared-memory machine, you could use locking to protect a priority queue.
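A rough shared-memory sketch of that rule (plain Python with a lock standing in for whatever synchronization you actually have; the job count and helper name are made up):
    import threading
    M = 8                                   # assumed number of jobs/tasks
    MAX_LEAD = 4                            # the "4" from the answer above
    progress = {j: 0 for j in range(M)}     # steps completed per job
    lock = threading.Lock()
    def finished_step(job):
        # Call when a worker finishes one step of `job`; returns the job to work on next.
        with lock:
            progress[job] += 1
            n = min(progress.values())      # progress of the farthest-behind job
            if progress[job] - n > MAX_LEAD:
                # too far ahead: switch to (one of) the farthest-behind jobs
                return min(progress, key=progress.get)
            return job
    # Note: as written, two workers can still pick the same farthest-behind job;
    # avoiding that needs the scheduler or locking scheme mentioned above.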

How to improve the speed of MPI_Scatter/MPI_Gather?

I found that the time used for MPI_Scatter/MPI_Gather increases continuously (roughly linearly) as the number of workers increases, especially when the workers are spread across different nodes.
I thought that MPI_Scatter/MPI_Gather is a parallel operation, and I wonder what leads to the increase above. Is there any trick to make it faster, especially for workers distributed across compute nodes?
The root rank has to push a fixed amount of data to the other ranks. As long as all ranks reside on the same compute node, the process is limited by the memory bandwidth available. Once more nodes become involved, the network bandwidth, usually much lower than the memory bandwidth, becomes the limiting factor.
Also, the time to send a message is roughly divided into two parts: the initial (network setup and MPI protocol handshake) latency, and then the time it takes to physically transfer the actual data bits. As the amount of data is fixed, the total physical transfer time remains the same (as long as the transport type and therefore the bandwidth stay the same), but more setup/latency overhead is added with each new rank that data is scattered to or gathered from, hence the linear increase in the time it takes to complete the operation.
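A crude way to see that is to model a naive linear scatter, where the root sends one message per receiving rank out of a fixed total payload (alpha, beta and the payload size below are illustrative assumptions, and real implementations often use tree algorithms instead):
    def linear_scatter_time(num_ranks, total_bytes, alpha=2e-6, beta=12.5e9):
        # alpha: per-message setup/handshake latency in seconds (assumed)
        # beta:  bandwidth out of the root in bytes/second (assumed)
        return (num_ranks - 1) * alpha + total_bytes / beta
    for p in (2, 8, 32, 128):
        print(p, linear_scatter_time(p, 1 << 20))
    # The transfer term stays constant while the per-rank setup term grows,
    # so the total time increases roughly linearly with the number of ranks.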
How MPI_Scatter/MPI_Gather works varies between implementations. Some MPI implementations may choose to use a series of MPI_Send calls as the underlying mechanism.
The parameters that may affect how MPI_Scatter works are:
1. Number of processes
2. Size of data
3. Interconnect
For example, an implementation may avoid using a broadcast for a very small number of ranks sending/receiving very large data.
