Branch prediction exercise - pipeline

Assume a machine with a 7-stage pipeline. Assume that branches are resolved in the sixth stage. Assume that 30% of instructions are branches.
3.1 How many instructions of wasted work are there per branch mis-prediction on this machine?
3.2 Assume 1000 instructions are on the correct path of a program and assume a branch predictor accuracy of 10%. How many instructions are fetched on this machine?
3.3 Let’s say we modified the machine so that it used dual path execution (where an equal number of instructions are fetched from each of the two branch paths). Assume branches are resolved before new branches are fetched. Write how many instructions would be fetched in this case, as a function of N. (Please show your work for full credit.)
Attempt at a solution
3.1) 4 assuming bypassing at s6
3.2) 30% branches, predictor accuracy: 10%, 1000 instructions, so there are 300 branches and 700 normal instructions. Let x be the total number of instructions fetched, so
$$0.1x=700$$
$$x=7000$$
Therefore, 7000 ins were fetched
3.3) I am not sure about this one. It confuses me in the following way: how do we execute instructions on different paths without considering dependences? What is a dual path in this case? Can you explain in more detail what this question is asking?

3.1 5 instructions, no bypassing
3.2 Assume base CPI = 1
CPI = 1 + 30% * 90% * 5 = 2.35
Branches cause a 135% slowdown.
2350 instructions are fetched on this machine.
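A worked version of the arithmetic behind this answer (assuming one instruction is fetched per cycle, so the 5 wrong-path instructions in stages 1 through 5 are flushed when the branch resolves in stage 6):
$$\text{wasted per misprediction} = 6 - 1 = 5 \text{ instructions}$$
$$\text{fetched} = 1000 + (1000 \times 0.3) \times 0.9 \times 5 = 1000 + 1350 = 2350$$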

Related

SKU comparison for ADX VM

This link shows a comparison among the various VM SKUs available for an ADX cluster. My question is about the following two SKUs:
D14 v2 (category: compute-optimized), SSD: 614 GB, cores: 16, RAM: 112 GB
DS14 v2 + 4 TB PS (category: storage-optimized), SSD: 4 TB, cores: 16, RAM: 112 GB
Purely looking at the numbers (SSD, RAM, cores), it looks like #2 has everything #1 has, and on top of that #2 has 4 TB of SSD, whereas #1 has only 614 GB. So based on that I would always choose #2 over #1. What is the meaning of the category, then? #1 falls in the category "compute-optimized" whereas #2 belongs to "storage-optimized". If a category is decided on the basis of the configuration listed here, then we should be able to call #2 both storage- and compute-optimized, because #2 has the same compute as #1 plus something extra. So why is #2 listed only as storage-optimized? I am trying to understand whether there is an additional edge to using #1 over #2 for compute-intensive jobs, because if I just look at the numbers I don't see any reason (apart from cost, which is not that different anyway) why I shouldn't use #2 over #1. Perhaps #1 has something unique that is missing in #2 and is not specified in that link.
Based on your question, it appears you're largely disregarding the consideration of cost. The following table (in the same doc you've linked to) summarizes the main considerations for choosing a SKU; you can see one of them is Cost per GB cache per core.
Another example: let's assume you can reach the same total cache (SSD) size with either SKU you mentioned. With one, your cluster will have X nodes, and with the other Y nodes. If Y > X, data in the second cluster will be distributed across more nodes, allowing more parallelism during ingestion and queries. Of course, the cost of the two options could differ.
Last, given that cost isn't irrelevant in your case, I would strongly recommend that you consult the cost estimator and see how a different choice of SKU affects the total estimated cost of your cluster (given that you know the volumes of data you're dealing with).

Display used CPU hours with slurm

I have a user account on a supercomputer where jobs are handled with Slurm.
I would like to know the total amount of CPU hours that I have consumed on this supercomputer. I think that's an understandable question, because there is only a limited number of CPU hours available per project. I'm surprised that an answer is not easy to find.
I know that there are all these commands like sacct, sreport, sshare, etc... but it seems that there is no simple command that displays the used CPU hours.
Can someone help me out?
As others have commented, sacct should give you that information. You will need to look at the man page to get information for past jobs. You can specify --starttime and --endtime to restrict your query to match your allocation as it ends/renews. The -l option gives you more information than you need; you can get a smaller set of fields by specifying exactly what you need with --format.
In your case, the correct answer is to ask the administrators. You have been given an allocation of time to draw from. They likely have a system that will show you your balance, and you can reconcile that balance against the output of sacct. Also, if the system you are using has different node types, such as high-memory, GPU, MIC, or older nodes, they will likely charge you differently for those resources.
You can get an overview of the used CPU hours with the following:
sacct -S YYYY-MM-DD -u username -o jobid,start,end,alloccpus,cputime | column -t
You could calculate the total accounting units (SBUs in our system) by multiplying CPUTime by AllocCPUS, which means multiplying the total (system + user) CPU time by the number of CPUs allocated.
An example:
JobID NodeList State Start End AllocCPUS CPUTime
------------ --------------- ---------- ------------------- ------------------- ---------- ----------
6328552 tcn[595-604] CANCELLED+ 2019-05-21T14:07:57 2019-05-23T16:48:15 240 506-17:12:00
6328552.bat+ tcn595 CANCELLED 2019-05-21T14:07:57 2019-05-23T16:48:16 24 50-16:07:36
6328552.0 tcn[595-604] FAILED 2019-05-21T14:10:37 2019-05-23T16:48:18 240 506-06:44:00
6332520 tcn[384,386,45+ COMPLETED 2019-05-23T16:06:04 2019-05-24T00:26:36 72 25-00:38:24
6332520.bat+ tcn384 COMPLETED 2019-05-23T16:06:04 2019-05-24T00:26:36 24 8-08:12:48
6332520.0 tcn[384,386,45+ COMPLETED 2019-05-23T16:06:09 2019-05-24T00:26:33 60 20-20:24:00
6332530 tcn[37,41,44,4+ FAILED 2019-05-23T17:11:31 2019-05-25T09:13:34 240 400-08:12:00
6332530.bat+ tcn37 FAILED 2019-05-23T17:11:31 2019-05-25T09:13:34 24 40-00:49:12
6332530.0 tcn[37,41,44,4+ CANCELLED+ 2019-05-23T17:11:35 2019-05-25T09:13:34 240 400-07:56:00
The fields are described in the man page. They can be given as -oOPTION (in lower case) or in proper POSIX notation, --format='Option,AnotherOption,...' (the full list is in the man page).
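If you want a single total rather than a per-job listing, a minimal sketch along these lines could work (this assumes Python 3 and that your sacct supports the CPUTimeRAW field, which reports allocated-core seconds per job; the username and start date are placeholders):

import subprocess

# One job-level line per job (-X), no header (-n), only the CPUTimeRAW
# field (allocated-core seconds).
out = subprocess.run(
    ["sacct", "-u", "username", "-S", "2019-01-01",
     "-X", "-n", "-o", "CPUTimeRAW"],
    capture_output=True, text=True, check=True,
).stdout

# Sum the per-job core-seconds and convert to CPU hours.
total_seconds = sum(int(field) for field in out.split() if field.isdigit())
print(f"Total CPU hours: {total_seconds / 3600:.1f}")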
So far so good. But there is a big caveat here:
What you see here is perfect for getting an idea of what you have run, or what to expect in terms of CPU hours. But it will not necessarily reflect your real budget status, because in many cases each node / partition may have an extra parameter, a weight, which is set for accounting purposes and is not part of Slurm. For instance, the GPU nodes may have a weight of x3, which means that each GPU hour is charged as 3 SBUs instead of 1 for budgetary purposes. In other words, you can use sacct to gain insight into the CPU times, but that will not necessarily tell you how many SBU credits you still have.

Generating CPU utilization levels

First, I would like to let you know that I have recently asked this question already; however, it was considered unclear, see Linux: CPU benchmark requiring longer time and different CPU utilization levels. This is a new attempt to formulate the question using a different approach.
What I need: In my research, I look at the CPU utilization of a computer and analyze the CPU utilization pattern within a period of time. For example, a CPU utilization pattern within time period 0 to 10 has the following form:
time, % CPU used
0 , 21.1
1 , 17
2 , 18
3 , 41
4 , 42
5 , 60
6 , 62
7 , 62
8 , 61
9 , 50
10 , 49
I am interested in finding a simple representation for a given CPU utilization pattern. For the evaluation part, I need to create some CPU utilization patterns on my laptop, which I will then record and analyse. These CPU utilization patterns should
span a time period of more than 5 minutes, ideally about 20 minutes;
show some kind of dynamic behavior, i.e. the % CPU used should not be (almost) constant over time, but should vary.
My Question: How can I create such a utilization pattern? Of course, I could just run an arbitrary program on my laptop and obtain some CPU pattern. However, this solution is not ideal, since a reader of my work would have no means to repeat the experiment: they have no access to the program I used. Therefore it would be much more beneficial to use something other than an arbitrary program on my laptop (in my previous post I was thinking about open-source CPU benchmarks, for example). Can anyone recommend something?
Many thanks!
I suggest a moving average. Select a window size and use it to average over. You'll need to decide what type of patterns you want to identify since the wider the window, the more smoothing you get and the fewer "features" you'll see. And CPU activity is very bursty. For example, if you are trying to identify cache bottlenecks, you'll want a small window, probably in the 10ms to 100ms range. If instead you want to correlate to longer term features, such as energy or load, you'll want a larger window, perhaps 10sec to minutes.
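As an illustration of that smoothing, a moving average over the sampled series from the question could look like this (just a sketch, assuming Python 3 with NumPy; the window size is arbitrary):

import numpy as np

# CPU utilization samples, one per second (the series from the question).
cpu = np.array([21.1, 17, 18, 41, 42, 60, 62, 62, 61, 50, 49])

window = 3   # wider window -> more smoothing, fewer visible "features"
smoothed = np.convolve(cpu, np.ones(window) / window, mode="valid")
print(smoothed)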
It looks like you are using OS provided CPU usage and not hardware registers. This means that the OS is already doing some smoothing. It may also be doing estimation for some performance values. Try to find documentation on this if you are integrating over a smaller window. A word of warning: this level of information can be hard to find. You may have to do a lot of digging. Depending upon your familiarity with kernel code, it may be easier to look at the code.
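On the original question of producing a reproducible, time-varying load without a private program: one option is to publish a small script the reader can rerun. The following is only a sketch (assuming Python 3 and a single core; the sinusoidal duty-cycle schedule is an arbitrary example) that alternates busy-waiting and sleeping so the utilization varies over a 20-minute run:

import math
import time

SLOT = 0.1            # length of one duty-cycle slot, in seconds
DURATION = 20 * 60    # total run time: 20 minutes
start = time.time()

while time.time() - start < DURATION:
    elapsed = time.time() - start
    # Duty cycle varies smoothly between roughly 10% and 90%, with a 5-minute period.
    duty = 0.5 + 0.4 * math.sin(2 * math.pi * elapsed / 300)
    busy_until = time.time() + duty * SLOT
    while time.time() < busy_until:
        pass                          # busy-wait: burns CPU for the "on" part of the slot
    time.sleep((1 - duty) * SLOT)     # idle for the rest of the slot

Run one copy per core (or pin the script to a core) if you want the pattern to show up clearly in whole-machine utilization.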

CPU time on a multicore/hyperthreaded CPU

I need to observe the CPU time taken by a process on a multicore/hyper-threaded machine (a Xeon, Opteron, etc.).
Let's assume I have 4 cores, hyper-threaded, meaning 8 'virtual' cores.
Let X be the program I want to run and whose CPU time I want to observe.
If I run process X on my CPU, I get CPU time A. Suppose A is more than 5 minutes.
If I run 8 copies of the same process X, I'll get CPU times B1, B2…, B8.
If I run 7 copies of the same process X, I'll get CPU times C1, C2…, C7.
If I run 4 copies of the same process X, I'll get CPU times D1, D2…, D4.
QUESTIONs:
What's the relationship between numbers A, Bi, Ci, Di?
Is A smaller than Bi? How much?
What about Ci, Di?
Are times Bi different between them?
What about Ci, Di?
What's the relationship between numbers A, Bi, Ci, Di?
Expect D1=D2=D3=D4=A*1, except if you have L2 cache issues (conflicts, faults, ...), in which case the factor will be slightly greater than 1.
Expect B1=B2=B3=B4=...=B8=A*1.3. The factor 1.3 may vary between 1.1 and 2 depending on your application (certain processor subparts are hyper-threaded, others are not). It was computed from similar statistics, which I give here using the notation of the question: D=23 seconds and A=18 seconds, according to a private forum. The unthreaded process did integer computations without input/output; the exact application was checking Adem coefficients in the motivic Steenrod algebra (I don't know what that is; the settings were (2n+e, n) with n=20).
In the case of seven processes (the Cs), if you assign each process to a core (e.g. with /usr/bin/htop on Linux), then one of the processes (C5, for example) will have the same execution time as A, and the others (in my example C1, C2, C3, C4, C6, C7) will have the same values as the Ds. If you do not assign the processes to cores, and your processes last long enough for the OS to balance them between the cores, they will converge to the mean of the Cs.
Are times Bi different between them? What about Ci, Di?
That depends on your OS scheduler and on its configuration. Note that the percentage shown by /bin/top on Linux is misleading: it will show nearly 100% for the As, Bs, Cs and Ds.
To assess performance, don't forget /usr/bin/nettop (and the variants nethogs, nmon, iftop, iptraf), iotop (and the variants iostat, latencytop), collectl (+colmux), and sar (+sag, +sadf).
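Not part of the answer above, but to make the per-process measurement concrete, here is a minimal sketch (assuming Linux and Python 3; the workload is just a stand-in) that pins the current process to one core, similar in effect to the htop pinning mentioned above, and reports both CPU time and wall-clock time:

import os
import time

# Pin this process to core 0 (Linux only), so repeated runs are comparable.
os.sched_setaffinity(0, {0})

wall_start = time.perf_counter()
cpu_start = time.process_time()
total = sum(i * i for i in range(10_000_000))   # stand-in workload
print("wall time:", time.perf_counter() - wall_start)
print("CPU time :", time.process_time() - cpu_start)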
As of 2021, there can be high variation when running multiple experiments; for instance, over 50% difference.
Two gold standards:
Run in single-core mode.
Disable hyperthreading.
For detecting the issue:
Run the same algorithm multiple times.
In theory this could be used when running experiments:
Run each experiment k times.
However, this is incomplete when comparing running times, since one group of k runs could execute under conditions that are not comparable with those of another group of k runs.
To alleviate that:
Run each experiment k times.
Randomize the order of the experiments.
For publication purposes that's not enough, but it might be useful for fast turn-around, even with k = 2 (a short sketch follows this answer).
H/T: discussion in the Slack space of the planning community, related to the ICAPS conference: https://www.icaps-conference.org
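A minimal sketch of the "repeat k times, randomize the order" idea above (assuming Python 3; experiment_commands and k are placeholders to replace with your own benchmarks):

import random
import subprocess
import time

experiment_commands = {                 # placeholder benchmark commands
    "exp_a": ["./run_experiment_a"],
    "exp_b": ["./run_experiment_b"],
}
k = 2                                   # repetitions per experiment

# Build k copies of every experiment, then shuffle so repetitions of the same
# experiment are not run back-to-back under identical machine conditions.
runs = [name for name in experiment_commands for _ in range(k)]
random.shuffle(runs)

timings = {name: [] for name in experiment_commands}
for name in runs:
    start = time.perf_counter()
    subprocess.run(experiment_commands[name], check=True)
    timings[name].append(time.perf_counter() - start)

print(timings)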

GPU programming via JOCL uses only 6 out of 80 shader cores?

I am trying to get a program to run on my GPU, and to start with an easy sample I modified the first sample on http://www.jocl.org/samples/samples.html to run the following little script: I run n simultaneous "threads" (what is the correct name for the GPU equivalent of a thread?), each of which performs 20000000/n independent tanh() computations. You can see my code here: http://pastebin.com/DY2pdJzL
The speed is by far not what I expected:
for n=1 it takes 12.2 seconds
for n=2 it takes 6.3 seconds
for n=3 it takes 4.4 seconds
for n=4 it takes 3.4 seconds
for n=5 it takes 3.1 seconds
for n=6 and beyond, it takes 2.7 seconds.
So after n=6 (be it n=8, n=20, n=100, n=1000 or n=100000), there is no performance increase, which means only 6 of these are computed in parallel. However, according to the specifications of my card there should be 80 cores: http://www.amd.com/us/products/desktop/graphics/ati-radeon-hd-5000/hd-5450-overview/pages/hd-5450-overview.aspx#2
It is not a matter of overhead, since increasing or decreasing the 20000000 only changes all the execution times by a linear factor.
I have installed the AMD APP SDK and drivers that support OpenCL: see http://dl.dropbox.com/u/3060536/prtscr.png and http://dl.dropbox.com/u/3060536/prtsrc2.png for details (or at least I conclude from these that OpenCL is running correctly).
So I'm a bit clueless now, where to search for answer. Why can JOCL only do 6 parallel executions on my ATI Radeon HD 5450?
You are hard-coding the local work size to 1. Use a larger size or let the driver choose one for you.
Also, your kernel is not designed in an OpenCL style. You should take out the for loop and let the driver handle the iterating for you.

Resources