Performance of MPI_Win_lock - mpi

I am struggling to explain the performance of the following snippet of my code, which uses the Intel MPI library:
double time = 0;
time = time - MPI_Wtime();   /* start timing */
MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win_global_scheduling_step);
MPI_Win_unlock(0, win_global_scheduling_step);
time = time + MPI_Wtime();   /* elapsed time of the lock/unlock pair */
if (id == 0)
    sleep(10);
printf("%d sync time %f\n", id, time);
The output depends on how long rank 0 sleeps, as shown below:
0 sync time 0.000305
1 sync time 10.00045
2 sync time 10.00015
If I change the sleep on rank 0 to 5 seconds instead of 10 seconds, the sync time on the other ranks is also on the order of 5 seconds.
The actual data associated with the window "win_global_step" is owned by rank 0.
Any thoughts about the code would be very helpful.

If rank 0 owns the win_global_step, and rank 0 goes to sleep or cranks away on a computation kernel, or otherwise does not make MPI calls, many MPI implementations will not be able to service other requests.
There is an environment variable (MPICH_ASYNC_PROGRESS) you might try setting. It introduces some big performance tradeoffs, but it can in some instances let RMA operations make progress without explicit calls to MPI routines.
Despite the name "MPICH" in the environment variable, it might work for you as Intel MPI is based off of the MPICH implementation.

Vivado: setting timing constraints for input and output delay, simulation mismatch and wrong clock behavior

I'm implementing a hashing algorithm in Verilog using Vivado 2019.2.1. Everything (including synthesis and implementation) worked quite well, but I noticed recently that the results of the behavioral simulation (correct hash digest) differ from those of the post-synthesis/-implementation functional and timing simulations, i.e. I receive three different values for the same circuit design/code.
My base configuration contained a testbench using the default `timescale 1ns / 1ps and a #1 delay for toggling the clock register. I further constrained the clock to a frequency of 10 MHz using an xdc file. During synthesis, no errors (or even warnings, except some "parameter XYZ is used before its declaration") are shown and no non-blocking and blocking assignments are mixed inside my code. Nevertheless, I noticed that the post-* simulation (no matter if functional or timing) needs more clock cycles (e.g. 58 instead of 50 until the value of a specific register was toggled) for achieving the same state of the circuit. My design is entirely synchronous and driven by one clock.
This brought me to the Timing Report and I noticed that 10 input and 10 output delays are not constrained. In addition, the Design Timing Summary shows a worst negative slack for setup that is very close to the time of one clock cycle. I tried some combinations of input and output delays following the Vivado documentation and tutorial videos but I'm not sure how to find out which values are suitable. The total slack (TNS, THS and TPWS) is zero.
Furthermore, I tried to reduce the clock frequency because the propagation delay of some signals that control logic in the FSM (= top) module might be too large. The strange thing that happened then is that the simulation never reached the $finish; in my testbench and nothing except the clock register changed its value in the waveform. In the behavioral simulation everything works as expected but this doesn't seem to be influenced by constraints or even timing. Monitoring the o_round_done wire (determined by an LFSR in a separate submodule) in my testbench, I noticed that for the behavioral simulation the value of this wire changes with the clock whereas for the post-* simulations the value is changed with a small delay:
Behavioral Simulation
clock cycles: 481, round_done: 0
clock cycles: 482, round_done: 1
clock cycles: 483, round_done: 0
total of 1866 clock cycles
Post-Implementation Functional Simulation
clock cycles: 482, round_done: 0
clock cycles: 482, round_done: 1
clock cycles: 483, round_done: 1
clock cycles: 483, round_done: 0
total of 1914 clock cycles
Post-Implementation Timing Simulation
WARNING: "C:\Xilinx\Vivado\2019.2\data/verilog/src/unisims/BUFG.v" Line 57: Timing violation in scope /tb/fsm/i_clk_IBUF_BUFG_inst/TChk57_10300 at time 997845 ps $period (posedge I,(0:0:0),notifier)
WARNING: "C:\Xilinx\Vivado\2019.2\data/verilog/src/unisims/BUFG.v" Line 56: Timing violation in scope /tb/fsm/i_clk_IBUF_BUFG_inst/TChk56_10299 at time 998845 ps $period (negedge I,(0:0:0),notifier)
simulation never stops (probably because round_done is never 1)
Do you know what I'm doing wrong here? I'm wondering why the circuit is not behaving correctly at very low clock frequencies (e.g. 500 kHz) as, to my knowledge, this will provide enough time for each signal to "travel" to the correct destination.
Another thing I noticed is that one wire that is assigned to a register in a submodule is 8'bXX in the behavioral simulation until the connected register is "filled" but in the post-* simulations it is 8'b00 from the beginning. Any idea here?
Moreover, what is actually defining the clock frequency for the simulations? The values in the testbench (timescale and delay #) or the constraint in the xdc file?
I found an explanation for the question why the post-* simulations are behaving differently compared to the behavioral simulation w.r.t. clock cycles etc. in the Xilinx Vivado Design Suite User Guide for Logic Simulation (UG900).
What causes the "latency" before the actual computation of the design can start is called Global Set and Reset (GSR) and takes 100ns:
"The glbl.v file declares the global GSR and GTS signals and automatically pulses GSR for 100 ns." (p. 217)
Consequently, I solved the issue by letting the test bench wait for the control logic (= finite-state machine) to be ready, i.e. changing to the state after RESET.

Calculate Queue Wait Times

I am looking to calculate the wait in a queue per position or a general time based on your queue position. It is a FIFO.
List of current performance status of the service
Size             AvTime   Queue   Processing   AvgFileSize(mb)
1 (0 - 1 mb)       2.57      18            3              0.21
2 (1 - 5 mb)      12.43       2            4              2.16
3 (5 - 10 mb)     23.38       9            8              6.72
4 (10 - 25 mb)    38.17       1            4             12.52
5 (>= 25 mb)     109.31       0            0             32.41
The current list of processing and queued batch files. It only lists the current user's files, which is why some queue numbers are missing.
Queue Filename Status
30 Batch (3456).XML(2) Queue
20 Batch (2399).xml(3) Queue
14 batch (1495).xml(1) Queue
12 batch (1497).xml(1) Queue
15 batch (1499).xml(1) Queue
10 batch (1500).xml(4) Queue
13 batch (1496).xml(1) Queue
11 batch (1501).xml(1) Queue
9 batch (1498).xml(1) Queue
8 batch (1494).xml(1) Queue
7 batch (1493).xml(1) Queue
6 batch (1492).xml(1) Queue
5 batch (1491).xml(1) Queue
4 batch (1490).xml(1) Queue
3 batch (1).xml(1) Queue
2 Batch1.xml(1) Queue
1 Batch1.XML(2) Queue
Batch1.xml(1) Processing
Batch1.xml(1) Processing
Batch1.xml(3) Processing
Batch1.xml(4) Processing
Batch1.xml(1) Processing
Batch1.xml(3) Processing
Batch1.xml(3) Processing
Batch1.xml(3) Processing
Batch1.xml(4) Processing
Batch1.xml(4) Processing
Batch1.xml(2) Processing
Batch1.xml(3) Processing
Batch1.xml(3) Processing
Batch1.xml(2) Processing
Batch1.xml(2) Processing
Batch1.xml(3) Processing
Batch1.xml(3) Processing
Batch1.xml(4) Processing
Batch1.xml(2) Processing
So I am looking to add more information to the list: how long a batch file at, say, position 20 will wait in the queue before it starts processing.
Queue Filename Status
30 (*30min) Batch (3456).XML(2) Queue
20 (*10min) Batch (2399).xml(3) Queue
...
*estimated
Your question doesn't quite provide enough context to make it possible to answer, but I can make some guesses based on the sample displays you provided.
Looks like you have a "single queue, multiple server" setup. In other words, you have a single FIFO queue and some fixed number N of jobs that can be in processing at any given time. Is that right?
For your algorithm, let's assume you have the following information:
Position of our job in the queue (position k means there are k jobs ahead of us)
Size of our job
Size of each job ahead of us in the queue
Pool of jobs being processed, with a certain maximum size N
Size of each job currently being processed
Elapsed time for each job currently in process (how long since that job started)
First of all, you will need a function ExpectedJobDuration(jobsize) that computes an expected job processing time for a job of a given size, based on the statistics shown in your "performance status" table. This looks pretty straightforward. Given a job size, first figure out which of your five size categories it falls into (1: 0-1 mb, 2: 1-5 mb, etc.). Then take your job size and multiply by the average time divided by the average size of jobs in that category. That will give you an estimate of ExpectedJobDuration(jobsize), which will tell you how long it takes to run a job of a given size, under the assumption that job time is proportional to job size for jobs within a particular size range.
Now, for a job of a given size that's already been in process for a given time ElapsedProcessingTime, how long do we expect it to take to complete? A simple answer would be something like:
ExpectedRemainingTime = ExpectedJobDuration(jobsize) - ElapsedProcessingTime.
For jobs sitting in the queue this will be exactly the same as the expected job duration; for jobs already being processed we subtract the time the job has already been in work. However, if there is some random variation in job processing times, this is not exactly right, and could turn out to be negative. This is sort of like the actuarial problem: the average lifespan of a person is X years, how long do we expect someone to live if they are already Y years old? You would need a lot more statistical data to compute this, so for practical purposes, if the answer comes out negative, just set it to zero. (If someone is 100 years old, and the average human lifespan is 90, expect them to die at any moment. That's not quite right, but perhaps OK as a first approximation. Unless you are the 100 year old person, and not yet ready to die. :-))
OK, now we have a way to compute how long each job ahead of us in the queue should take, and how long it should take to complete jobs already in process.
If the number of jobs currently being processed is less than N (the max that can be processed at any given time) then our job can start right away. So in that case we have the answer - expected delay until our job can start is zero seconds.
Now let's look at the case where we are in position 0 in the queue. That means there are no jobs ahead of us in the queue, so our expected time to start is the minimum of the ExpectedRemainingTime of the jobs in the processing pool.
Now that gives us the basis for a recursive function that computes delay until our expected start time.
DelayUntilStart(jobPool, currentJob, queue) {
    find minJob in jobPool with minimum ExpectedRemainingTime
    if currentJob is in position zero of queue
        return ExpectedRemainingTime(minJob)
    else
        wait = ExpectedRemainingTime(minJob)
        remove minJob from jobPool
        add wait to the elapsed time of every job still in jobPool
        pop the top job from the queue and put it in the jobPool
        return wait + DelayUntilStart(jobPool, currentJob, queue)
}
Note - we may have a very long job ahead of us in the queue - but that doesn't mean we have to wait for it to complete. We just have to wait for it to get into the pool of jobs currently being processed, and then a shorter job might complete and let us into the pool.
The algorithm I just described is going to be an approximation. But it's probably about as good as you are going to get without a lot of statistics about job processing times. For practical purposes I bet it would work pretty well.
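The same idea can be sketched in Python as a small event simulation rather than a recursion. This is only a sketch under stated assumptions: the category table is copied from the question, but the function names, the max_parallel parameter (the N above), and the treatment of the size ranges as "lower bound inclusive, upper bound exclusive" are my own choices, and the time unit is whatever AvTime is measured in.

```python
import heapq

# (upper size bound in MB, average time, average file size in MB), copied from
# the "performance status" table. The time unit is whatever AvTime uses.
CATEGORIES = [
    (1.0, 2.57, 0.21),
    (5.0, 12.43, 2.16),
    (10.0, 23.38, 6.72),
    (25.0, 38.17, 12.52),
    (float("inf"), 109.31, 32.41),
]


def expected_job_duration(size_mb):
    """Scale the category's average time by size / average size in that category."""
    for upper, avg_time, avg_size in CATEGORIES:
        if size_mb < upper:
            return size_mb * avg_time / avg_size
    raise ValueError(f"invalid job size: {size_mb!r}")


def delay_until_start(processing, queue_ahead, max_parallel):
    """Estimate the wait until our job enters the processing pool.

    processing   -- list of (size_mb, elapsed_time) for jobs being processed
    queue_ahead  -- sizes of the jobs ahead of us, front of the queue first
    max_parallel -- assumed maximum number of jobs processed at once
    """
    if max_parallel < 1:
        raise ValueError("max_parallel must be at least 1")
    # Expected finish times measured from now; negative estimates clamp to zero.
    finish = [max(0.0, expected_job_duration(s) - e) for s, e in processing]
    heapq.heapify(finish)
    now = 0.0
    for size in queue_ahead:
        if len(finish) >= max_parallel:
            now = heapq.heappop(finish)  # wait for the next slot to free up
        heapq.heappush(finish, now + expected_job_duration(size))
    if len(finish) >= max_parallel:
        now = heapq.heappop(finish)      # the slot our own job will take
    return now
```

For example, delay_until_start([(0.21, 0.0), (2.16, 0.0)], [0.21], 2) models two running jobs, one small job ahead of us, and a pool of size two. Keeping finish times in a heap means every job in the pool ages together as the simulated clock advances.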

C++ functions and MPI programing

From what I have learned in my supercomputing class I know that MPI is a communicating (and data passing) interface.
I'm confused on when you run a function in a C++ program and want each processor to perform a specific task.
For example, a prime number search (very popular for supercomputers). Say I have a range of values (531-564, some arbitrary range) and say I have 50 processes I could run a series of evaluations on for each number. If root (process 0) wants to examine 531 and knowing prime numbers I can use 8 processes (1-8) to evaluate the prime status. If the number is divisible by any number 2-9 with a remainder of 0, then it is not prime.
Is it possible that for MPI which passes data to each process to have these processes perform these actions?
The hardest part for me is understanding that if I perform an action in the original C++ program the processes taking place could be allocated on several different processes, then in MPI how can I structure this? Or is my understanding completely wrong? If so how am I supposed to truly go about this path of thinking in a correct manner?
The big idea is passing data to a process versus having a function sent to a process. I'm fairly certain I'm wrong but I'm trying to back track to fix my thinking.
Each MPI process is running the same program, but that doesn't mean that they are doing the same thing. Different processes can be running different branches of the code, depending on the id (or "rank") of the process, and in effect be completely independent. Like any distributed computation, the actors do need to agree on how they will communicate.
The most basic strategy in MPI is scatter-gather, where the "master" process (usually the one with rank 0) will split an array of work equally amongst the peers (including the master process itself) by having them all call scatter, the peers will do the work, then all peers will call gather to send the results back to master.
In your prime algorithm example, build an array of integers and "scatter" it to all the peers; each peer will run through its array, saving 1 if the number is prime and 0 if it is not, and then "gather" the results to master. [In this particular example, since the input data is completely predictable based on process rank, the scatter step is unnecessary but we will do it anyway.]
As pseudo-code:
main():
    int n = 100, x[n]
    MPI_Init()
    // k = number of processes in world, rank = id of this process
    // prepare data on master
    if rank == 0:
        for i in 1 ... n, x[i] = i
    // send n/k elements of x on root to local on each process in world
    MPI_Scatter(x, n/k, int, local, n/k, int, root, world)
    for i in 1 ... n/k:
        result[i] = 1 // assume prime
        if local[i] < 2, result[i] = 0
        for d in 2, 3, 5, 7: // enough divisors for n <= 120
            if d < local[i] and d divides local[i], result[i] = 0
    // gather n/k results from each process in world into x on root
    MPI_Gather(result, n/k, int, x, n/k, int, root, world)
    // print results
    if rank == 0:
        for i in 1 ... n, print i if x[i] == 1
    MPI_Finalize()
There are lots of details to fill in, such as proper declarations, dealing with the fact that some ranks will have fewer elements than others, using proper C syntax, etc., but getting them right doesn't help explain the overall picture.
More fine-grained synchronization and communication is possible using direct send/recv between processes. Such programs are harder to write since the different processes may be in different states. In particular, it is important that if process a is calling MPI_Send to process b, then process b had better be calling MPI_Recv from a.
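If it helps to see the data flow without any MPI machinery, here is a single-process Python simulation of the scatter / work / gather pattern. It makes no MPI calls at all: scatter, work and gather are hypothetical stand-ins I made up for illustration, and k = 8 "ranks" is an arbitrary choice. It also shows one way to handle the uneven split mentioned above (in real MPI that is what MPI_Scatterv and MPI_Gatherv are for).

```python
def scatter(data, k):
    """Split data into k contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(data), k)
    chunks, start = [], 0
    for rank in range(k):
        size = base + (1 if rank < extra else 0)  # first `extra` ranks get one more
        chunks.append(data[start:start + size])
        start += size
    return chunks


def is_prime(n):
    """Plain trial division; fine for small inputs."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


def work(local):
    """What each rank does with its own chunk: 1 if prime, 0 otherwise."""
    return [1 if is_prime(v) else 0 for v in local]


def gather(results):
    """Concatenate per-rank results in rank order, as the root would see them."""
    return [flag for chunk in results for flag in chunk]


n, k = 100, 8                       # k simulated ranks (assumed value)
x = list(range(1, n + 1))           # data prepared on the "root"
chunks = scatter(x, k)              # one chunk per rank
flags = gather([work(c) for c in chunks])
primes = [v for v, f in zip(x, flags) if f]
print(primes)
```

In a real MPI program each call to work would run in a different process at the same time; here the list comprehension just runs them one after another.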

How does this formula that calculates CPU utilization work?

I've been given this question
Consider a system running ten I/O-bound tasks and one CPU-bound task. Assume that the I/O-bound tasks issue an I/O operation once for every millisecond of CPU computing and that each I/O operation takes 10 milliseconds to complete. Also assume that the context-switching overhead is 0.1 millisecond and that all processes are long-running tasks. Describe the CPU utilization for a round-robin scheduler when:
a. The time quantum is 1 millisecond
b. The time quantum is 10 milliseconds
and I found answer for it
The time quantum is 1 millisecond: Irrespective of which process is scheduled, the scheduler incurs a 0.1 millisecond context-switching cost for every context switch. This results in a CPU utilization of 1/1.1 * 100 = 91%.
The time quantum is 10 milliseconds: The I/O-bound tasks incur a context switch after using up only 1 millisecond of the time quantum. The time required to cycle through all the processes is therefore 10*1.1 + 10.1 (as each I/O-bound task executes for 1 millisecond and then incurs the context switch, whereas the CPU-bound task executes for 10 milliseconds before incurring a context switch). The CPU utilization is therefore 20/21.1 * 100 = 94%.
My only question is how this person is deriving the formula for CPU utilization. I can't seem to understand where he/she is getting the numbers 20/21.1 * 100 = 94% and 1/1.1 * 100 = 91%.
For the first case, every task uses 1msec to do work and .1msec to switch; thus, it is spending 1 of every 1.1 msec doing work.
For the second case, it is similar: of the 21.1 msec spent to go through all tasks, only 20 of that is doing actual work.
This is the best possible explanation of the above problem:
http://jade-cheng.com/uh/coursework/ics-412/homework-4.pdf
For part (a):
We have 11 processes (10 I/O-bound, 1 CPU-bound). Each uses 1 ms of execution time and 0.1 ms of switching time.
So the total time for one full cycle through all processes is: 10 (I/O-bound) * 1 ms + 1 (CPU-bound) * 1 ms + 11 * 0.1 ms (total switching time) = 12.1 ms.
Of this 12.1 ms, the time for which the CPU was busy executing = 10 * 1 (for the 10 I/O-bound processes) + 1 * 1 (for the 1 CPU-bound process) = 10 + 1 = 11 ms.
CPU utilisation = (11/12.1) * 100 = (1/1.1) * 100 = 91% approx.
For part (b):
Although the time quantum is 10 ms, an I/O-bound process only occupies the CPU for 1 ms and then blocks because it needs I/O, which costs 0.1 ms of context switching.
So the total CPU time taken by the I/O-bound processes is 10 * 1 = 10 ms.
The CPU-bound process, however, uses its whole 10 ms time slice plus 0.1 ms of switching, so it takes 1 * 10 = 10 ms of CPU time.
Total context-switching time = 11 * 0.1 = 1.1 ms.
Therefore total time taken = 10 + 10 + 1.1 = 21.1 ms,
and the time for which the CPU was busy executing = 10 * 1 + 1 * 10 = 20 ms.
CPU utilisation = (20/21.1) * 100 = 94% approx.
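The arithmetic above is easy to check with a few lines of Python. This is just a sketch of the per-cycle calculation; the function name rr_utilization is made up, and it assumes (as the answers here do) that every I/O operation has completed by the time the task's next turn comes around and that each task's turn is followed by exactly one context switch.

```python
def rr_utilization(bursts_ms, switch_ms):
    """Utilization over one round-robin cycle: useful time / (useful + switching)."""
    useful = sum(bursts_ms)
    overhead = switch_ms * len(bursts_ms)  # one context switch after each task's turn
    return useful / (useful + overhead)


IO_TASKS = 10       # number of I/O-bound tasks
IO_BURST_MS = 1.0   # CPU time an I/O-bound task uses before it blocks
SWITCH_MS = 0.1     # context-switch cost

# (a) quantum = 1 ms: the CPU-bound task also runs for only 1 ms per turn
part_a = rr_utilization([IO_BURST_MS] * IO_TASKS + [1.0], SWITCH_MS)

# (b) quantum = 10 ms: the CPU-bound task runs for its full 10 ms
part_b = rr_utilization([IO_BURST_MS] * IO_TASKS + [10.0], SWITCH_MS)

print(f"a) {part_a:.1%}")  # a) 90.9%
print(f"b) {part_b:.1%}")  # b) 94.8%
```

So 91% is 90.9% rounded, and the 94% quoted for part (b) is 94.8% with the decimals dropped.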
I was going through the same question. This is how I understood it.
In the first case, when the time quantum is 1 msec, if we think about the Gantt chart, all the I/O-bound processes come first (let's call them p1-p10), followed by p11, which is CPU-bound. So there are 10 context switches in total in 11 ms, and the effective work done by the CPU in those 11 msec is only 11 - (10 * 0.1 ms), i.e. 10 ms. So CPU utilization is (10/11) * 100 ≈ 91%.
In the same way, in the second case there will be 11 switches (the last one is for the CPU-bound process) if I consider 20.1 msec of time. So the effective time the CPU worked is 20.1 - (11 * 0.1) = 19 ms, and CPU utilization is (19/20.1) * 100 ≈ 94.5%.
I was confused beyond belief for some reason on this question...after looking at all the answers here I finally understood through carefully looking at the jade-cheng link given by another user. There was no formula I could find in the book (maybe I missed it) but here is my version of the answer, in a kind of pseudo-formula style:
WARNING: This is probably wrong, but maybe you can show me where I went wrong.
a)
[(10 I/O processes)(1ms) + (1 cpu process)(1ms)] / [(10 I/O processes)(1ms) + (1 cpu process)(1ms) + (10 context switches)*(0.1ms)] = 11/12 ≈ 92%
b)
[(10 I/O processes)(1ms) + (1 cpu process)(10ms)] / [(10 I/O processes)(1ms) + (1 cpu process)(10ms) + (10 context switches)*(0.1ms)] = 20/21 = 95%

OpenCL computation times much longer than CPU alternative

I'm taking my first steps in OpenCL (and CUDA) for my internship. All nice and well, I now have working OpenCL code, but the computation times are way too high, I think. My guess is that I'm doing too much I/O, but I don't know where that could be.
The code is for the main: http://pastebin.com/i4A6kPfn, and for the kernel: http://pastebin.com/Wefrqifh I'm starting to measure time after segmentPunten(segmentArray, begin, eind); has returned, and I end measuring time after the last clEnqueueReadBuffer.
Computation time on an Nvidia GT440 is 38.6 seconds, on a GT555M 35.5 seconds, on an Athlon II X4 5.6 seconds, and on an Intel P8600 6 seconds.
Can someone explain this to me? Why are the computation times so high, and what solutions are there for this?
What it is supposed to do (short version): calculate how much noise load is produced by an airplane that is passing by.
Long version: there are several Observer Points (OPs), which are the points at which sound from a passing airplane is measured. The flight path is segmented into 10,000 segments; this is done in the function segmentPunten. The double for loop in main gives each OP a coordinate. There are two kernels. The first one calculates the distance from a single OP to a single segment, which is then saved in the array "afstanden". The second kernel calculates the sound load at an OP from all the segments.
Just eyeballing your kernel, I see this:
kernel void SEL(global const float *afstanden, global double *totaalSEL,
                const int aantalSegmenten)
{
    // ...
    for(i = 0; i < aantalSegmenten; i++) {
        double distance = afstanden[threadID * aantalSegmenten + i];
        // ...
    }
    // ...
}
It looks like aantalSegmenten is being set to 1000. You have a loop in each kernel that accesses global memory 1000 times. Without crawling through the code, I'm guessing that many of these accesses overlap when considering your computation as a whole. Is this the case? Will two work items access the same global memory? If so, you will see a potentially huge win on the GPU from rewriting your algorithm to partition the work such that you read from a specific global memory location only once, saving it in local memory. After that, each work item in the work group that needs that location can read it quickly.
As an aside, the CL specification allows you to omit the leading __ from CL keywords like global and kernel. I don't think many newcomers to CL realize that.
Before optimizing further, you should first get an understanding of what is taking all that time. Is it the kernel compiles, data transfer, or actual kernel execution?
As mentioned above, you can get rid of the kernel compiles by caching the results. I believe some OpenCL implementations (the Apple one at least) already do this automatically. With others, you may need to do the caching manually. Here are instructions for the caching.
If the performance bottleneck is the kernel itself, you can probably get a major speed-up by organizing the 'afstanden' array lookups differently. Currently, when a block of threads performs a read from memory, the addresses are spread out through memory, which is a real killer for GPU performance. Ideally you would index the array with something like afstanden[ndx*NUM_THREADS + threadID], so that the accesses from a work group load a contiguous block of memory. This is much faster than the current, essentially random, memory lookup.
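To make the difference between the two indexing schemes concrete, here is a tiny Python sketch (plain Python, not OpenCL) that lists which array elements neighbouring work items touch in the same loop iteration. NUM_THREADS = 4 and the helper names are assumptions purely for illustration; 1000 segments is the value guessed in an earlier answer.

```python
def strided_index(thread_id, i, num_segments):
    """Current layout: each work item owns a contiguous row of segments."""
    return thread_id * num_segments + i


def coalesced_index(thread_id, i, num_threads):
    """Suggested layout: for a fixed i, neighbouring work items are adjacent."""
    return i * num_threads + thread_id


NUM_THREADS = 4      # assumed tiny work-group size, for illustration only
NUM_SEGMENTS = 1000  # plays the role of aantalSegmenten

i = 0  # look at what all work items read in one loop iteration
strided = [strided_index(t, i, NUM_SEGMENTS) for t in range(NUM_THREADS)]
coalesced = [coalesced_index(t, i, NUM_THREADS) for t in range(NUM_THREADS)]

print(strided)    # [0, 1000, 2000, 3000]
print(coalesced)  # [0, 1, 2, 3]
```

With the first scheme the work items read addresses 1000 elements apart in every iteration; with the second they read one contiguous run. Note that the kernel that fills 'afstanden' would have to write the array in the same transposed order for this to work.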
First of all, you are not measuring the computation time but the whole kernel read-in/compile/execute mumbo-jumbo. To make a fair comparison, measure the computation time starting from the first "non-static" part of your program (for example, from the first clSetKernelArg to the last clEnqueueReadBuffer).
If the execution time is still too high, you can use some kind of profiler (such as the Visual Profiler from NVIDIA), and read the OpenCL Best Practices guide, which is included in the CUDA Toolkit documentation.
As for the raw kernel execution time: consider (and measure) whether you really need double precision for your calculation, because double-precision calculations are artificially slowed down on consumer-grade NVIDIA cards.
