Calculate Queue Wait Times - asp.net

I want to estimate the wait time for each position in a queue, or at least a general wait time based on queue position. The queue is FIFO.
List of current performance status of the service
Size            AvTime  Queue  Processing  AvgFileSize(mb)
1 (0 - 1 mb)      2.57     18           3             0.21
2 (1 - 5 mb)     12.43      2           4             2.16
3 (5 - 10 mb)    23.38      9           8             6.72
4 (10 - 25 mb)   38.17      1           4            12.52
5 (>= 25 mb)    109.31      0           0            32.41
The current list of processing and queued batch files. It only lists the current user's files, which is why some queue numbers are missing.
Queue Filename Status
30 Batch (3456).XML(2) Queue
20 Batch (2399).xml(3) Queue
14 batch (1495).xml(1) Queue
12 batch (1497).xml(1) Queue
15 batch (1499).xml(1) Queue
10 batch (1500).xml(4) Queue
13 batch (1496).xml(1) Queue
11 batch (1501).xml(1) Queue
9 batch (1498).xml(1) Queue
8 batch (1494).xml(1) Queue
7 batch (1493).xml(1) Queue
6 batch (1492).xml(1) Queue
5 batch (1491).xml(1) Queue
4 batch (1490).xml(1) Queue
3 batch (1).xml(1) Queue
2 Batch1.xml(1) Queue
1 Batch1.XML(2) Queue
Batch1.xml(1) Processing
Batch1.xml(1) Processing
Batch1.xml(3) Processing
Batch1.xml(4) Processing
Batch1.xml(1) Processing
Batch1.xml(3) Processing
Batch1.xml(3) Processing
Batch1.xml(3) Processing
Batch1.xml(4) Processing
Batch1.xml(4) Processing
Batch1.xml(2) Processing
Batch1.xml(3) Processing
Batch1.xml(3) Processing
Batch1.xml(2) Processing
Batch1.xml(2) Processing
Batch1.xml(3) Processing
Batch1.xml(3) Processing
Batch1.xml(4) Processing
Batch1.xml(2) Processing
So I would like to add an estimate to the list: how long a batch file at, say, position 20 will wait in the queue before it starts processing.
Queue Filename Status
30 (*30min) Batch (3456).XML(2) Queue
20 (*10min) Batch (2399).xml(3) Queue
...
*estimated

Your question doesn't quite provide enough context to make it possible to answer, but I can make some guesses based on the sample displays you provided.
It looks like you have a "single queue, multiple server" setup. In other words, you have a single FIFO queue and some fixed number N of jobs that can be in processing at any given time. Is that right?
For your algorithm, let's assume you have the following information:
Position of our job in queue (position k means there are k jobs ahead of us)
Size of our job
Size of each job ahead of us in the queue
Pool of jobs being processed, with a certain maximum size N
Size of each job currently being processed
Elapsed time for each job currently in process (how long since that job started)
First of all, you will need a function ExpectedJobDuration(jobsize) that computes an expected processing time for a job of a given size, based on the statistics shown in your "performance status" table. This looks pretty straightforward. Given a job size, first figure out which of your five size categories it falls into (1: 0-1 mb, 2: 1-5 mb, etc.). Then multiply the job size by the average time divided by the average file size for that category. That gives you an estimate of how long it takes to run a job of a given size, under the assumption that job time is proportional to job size for jobs within a particular size range.
Now, for a job of a given size that has already been in process for a given time ElapsedProcessingTime, how long do we expect it to take to complete? A simple answer would be something like:
ExpectedRemainingTime = ExpectedJobDuration(jobsize) - ElapsedProcessingTime.
For jobs sitting in the queue this will be exactly the same as the expected job duration; for jobs already being processed we subtract the time the job has already been in work. However, if there is some random variation in job processing times, this is not exactly right, and could turn out to be negative. This is sort of like the actuarial problem: the average lifespan of a person is X years, how long do we expect someone to live if they are already Y years old? You would need a lot more statistical data to compute this, so for practical purposes, if the answer comes out negative, just set it to zero. (If someone is 100 years old, and the average human lifespan is 90, expect them to die at any moment. That's not quite right, but perhaps OK as a first approximation. Unless you are the 100 year old person, and not yet ready to die. :-))
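As a rough illustration, the two estimates could be sketched in Python as follows. The category boundaries and averages are copied from the performance table in the question; the function names, the choice of Python, and the treatment of lower bounds as inclusive are my own assumptions, and the time unit is whatever AvTime is measured in (the question does not say).

```python
# Sketch only: names and boundary handling are assumptions, not the asker's code.
# Each entry: (exclusive upper bound in MB, average time, average file size in MB),
# taken from the "performance status" table.
SIZE_CATEGORIES = [
    (1.0, 2.57, 0.21),
    (5.0, 12.43, 2.16),
    (10.0, 23.38, 6.72),
    (25.0, 38.17, 12.52),
    (float("inf"), 109.31, 32.41),
]


def expected_job_duration(size_mb):
    """Scale the category's average time by size / average size."""
    for upper_mb, avg_time, avg_size_mb in SIZE_CATEGORIES:
        if size_mb < upper_mb:
            return size_mb * avg_time / avg_size_mb
    raise ValueError("size must be a finite number of megabytes")


def expected_remaining_time(size_mb, elapsed=0.0):
    """Expected duration minus time already spent, clamped at zero."""
    return max(0.0, expected_job_duration(size_mb) - elapsed)
```

For a queued job, call expected_remaining_time(size_mb) with the default elapsed of zero; for a job already in the pool, pass the time since it started.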
OK, now we have a way to compute how long each job ahead of us in the queue should take, and how long it should take to complete jobs already in process.
If the number of jobs currently being processed is less than N (the max that can be processed at any given time) then our job can start right away. So in that case we have the answer - expected delay until our job can start is zero seconds.
Now let's look at the case where we are in position 0 in the queue. That means there are no jobs ahead of us in the queue, so our expected time to start is the minimum of the ExpectedRemainingTime of the jobs in the processing pool.
Now that gives us the basis for a recursive function that computes delay until our expected start time.
DelayUntilStart(jobPool, currentJob, queue) {
    find minJob in jobPool with minimum ExpectedRemainingTime
    wait = ExpectedRemainingTime(minJob)
    if currentJob is in position zero of queue
        return wait
    else
        remove minJob from jobPool
        add wait to the elapsed time of every job still in jobPool
        pop the top job from the queue and put it in the jobPool
        return wait + DelayUntilStart(jobPool, currentJob, queue)
}
Note - we may have a very long job ahead of us in the queue - but that doesn't mean we have to wait for it to complete. We just have to wait for it to get into the pool of jobs currently being processed, and then a shorter job might complete and let us into the pool.
The algorithm I just described is going to be an approximation. But it's probably about as good as you are going to get without a lot of statistics about job processing times. For practical purposes I bet it would work pretty well.
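Here is a minimal, non-recursive Python sketch of the same idea, in case it helps. It is only an illustration under my own assumptions: the function and parameter names are made up, the inputs are the expected times already computed for each job, and it tracks absolute finish times in a heap so that time spent waiting for one job is automatically counted against the others in the pool.

```python
import heapq


def delay_until_start(pool_remaining, queue_durations, position, pool_size):
    """Estimate how long the job at `position` waits before it starts.

    pool_remaining:  expected remaining time of each job in process
    queue_durations: expected duration of each queued job, front first
    position:        our index in the queue (0 means we are next)
    pool_size:       maximum number of jobs processed at once (N)
    """
    finish_times = list(pool_remaining)  # finish times measured from now
    heapq.heapify(finish_times)
    now = 0.0
    start = 0.0
    for duration in queue_durations[: position + 1]:
        if len(finish_times) >= pool_size:
            # Pool is full: wait for the job that finishes earliest.
            now = max(now, heapq.heappop(finish_times))
        start = now
        heapq.heappush(finish_times, start + duration)
    return start


# Two slots, busy for 5 and 6 more time units; a very long job is ahead of us.
print(delay_until_start([5.0, 6.0], [100.0, 1.0], position=1, pool_size=2))
```

In the usage line the long job ahead of us takes the slot that frees up at time 5, and we take the slot that frees up at time 6, so we do not wait for the long job to finish.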

Related

Foreach in R: optimise RAM & CPU use by sorting tasks (objects)?

I have ~200 .Rds datasets that I perform various operations on (different scripts) in a pipeline (of multiple scripts). In most of these scripts I've begun with a for loop and upgraded to a foreach. My problem is that the dataset objects are different sizes (x axis is size in mb):
so if I optimise core number usage (I have a 12core 16gbRAM machine at the office and a 16core 32gbRAM machine at home), it'll whip through the first 90 without incident, but then larger files bunch up and max out the total RAM allocation (remember Rds files are compressed so these are larger in RAM than on disk, but the variability in file size at least gives an indication of the problem). This causes workers to crash and typically leaves me with 1 to 3 cores running through the remainder of the big files (using .errorhandling = "pass"). I'm thinking it would be great to optimise the core number based on number and RAM size of workers, and total available RAM, and figured others might have been in a similar dilemma and developed strategies to address this. Some approaches I've thought of but not tried:
Approach 1: first loop or list through the files on disk, potentially by opening & closing them, use object.size() to get their sizes in RAM, sort largest to smallest, cut halfway, reverse the order of the second half, and intersperse them: smallest, biggest, 2nd smallest, 2nd biggest, etc. 2 workers (or any even numbered multiple) should therefore be working on the 'mean' RAM usage. However: worker 1 will finish its job faster than any other job in the stack and then go onto job 3, the 2nd smallest, likely finish that really quickly also then do job 4, the second largest, while worker 2 is still on the largest, meaning that by job 4, this approach has the machine processing the 2 largest RAM objects concurrently, the opposite of what we want.
Approach 2: sort objects by size-in-RAM, small to large. Starting from object 1, iteratively add subsequent objects' RAM usage until total RAM or core number is exceeded. Foreach on that batch. Repeat. This would work but requires some convoluted coding (probably a for loop wrapper around the foreach which passes the foreach its task list each time?). Also if there are a lot of tasks which won't exceed the RAM (per my example), batching by core count will mean all 12 or 16 have to complete before the next 12 or 16 are started, introducing inefficiency.
Approach 3: sort small to large as in approach 2. Run foreach with all cores. This will churn through the small ones maximally efficiently until the tasks get bigger, at which point workers will start to crash, reducing the number of workers sharing the RAM and thus increasing the chance the remaining workers can continue. Conceptually this will mean cores-1 tasks fail and need to be re-run, but the code is easy and should work fast. I already have code that checks the output directory and removes tasks from the jobs list if they've already been completed, which means I could just re-run this approach; however, I should anticipate further losses and therefore further reruns unless I lower the cores number.
Approach 4: as 3 but somehow close the worker (reduce core number) BEFORE the task is assigned, meaning the task doesn't have to trigger a RAM overrun and fail in order to reduce worker count. This would also mean not having to restart RStudio.
Approach 5: ideally there would be some intelligent queueing system in foreach that would do this all for me but beggars can't be choosers! Conceptually this would be similar to 4, above: for each worker, don't start the next task until there's sufficient RAM available.
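To make the batching in Approach 2 concrete, here is a small language-neutral sketch written in Python rather than R. Everything in it is hypothetical: the function name, the idea of passing in-RAM sizes as a dictionary, and the RAM budget figure are illustrations, and each resulting batch would still have to be handed to a separate foreach call.

```python
def make_batches(sizes_mb, ram_budget_mb, max_workers):
    """Greedily group tasks, smallest first, so that each batch fits
    within the RAM budget and uses at most max_workers workers."""
    batches = []
    current, used = [], 0.0
    for name, size in sorted(sizes_mb.items(), key=lambda kv: kv[1]):
        too_big = used + size > ram_budget_mb
        too_many = len(current) >= max_workers
        if current and (too_big or too_many):
            batches.append(current)
            current, used = [], 0.0
        # An object larger than the whole budget still gets a batch of its own.
        current.append(name)
        used += size
    if current:
        batches.append(current)
    return batches


sizes = {"a.Rds": 1, "b.Rds": 2, "c.Rds": 3, "d.Rds": 10}
print(make_batches(sizes, ram_budget_mb=6, max_workers=4))
```

With these made-up sizes the three small objects share one batch and the large one runs alone, which also shows the inefficiency mentioned above: nothing from the next batch starts until the current batch is done.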
Any thoughts appreciated from folks who've run into similar issues. Cheers!
I've thought a bit about this too.
My problem is a bit different: I don't get crashes, but rather slowdowns due to swapping when there is not enough RAM.
Things that may work:
randomize the iterations so that it is approximately evenly distributed (without needing to know the timings in advance)
similar to approach 5, have some barriers (make some workers wait in a while loop with Sys.sleep()) while there is not enough memory (e.g. determined via package {memuse}).
Things I do in practice:
always store the results of iterations in foreach loops and test if already computed (RDS file already exists)
skip some iterations if needed
rerun the "intensive" iterations using fewer cores

Display used CPU hours with slurm

I have a user account on a super computer where jobs are handled with slurm.
I would like to know the total amount of CPU hours that I have consumed on this super computer. I think that's an understandable question, because there is only a limited number of CPU hours available per project. I'm surprised that an answer is not easy to find.
I know that there are all these commands like sacct, sreport, sshare, etc... but it seems that there is no simple command that displays the used CPU hours.
Can someone help me out?
As others have commented, sacct should give you that information. You will need to look at the man page to get information for past jobs. You can specify a --starttime and --endtime to restrict your query to match your allocation as it ends/renews. The -l option gives you more information than you need, so you can get a smaller set of fields by specifying what you need with --format.
In your instance, the correct answer is to ask the administrators. You have been given an allocation of time to draw from. They likely have a system that will show you your balance and you can reconcile your balance against the output of sacct. Also, if the system you are using has different node types such as high memory, GPU, MIC, or old, they will likely charge you differently for those resources.
You can get an overview of the used CPU hours with the following:
sacct -SYYYY-mm-dd -u username -ojobid,start,end,alloccpu,cputime | column -t
You can calculate the total accounting units (SBU in our system) from the CPUTime column. Note that sacct already reports CPUTime as the elapsed time multiplied by the number of allocated CPUs (AllocCPUS), so it does not need to be multiplied by AllocCPUS again.
An example:
JobID         NodeList         State       Start                End                  AllocCPUS  CPUTime
------------  ---------------  ----------  -------------------  -------------------  ---------  ------------
6328552       tcn[595-604]     CANCELLED+  2019-05-21T14:07:57  2019-05-23T16:48:15        240  506-17:12:00
6328552.bat+  tcn595           CANCELLED   2019-05-21T14:07:57  2019-05-23T16:48:16         24  50-16:07:36
6328552.0     tcn[595-604]     FAILED      2019-05-21T14:10:37  2019-05-23T16:48:18        240  506-06:44:00
6332520       tcn[384,386,45+  COMPLETED   2019-05-23T16:06:04  2019-05-24T00:26:36         72  25-00:38:24
6332520.bat+  tcn384           COMPLETED   2019-05-23T16:06:04  2019-05-24T00:26:36         24  8-08:12:48
6332520.0     tcn[384,386,45+  COMPLETED   2019-05-23T16:06:09  2019-05-24T00:26:33         60  20-20:24:00
6332530       tcn[37,41,44,4+  FAILED      2019-05-23T17:11:31  2019-05-25T09:13:34        240  400-08:12:00
6332530.bat+  tcn37            FAILED      2019-05-23T17:11:31  2019-05-25T09:13:34         24  40-00:49:12
6332530.0     tcn[37,41,44,4+  CANCELLED+  2019-05-23T17:11:35  2019-05-25T09:13:34        240  400-07:56:00
The fields are listed in the man page. They can be given as -oOPTION (in lower case) or in proper POSIX notation as --format='Option,AnotherOption...' (a list is in the man page).
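If you want a single number, the CPUTime values can be summed with a few lines of Python. This is only a sketch: the helper names are mine, it assumes the time strings look like the ones in the example above (days-hours:minutes:seconds, with the days and hours parts optional), and it counts only the top-level job lines, since the step lines (JobIDs containing a dot) describe work inside the same allocation.

```python
def cputime_to_hours(cputime):
    """Convert a time string such as '506-17:12:00' or '12:30' to hours."""
    days, _, clock = cputime.rpartition("-")
    parts = [int(p) for p in clock.split(":")]
    while len(parts) < 3:  # pad a missing hours field with zero
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return int(days or 0) * 24 + hours + minutes / 60 + seconds / 3600


def total_cpu_hours(rows):
    """Sum CPUTime over (jobid, cputime) pairs, skipping job steps."""
    return sum(cputime_to_hours(t) for jobid, t in rows if "." not in jobid)


rows = [
    ("6328552", "506-17:12:00"),
    ("6328552.0", "506-06:44:00"),
    ("6332520", "25-00:38:24"),
    ("6332530", "400-08:12:00"),
]
print(round(total_cpu_hours(rows), 2))
```

The rows here are copied by hand from the example output; in practice you would feed in the two columns from your own sacct query.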
So far so good. But there is a big caveat here:
What you see here is perfect for getting an idea of what you have run or what to expect in terms of CPU hours. But this will not necessarily reflect your real budget status, as in many cases each node / partition may have an extra parameter, the weight, which is a parameter set for accounting purposes and not part of SLURM. For instance, the GPU nodes may have a weight value of x3, which means that each GPU hour is counted as 3 SBU instead of 1 for budgetary purposes. What I mean to say is that you can use sacct to gain insight into the CPU times, but this will not necessarily reflect how many SBU credits you still have.

Performance of the MPI_win_lock

I am struggling to explain the performance of the following snippet of my code, which uses the Intel MPI library:
double time = 0;
time = time - MPI_Wtime();
MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win_global_scheduling_step);
MPI_Win_unlock(0, win_global_scheduling_step);
time = time + MPI_Wtime();
if (id == 0)
    sleep(10);
printf("%d sync time %f\n", id, time);
The output depends on how long rank 0 sleeps, as in the following:
0 sync time 0.000305
1 sync time 10.00045
2 sync time 10.00015
If I change the sleep on rank 0 to 5 seconds instead of 10, the sync time on the other ranks is also on the order of 5 seconds.
The actual data associated with the window "win_global_step" is owned by rank 0.
Any discussion or thoughts about the code would be very helpful.
If rank 0 owns the win_global_step, and rank 0 goes to sleep or cranks away on a computation kernel, or otherwise does not make MPI calls, many MPI implementations will not be able to service other requests.
There is an environment variable (MPICH_ASYNC_PROGRESS) you might try setting. It introduces some big performance tradeoffs, but it can in some instances let RMA operations make progress without explicit calls to MPI routines.
Despite the name "MPICH" in the environment variable, it might work for you as Intel MPI is based off of the MPICH implementation.

CPU Pipeline: How to find average instruction execution time

In a CPU with a four (4)-stage pipeline composed of fetch, decode, execute, and write
back, each stage takes 10, 6, 8, and 8 ns, respectively. Which of the following is an
approximate average instruction execution time in nanoseconds (ns) in the CPU? Here, the
number of instructions to be executed is sufficiently large. In addition, the overhead for the
pipelining process is negligible, and the latency impact from all hazards is ignored.
a) 6
b) 8
c) 10
d) 32
The answer is 10 ns. But I thought it might be 8 ns, since the execute stage takes 8 ns. Please explain simply. Thanks.
Each instruction must go through the four stages. Once the pipeline is full, the flow of instructions in and out is determined by the duration of the longest stage:
Fetch|Decode|Exec|Write|
10ns | 6ns |8ns | 8ns |
-----+------+----+-----+
I7 I6 I5 --> I4 : I3 : I2 : I1 --> out
-----+------+----+-----+
I1..I7 are instructions. I1..I4 are in the pipeline, I5..I7 are
waiting to enter the pipeline.
After 6ns I3 is ready to move from Decode to Exec, but cannot because the Exec stage is still occupied by I2.
After 2ns more (8ns total), I1 moves out of Write, I2 moves from Exec to Write, and I3 can finally move from Decode to Exec.
I4 is still blocking Fetch, so I5 cannot enter.
After 2ns more (10ns total) I4 moves from Fetch to Decode, and I5 can enter.
You see that the pipeline stalls until the longest stage is completed; one instruction enters the pipeline every 10ns. (The Decode stage will be idle 40% of the time, and the Exec and Write stages 20% of the time.)
In a pipelined situation, "the rate at which an output is produced" is determined by the slowest stage. It doesn't matter how fast the rest of the pipeline works; things are bound by the rate at which the slowest stage operates, which here is Fetch at 10 ns. Therefore we can expect the pipeline to produce an output every 10 ns. "The rate at which an output is produced" can be interpreted as the average execution time. So it's 10 ns.
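A quick numeric check of this, as a sketch in Python. It assumes a simple lockstep model in which every stage advances once per cycle and the cycle time equals the slowest stage; the function name is made up for illustration.

```python
def average_time_per_instruction(stage_ns, n_instructions):
    """Average time per instruction in a lockstep pipeline.

    The first instruction needs one cycle per stage; after that,
    one instruction completes per cycle.
    """
    cycle = max(stage_ns)  # the slowest stage sets the cycle time
    total = (len(stage_ns) + n_instructions - 1) * cycle
    return total / n_instructions


stages = [10, 6, 8, 8]  # fetch, decode, execute, write back
for n in (1, 10, 1_000_000):
    print(n, average_time_per_instruction(stages, n))
```

For a single instruction the average is dominated by filling the pipeline, but as the number of instructions grows the average approaches the 10 ns cycle time, which is why the question stresses that the instruction count is sufficiently large.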

How does this formula that calculates CPU utilization work?

I've been given this question
Consider a system running ten I/O-bound tasks and one CPU-bound task. Assume that the I/O-bound tasks issue an I/O operation once for every millisecond of CPU computing and that each I/O operation takes 10 milliseconds to complete. Also assume that the context-switching overhead is 0.1 millisecond and that all processes are long-running tasks. Describe the CPU utilization for a round-robin scheduler when:
a. The time quantum is 1 millisecond
b. The time quantum is 10 milliseconds
and I found answer for it
The time quantum is 1 millisecond: Irrespective of which process is scheduled, the scheduler incurs a 0.1 millisecond context-switching cost for every context switch. This results in a CPU utilization of 1/1.1 * 100 = 91%.
The time quantum is 10 milliseconds: The I/O-bound tasks incur a context switch after using up only 1 millisecond of the time quantum. The time required to cycle through all the processes is therefore 10*1.1 + 10.1 (as each I/O-bound task executes for 1 millisecond and then incurs the context switch, whereas the CPU-bound task executes for 10 milliseconds before incurring a context switch). The CPU utilization is therefore 20/21.1 * 100 = 94%.
My only question is: how is this person deriving the formula for CPU utilization? I can't seem to understand where he/she is getting the numbers 20/21.1 * 100 = 94% and 1/1.1 * 100 = 91%.
For the first case, every task uses 1msec to do work and .1msec to switch; thus, it is spending 1 of every 1.1 msec doing work.
For the second case, it is similar: of the 21.1 msec spent to go through all tasks, only 20 of that is doing actual work.
This is the best explanation of the above problem that I have found:
http://jade-cheng.com/uh/coursework/ics-412/homework-4.pdf
For part a:
We have 11 processes (10 I/O-bound, 1 CPU-bound). Each takes 1ms of execution time and 0.1ms of switching time.
So the total time for one cycle through all processes is: 10 (I/O-bound processes) * 1ms + 1 (CPU-bound process) * 1ms + 11 * 0.1ms (total switching time) = 12.1ms.
Of this 12.1ms, the time for which the CPU was busy executing = 10*1 (for the 10 I/O-bound processes) + 1*1 (for the 1 CPU-bound process) = 10+1 = 11ms.
CPU utilisation = (11/12.1)*100 = (1/1.1)*100 = 91% approx.
For part b:
Although the time quantum is 10ms, an I/O-bound process only occupies the CPU for 1ms and then blocks because it needs I/O, so there is a 0.1ms context switch.
So the total execution time of the I/O-bound processes = 10*1 = 10ms.
The CPU-bound process uses its whole 10ms time slice, followed by 0.1ms of switching, so its total execution time = 1*10 = 10ms.
Total context-switching time = 11*0.1 = 1.1ms.
Therefore total time taken = 10+10+1.1 = 21.1ms,
and the time for which the CPU was busy executing = 10*1+1*10 = 20ms.
CPU utilisation = (20/21.1)*100 = 94% approx.
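The same arithmetic can be written as a short Python sketch, which makes it easy to try other quantum values. The function and parameter names are my own; the model is the one used above, where one cycle runs each task once and every task costs one context switch.

```python
def rr_utilization(quantum_ms, n_io=10, n_cpu=1, io_burst_ms=1.0, switch_ms=0.1):
    """CPU utilization over one round-robin cycle through all tasks."""
    # An I/O-bound task runs until it blocks or its quantum ends.
    io_run = min(quantum_ms, io_burst_ms)
    # A CPU-bound task always uses its full quantum.
    useful = n_io * io_run + n_cpu * quantum_ms
    overhead = (n_io + n_cpu) * switch_ms
    return useful / (useful + overhead)


for quantum in (1, 10):
    print(quantum, round(100 * rr_utilization(quantum), 1))
```

For a 1 ms quantum this evaluates 11/12.1 and for a 10 ms quantum 20/21.1, matching the two fractions in the derivation.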
I was going through the same question. This is how I understood it.
In the first case, when the time quantum is 1 msec, if we think about the Gantt chart, all the I/O-bound processes come first (let's call them p1-p10), followed by p11, which is CPU-bound. So there are 10 context switches in 11 ms, and the effective work done by the CPU in that 11 msec is only 11-(10*.1ms), i.e. 10 ms. So CPU utilization is (10/11)*100 = 90%.
In the same way, in the second case there will be 11 switches (the last one is for the CPU-bound process) if I consider 20.1 msec of time. So the effective time the CPU worked is 20.1-(11*.1) = 19ms, and CPU utilization is (19/20.1)*100 = 94%.
I was confused beyond belief for some reason on this question...after looking at all the answers here I finally understood through carefully looking at the jade-cheng link given by another user. There was no formula I could find in the book (maybe I missed it) but here is my version of the answer, in a kind of pseudo-formula style:
WARNING: This is probably wrong, but maybe you can show me where I went wrong.
a)
[(10 I/O processes)(1ms) + (1 cpu process)(1ms)] / [(10 I/O processes)(1ms) + (1 cpu process)(1ms) + (10 context switches)*(0.1ms)] = 11/12 = 92%
b)
[(10 I/O processes)(1ms) + (1 cpu process)(10ms)] / [(10 I/O processes)(1ms) + (1 cpu process)(10ms) + (10 context switches)*(0.1ms)] = 20/21 = 95%

Resources