I have an MPI program that creates a file recording the time per iteration for a certain amount of calculation. When I run this code without submitting it to the queue (the cluster runs SGE), it reports the following times in seconds. I grabbed 8 processors using mpirun -np 8.
STEP ITIME
-------------
1 0.868128
2 0.426714
3 0.409768
4 0.427312
5 0.412737
6 0.413256
7 0.414480
8 0.414984
9 0.415683
10 0.416826
But when I submit the same amount of work for 8 processors to the queue, the program takes more time per iteration, almost four times as long per step.
STEP ITIME
-------------
1 3.189155
2 1.594365
3 1.600892
4 1.589424
5 1.605402
6 1.589136
7 1.599425
8 1.591966
9 1.601557
10 1.603447
The following bash script was used to submit the job.
#!/bin/sh
#$ -S /bin/bash
#$ -pe orte 8
export PATH=~:$PATH
/opt/openmpi/bin/mpirun -np 8 ./exec
I would appreciate it if someone could point out what might be causing this issue.
In your first case (running the code without submitting to the queue), you are probably running all 8 processes on the same node. That's usually fine nowadays: you've likely got 8 cores.
Try this out:
$ /opt/openmpi/bin/mpirun -np 8 uname -a
Did you get 8 identical lines?
In the SGE case, you might get 8 physical machines, so now there is network communication involved. Confirm as above. I don't know SGE well, but your environment no doubt has a "how to assign MPI processes" switch to indicate whether you want it to assign depth-first or breadth-first.
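As a sketch of where to look (assuming a reasonably standard SGE setup; the PE name orte is taken from your submit script):
# Show how the "orte" parallel environment places slots on hosts.
qconf -sp orte
# allocation_rule $pe_slots    -> all 8 slots packed onto a single host
# allocation_rule $fill_up     -> hosts are filled one at a time (depth first)
# allocation_rule $round_robin -> slots spread across hosts (breadth first)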
I am going to run a load test using JMeter on Amazon AWS, and before starting my test I need to know how much traffic it is going to generate over the network.
The criterion in Amazon's policy is traffic that:
sustains, in aggregate, for more than 1 minute, over 1 Gbps (1 billion bits per second) or 1 Gpps (1 billion packets per second).
If my test is going to exceed this threshold, we need to submit a form before starting the test.
So how can I know whether the test will exceed this number or not?
Run your test with 1 virtual user and 1 iteration in command-line non-GUI mode like:
jmeter -n -t test.jmx -l result.csv
To get an approximate figure, open the result.csv file using the Aggregate Report listener; there you will have 2 columns: Received KB/sec and Sent KB/sec. Multiply their sum by the duration of your test in seconds and you will get the number you're looking for.
Alternatively, you can open the result.csv file using MS Excel or LibreOffice Calc or an equivalent, where you can sum the bytes and sentBytes columns and get the traffic with 1-byte precision.
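If you prefer the command line, here is a minimal sketch with awk. It assumes the default JMeter CSV header (the bytes and sentBytes columns, present in JMeter 3.1+) and sampler labels without embedded commas:
awk -F',' '
NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }   # map header names to column numbers
{ rx += $col["bytes"]; tx += $col["sentBytes"] }          # accumulate received/sent bytes
END { printf "received=%d bytes, sent=%d bytes\n", rx, tx }
' result.csv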
I have a requirement to run dummy jobs for 30 minutes and 60 minutes respectively.
I have tried --delay 30 in command-line jobs, but I did not get the expected delay.
Designating a job as type ‘dummy’ will bypass anything contained within the command line field.
You have two options to create a 30/60-minute timer job.
Option a:
Make the job a command line type job and put sleep 1800 or sleep 3600 in the command line field.
Option b:
Make the job a dummy type job and put sleep 1800 or sleep 3600 in either the pre-execution or post-execution fields.
By default the sleep command operates in seconds. For Windows you may want to look into using the PowerShell version, which would be powershell.exe -command Start-Sleep 1800
Use _sleep instead of sleep
Another way to enable a waiting time, either before or after an OS-type job, is by using the pre-execution or post-execution command options, as appropriate.
The use of _sleep is more convenient because it is operating-system independent and is provided by the Control-M/Agent, which means that you do not require an extra deployment for that functionality.
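As a sketch, assuming _sleep takes seconds just like sleep, the pre-execution (or post-execution) field for the 30-minute dummy job would simply be:
_sleep 1800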
(New to GNU Parallel)
My aim is to run the same Rscript, with different arguments, over multiple cores. My first problem is getting this working on my laptop (2 real cores, 4 virtual); then I will port it over to a machine with 64 cores.
Currently:
I have an Rscript, "Test.R", which takes in arguments, does a thing (say, adds some numbers then writes them to a file), then stops.
I have a "commands.txt" file containing the following:
/Users/name/anaconda3/lib/R/bin/Rscript Test.R 5 100 100
/Users/name/anaconda3/lib/R/bin/Rscript Test.R 5 100 1000
/Users/name/anaconda3/lib/R/bin/Rscript Test.R 5 100 1000
/Users/name/anaconda3/lib/R/bin/Rscript Test.R 5 100 1000
/Users/name/anaconda3/lib/R/bin/Rscript Test.R 50 100 1000
/Users/name/anaconda3/lib/R/bin/Rscript Test.R 50 200 1000
So this tells GNU Parallel to run Test.R using Rscript (which I installed via Anaconda).
In the terminal (after navigating to the desktop, which is where Test.R and commands.txt are), I use the command:
parallel --jobs 2 < commands.txt
What I want this to do is use 2 cores and run the commands from commands.txt until all tasks are complete. (I have tried variations on this command, such as changing the 2 to a 1; in that case, 2 of the cores run at 100% and the other 2 run at around 20-30%.)
When I run this, all 4 cores go to 100% (as seen from htop), the first 2 jobs complete, and no more jobs complete, despite all 4 cores still being at 100%.
When I run the same command on the 64-core machine, all 64 cores go to 100%, and I have to cancel the jobs.
Any advice on resources to look at, or what I am doing wrong would be greatly appreciated.
Bit of a long question, let me know if I can clarify anything.
The output from htop, as requested, while running the above command (sorted by CPU%):
1 [||||||||||||||||||||||||100.0%] Tasks: 490, 490 thr; 4 running
2 [|||||||||||||||||||||||||99.3%] Load average: 4.24 3.46 4.12
3 [||||||||||||||||||||||||100.0%] Uptime: 1 day, 18:56:02
4 [||||||||||||||||||||||||100.0%]
Mem[|||||||||||||||||||5.83G/8.00G]
Swp[|||||||||| 678M/2.00G]
PID USER PRI NI VIRT RES S CPU% MEM% TIME+ Command
9719 user 16 0 4763M 291M ? 182. 3.6 0:19.74 /Users/user/anaconda3
9711 user 16 0 4763M 294M ? 182. 3.6 0:20.69 /Users/user/anaconda3
7575 user 24 0 4446M 94240 ? 11.7 1.1 1:52.76 /Applications/Utilities
8833 user 17 0 86.0G 259M ? 0.8 3.2 1:33.25 /System/Library/StagedF
9709 user 24 0 4195M 2664 R 0.2 0.0 0:00.12 htop
9676 user 24 0 4197M 14496 ? 0.0 0.2 0:00.13 perl /usr/local/bin/par
Based on the output from htop, the script /Users/name/anaconda3/lib/R/bin/Rscript uses more than one CPU thread (182%). You have 4 CPU threads, and since you run 2 Rscripts we cannot tell whether Rscript would eat all 4 CPU threads if it ran by itself. Maybe it will eat all the CPU threads that are available (your test on the 64-core machine suggests this).
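One thing worth testing first (an assumption, since we cannot see where the extra threads come from): Anaconda builds of R often link against a multi-threaded BLAS, and capping its thread count per job may already fix the oversubscription:
# Limit each Rscript to one computation thread; the variable names assume an
# OpenMP- or OpenBLAS-based BLAS, which is common in Anaconda's R.
OMP_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 parallel --jobs 2 < commands.txt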
If you are using GNU/Linux you can limit which CPU threads a program can use with taskset:
taskset 9 parallel --jobs 2 < commands.txt
This should force GNU Parallel (and all its children) to use only CPU threads 1 and 4 (9 in binary: 1001). Running it that way should limit the two jobs to two CPU threads in total.
By using 9 (1001 binary) or 6 (0110 binary) we can be reasonably sure that the two CPU threads are on two different cores. 3 (11 binary) might refer to the two threads on the same CPU core and would therefore probably be slower. The same goes for 5 (101 binary).
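If hex masks are error-prone, taskset also accepts a plain CPU list; a sketch equivalent to the mask 9 above (taskset numbers CPUs from 0, so mask 1001 means CPUs 0 and 3):
# -c accepts a comma-separated list of CPU numbers instead of a bitmask.
taskset -c 0,3 parallel --jobs 2 < commands.txt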
In general you want to use as many CPU threads as possible as that will typically make the computation faster. It is unclear from your question why you want to avoid this.
If you are sharing the server with others, a better solution is to use nice. This way you can use all the CPU power that others are not using.
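A sketch of that approach (19 is the lowest scheduling priority, so other users' processes always win the CPU):
# Run all jobs at the lowest priority, using every CPU thread that is otherwise idle.
nice -n 19 parallel --jobs 4 < commands.txt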
I have a 5-stage datapath with the following stage times:
Fetch 190ps
Decode 120ps
Alu 170ps
Memory 200ps
Writeback 120ps
The exercise asks how many instructions can be executed in 1 us, knowing that the processor works multi-cycle without a pipeline and that the clock is optimised.
I know that if the processor were pipelined and the pipeline were initially empty, the number of instructions would be 4996, by doing:
200ps (longest stage's time) -> 1 instruction
1 us -> x
x=5000
Nº of instructions = 5000-4=4996
Since there's no pipeline in this case, what I did was:
190ps+120ps+170ps+200ps+120ps = 800ps
800ps -> 1 instruction
1 us -> x
x = 1250 instructions
However, the correct answer is 1000 instructions.
Can someone explain to me why?
Thank you
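For reference, the arithmetic that yields 1000 under the usual multi-cycle reading (the assumption being that every stage takes one full clock cycle, so the optimised clock period must cover the longest stage, and the 800ps sum is never achievable):
clock period = 200ps (longest stage)
cycles per instruction = 5 (one cycle per stage)
time per instruction = 5 x 200ps = 1000ps
1 us / 1000ps = 1000 instructions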
When I use the format command, the output is:
AVAILABLE DISK SELECTIONS:
0. c0d0 <DEFAULT cyl 1302 alt 2 hd 255 sec 63>
/pci@0,0/pci-ide@7,1/ide@0/cmdk@0,0
1. c2t0d0 <DEFAULT cyl 1020 alt 2 hd 64 sec 32>
/pci@0,0/pci15ad,1976@10/sd@0,0
But after listing /dev/dsk and /dev/rdsk using ls, I found:
bash-3.00# ls
c0d0p0 c0d0s11 c0d0s5 c1t0d0p3 c1t0d0s14 c1t0d0s8 c2t0d0s1 c2t0d0s3
c0d0p1 c0d0s12 c0d0s6 c1t0d0p4 c1t0d0s15 c1t0d0s9 c2t0d0s10 c2t0d0s4
c0d0p2 c0d0s13 c0d0s7 c1t0d0s0 c1t0d0s2 c2t0d0p0 c2t0d0s11 c2t0d0s5
c0d0p3 c0d0s14 c0d0s8 c1t0d0s1 c1t0d0s3 c2t0d0p1 c2t0d0s12 c2t0d0s6
c0d0p4 c0d0s15 c0d0s9 c1t0d0s10 c1t0d0s4 c2t0d0p2 c2t0d0s13 c2t0d0s7
c0d0s0 c0d0s2 c1t0d0p0 c1t0d0s11 c1t0d0s5 c2t0d0p3 c2t0d0s14 c2t0d0s8
c0d0s1 c0d0s3 c1t0d0p1 c1t0d0s12 c1t0d0s6 c2t0d0p4 c2t0d0s15 c2t0d0s9
c0d0s10 c0d0s4 c1t0d0p2 c1t0d0s13 c1t0d0s7 c2t0d0s0 c2t0d0s2
Question 1
I know that c0d0p0 refers to fdisk partitions, because I'm on an x86 system, not SPARC, but I still don't understand why it appeared even though I never used fdisk.
Question 2
As you saw in the format output, I only have c0d0 [IDE] and c2t0d0 [SCSI], so where does c1t0d0s0 come from?! I even used devfsadm -C and it still exists.
I used format /dev/rdsk/c1t0d0s0 and it told me: No disk found!
I don't understand what this is exactly, and ls -l clearly shows it points to a device file under /devices:
bash-3.00# ls -l c1t0d0s0
lrwxrwxrwx 1 root root 52 Nov 29 2012 c1t0d0s0 -> ../../devices/pci@0,0/pci-ide@7,1/ide@1/sd@0,0:a,raw
So can you please tell me what that is exactly and how I can remove it?
1: There is no need to use fdisk to get c0d0p0; the OS provisions every possible entry (partition/slice) regardless of whether it actually exists or not.
2: This device is likely not handled by format; it might be a CD/DVD drive or a removable device (USB key, drive, ...).
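A sketch of how to check, assuming stock Solaris tools:
# List removable-media devices; an ATAPI CD/DVD behind ide@1 should show up here.
rmformat -l
# Note: devfsadm -C only prunes links whose /devices nodes are gone, so as long
# as ../../devices/pci@0,0/pci-ide@7,1/ide@1/sd@0,0 is still attached, the
# /dev/dsk and /dev/rdsk links will stay.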