I am wondering why @btime reports one memory allocation per element in basic loops like these:
julia> using BenchmarkTools
julia> v=[1:15;]
15-element Array{Int64,1}:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
julia> @btime for vi in v end
1.420 μs (15 allocations: 480 bytes)
julia> @btime for i in eachindex(v) v[i]=-i end
2.355 μs (15 allocations: 464 bytes)
I do not know how to interpret this result:
is it a bug/artifact of @btime?
is there really one alloc per element? (this would ruin performance...)
julia> versioninfo()
Julia Version 1.5.1
Commit 697e782ab8 (2020-08-25 20:08 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Xeon(R) CPU E5-2603 v3 @ 1.60GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-9.0.1 (ORCJIT, haswell)
You're benchmarking access to the global variable v; avoiding non-constant globals is the very first performance tip you should be aware of.
With BenchmarkTools you can work around that by interpolating v:
julia> @btime for vi in v end
555.962 ns (15 allocations: 480 bytes)
julia> @btime for vi in $v end
1.630 ns (0 allocations: 0 bytes)
But note that in general it's better to put your code in functions. The global scope is just bad for performance:
julia> f(v) = for vi in v end
f (generic function with 1 method)
julia> @btime f(v)
11.410 ns (0 allocations: 0 bytes)
julia> @btime f($v)
1.413 ns (0 allocations: 0 bytes)
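Another possible workaround (just a sketch; w is a new name used here only to avoid redefining v) is to declare the global as const, so the compiler can infer its type:
julia> const w = [1:15;];  # const global: the element type is known to the compiler
julia> @btime for wi in w end  # should now report 0 allocations, like the interpolated $v case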
I have two collections of the same size - I have partitioned the first one and would like to apply the same partition to the second one. Is there an elegant way of doing this? I have found the following but it seems ugly...
(def x (range 50 70))
(def y [(range 5) (range 10) (range 3) (range 2)]) ;my partition of 20 items
(drop-last
(reduce (fn [a b] (concat (drop-last a)
(split-at (count b) (last a))))
[x] y))
I would propose a slightly different approach, using the collection-manipulation functions:
(defn split-like [pattern data]
(let [sizes (map count pattern)]
(->> sizes
(reductions #(drop %2 %1) data)
(map take sizes))))
user> (split-like y x)
;;=> ((50 51 52 53 54) (55 56 57 58 59 60 61 62 63 64) (65 66 67) (68 69))
The idea is to collect the corresponding tails by reductions with drop:
user> (reductions (fn [acc x] (drop x acc)) (range 20) [1 2 3 4])
;;=> ((0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19)
;; (1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19)
;; (3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19)
;; (6 7 8 9 10 11 12 13 14 15 16 17 18 19)
;; (10 11 12 13 14 15 16 17 18 19))
and then just take the needed amounts from those tails:
user> (map take [1 2 3 4] *1)
;;=> ((0) (1 2) (3 4 5) (6 7 8 9))
Similar in spirit to your solution, but I think it is easier to read a straightforward loop/recur structure than a somewhat unconventional reduce:
(ns tst.demo.core
(:use demo.core tupelo.core tupelo.test))
; we want to partition x into the same shape as y
(verify
; ideally this would have error checking to ensure there are enough x values, etc.
(let [x (range 50 70)
y [(range 5) (range 10) (range 3) (range 2)] ;my partition of 20 items
lengths (mapv count y)
x2 (loop [xvals x
lens lengths
result []]
(if (empty? lens)
result ; return result when no more segments wanted
(let [len-curr (first lens)
xvals-next (drop len-curr xvals)
lens-next (rest lens)
result-next (conj result (vec (take len-curr xvals)))]
(recur xvals-next lens-next result-next))))]
(is= lengths [5 10 3 2])
(is= x2
[[50 51 52 53 54]
[55 56 57 58 59 60 61 62 63 64]
[65 66 67]
[68 69]])))
When using loop/recur, I quite like the readability of making explicit the *-next values that will be passed into the succeeding loop. I find it unnecessarily difficult to read code that does everything inline in a big, complicated recur form.
Also, since Clojure data is immutable, it doesn't matter that I compute xvals-next before using the current xvals to compute result-next.
Built using my favorite template project.
(require '[com.rpl.specter :as s])
(let [x (range 50 70)
y [(range 5) (range 10) (range 3) (range 2)]]
(s/setval [s/ALL (s/subselect s/ALL)] x y))
I have compiled Quantum ESPRESSO (Program PWSCF v.6.7MaX) for GPU acceleration (hybrid MPI/OpenMP) with the following options:
module load compiler/intel/2020.1
module load hpc_sdk/20.9
./configure F90=pgf90 CC=pgcc MPIF90=mpif90 --with-cuda=yes --enable-cuda-env-check=no --with-cuda-runtime=11.0 --with-cuda-cc=70 --enable-openmp BLAS_LIBS='-lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core'
make -j8 pw
Apparently, the compilation ends successfully. Then, I execute the program:
export OMP_NUM_THREADS=1
mpirun -n 2 /home/my_user/q-e-gpu-qe-gpu-6.7/bin/pw.x < silverslab32.in > silver4.out
Then, the program starts running and prints out the following info:
Parallel version (MPI & OpenMP), running on 8 processor cores
Number of MPI processes: 2
Threads/MPI process: 4
...
GPU acceleration is ACTIVE
...
Estimated max dynamical RAM per process > 13.87 GB
Estimated total dynamical RAM > 27.75 GB
But after 2 minutes of execution, the job ends with an error:
0: ALLOCATE: 4345479360 bytes requested; status = 2(out of memory)
0: ALLOCATE: 4345482096 bytes requested; status = 2(out of memory)
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[47946,1],1]
Exit code: 127
--------------------------------------------------------------------------
This node has > 180GB of available RAM. I checked the memory usage with the top command:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
89681 my_user 20 0 30.1g 3.6g 2.1g R 100.0 1.9 1:39.45 pw.x
89682 my_user 20 0 29.8g 3.2g 2.0g R 100.0 1.7 1:39.30 pw.x
I noticed that the process stops when RES memory reaches 4GB. These are the characteristics of the node:
(base) [my_user#gpu001]$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 28 29 30 31 32 33 34 35 36 37 38 39 40 41
node 0 size: 95313 MB
node 0 free: 41972 MB
node 1 cpus: 14 15 16 17 18 19 20 21 22 23 24 25 26 27 42 43 44 45 46 47 48 49 50 51 52 53 54 55
node 1 size: 96746 MB
node 1 free: 70751 MB
node distances:
node 0 1
0: 10 21
1: 21 10
(base) [my_user#gpu001]$ free -lm
total used free shared buff/cache available
Mem: 192059 2561 112716 260 76781 188505
Low: 192059 79342 112716
High: 0 0 0
Swap: 8191 0 8191
The version of MPI is:
mpirun (Open MPI) 3.1.5
This node is a compute node in a cluster, but whether I submit the job with SLURM or run it directly on the node, the error is the same.
Note that I compiled it on the login node and run it on this GPU node; the difference is that the login node has no GPU attached.
I would really appreciate it if you could help me figure out what could be going on.
Thank you in advance!
I'm puzzled by the results of computing a sequence in different ways:
(defun fig-square-p (n)
"Check if the given N is a perfect square number.
A000290 in the OEIS"
(check-type n (integer 0 *))
(= (floor (sqrt n)) (ceiling (sqrt n))))
(defun fibonaccip (n)
"Check if the given number N is a Fibonacci number."
(check-type n (integer 0 *))
(or (fig-square-p (+ (* 5 (expt n 2)) 4))
(fig-square-p (- (* 5 (expt n 2)) 4))))
(defun fibonacci (n)
"Compute N's Fibonacci number."
(check-type n (integer 0 *))
(loop :for f1 = 0 :then f2
:and f2 = 1 :then (+ f1 f2)
:repeat n :finally (return f1)))
(defun seq-fibonaccies (n)
"Return sequence of Fibonacci numbers upto N."
(check-type n (integer 0 *))
(loop :for i :from 1 :upto n
:collect (fib i)))
CL-USER> (loop :for i :from 0 :upto 7070 :when (fibonaccip i) :collect i)
(0 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597 2584 2889 3876 4181 5473
6765 7070)
CL-USER> (seq-fibonaccies 21)
(1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597 2584 4181 6765 10946)
And when I increase the limit of the loop, the results diverge even more.
~$ sbcl --version
SBCL 1.4.14-3.fc31
As someone else already mentioned, rounding errors accumulate quickly, so you should stick with integer arithmetic, using isqrt:
(defun squarep (n)
"Check if the given N is a perfect square number.
https://oeis.org/A000290"
(check-type n (integer 0 *))
(let ((sqrt (isqrt n)))
(= n (* sqrt sqrt))))
Also, you have a typo in seq-fibonaccies (fib instead of fibonacci).
Finally, seq-fibonaccies is quadratic in its argument, while it only has to be linear.
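A linear version could, for example, collect the numbers in a single pass with the same loop idiom fibonacci uses (just a sketch):
(defun seq-fibonaccies (n)
  "Return the first N Fibonacci numbers (1 1 2 3 5 ...) in a single pass."
  (check-type n (integer 0 *))
  ;; f1 and f2 are updated in parallel, so each step costs O(1) extra work
  (loop :for f1 = 0 :then f2
        :and f2 = 1 :then (+ f1 f2)
        :repeat n
        :collect f2))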
fibonaccip is going to give an approximate answer because you don't have exact values for the square roots (sqrt works in floating point). As the value of n increases, the error will also increase.
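For example, a version of the check that sticks to integer arithmetic throughout (a sketch, reusing the squarep from the other answer) could look like this:
(defun fibonaccip (n)
  "Check if the given number N is a Fibonacci number, using exact integer arithmetic."
  (check-type n (integer 0 *))
  ;; N is a Fibonacci number iff 5*N^2+4 or 5*N^2-4 is a perfect square
  (or (squarep (+ (* 5 n n) 4))
      (squarep (- (* 5 n n) 4))))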
Consider the following 4 functions in Julia: they all pick/compute a random column of a matrix A and add a constant times this column to a vector z.
The difference between slow1 and fast1 is how z is updated and likewise for slow2 and fast2.
The difference between the 1 functions and 2 functions is whether the matrix A is passed to the functions or computed on the fly.
The odd thing is that for the 1 functions, fast1 is faster (as I would expect when using BLAS instead of +=), but for the 2 functions slow2 is faster.
On this computer I get the following timings (for the second run of each function):
@time slow1(A, z, 10000);
0.172560 seconds (110.01 k allocations: 940.102 MB, 12.98% gc time)
@time fast1(A, z, 10000);
0.142748 seconds (50.07 k allocations: 313.577 MB, 4.56% gc time)
@time slow2(complex(float(x)), complex(float(y)), z, 10000);
2.265950 seconds (120.01 k allocations: 1.529 GB, 1.20% gc time)
@time fast2(complex(float(x)), complex(float(y)), z, 10000);
4.351953 seconds (60.01 k allocations: 939.410 MB, 0.43% gc time)
Is there an explanation to this behaviour? And a way to make BLAS faster than +=?
M = 2^10
x = [-M:M-1;]
N = 2^9
y = [-N:N-1;]
A = cis( -2*pi*x*y' )
z = rand(2*M) + rand(2*M)*im
function slow1(A::Matrix{Complex{Float64}}, z::Vector{Complex{Float64}}, maxiter::Int)
S = [1:size(A,2);]
for iter = 1:maxiter
idx = rand(S)
col = A[:,idx]
a = rand()
z += a*col
end
end
function fast1(A::Matrix{Complex{Float64}}, z::Vector{Complex{Float64}}, maxiter::Int)
S = [1:size(A,2);]
for iter = 1:maxiter
idx = rand(S)
col = A[:,idx]
a = rand()
BLAS.axpy!(a, col, z)
end
end
function slow2(x::Vector{Complex{Float64}}, y::Vector{Complex{Float64}}, z::Vector{Complex{Float64}}, maxiter::Int)
S = [1:length(y);]
for iter = 1:maxiter
idx = rand(S)
col = cis( -2*pi*x*y[idx] )
a = rand()
z += a*col
end
end
function fast2(x::Vector{Complex{Float64}}, y::Vector{Complex{Float64}}, z::Vector{Complex{Float64}}, maxiter::Int)
S = [1:length(y);]
for iter = 1:maxiter
idx = rand(S)
col = cis( -2*pi*x*y[idx] )
a = rand()
BLAS.axpy!(a, col, z)
end
end
Update:
Profiling slow2:
2260 task.jl; anonymous; line: 92
2260 REPL.jl; eval_user_input; line: 63
2260 profile.jl; anonymous; line: 16
2175 /tmp/axpy.jl; slow2; line: 37
10 arraymath.jl; .*; line: 118
33 arraymath.jl; .*; line: 120
5 arraymath.jl; .*; line: 125
46 arraymath.jl; .*; line: 127
3 complex.jl; cis; line: 286
3 complex.jl; cis; line: 287
2066 operators.jl; cis; line: 374
72 complex.jl; cis; line: 286
1914 complex.jl; cis; line: 287
1 /tmp/axpy.jl; slow2; line: 38
84 /tmp/axpy.jl; slow2; line: 39
5 arraymath.jl; +; line: 96
39 arraymath.jl; +; line: 98
6 arraymath.jl; .*; line: 118
34 arraymath.jl; .*; line: 120
Profiling fast2:
4288 task.jl; anonymous; line: 92
4288 REPL.jl; eval_user_input; line: 63
4288 profile.jl; anonymous; line: 16
1 /tmp/axpy.jl; fast2; line: 47
1 random.jl; rand; line: 214
3537 /tmp/axpy.jl; fast2; line: 48
26 arraymath.jl; .*; line: 118
44 arraymath.jl; .*; line: 120
1 arraymath.jl; .*; line: 122
4 arraymath.jl; .*; line: 125
53 arraymath.jl; .*; line: 127
7 complex.jl; cis; line: 286
3399 operators.jl; cis; line: 374
116 complex.jl; cis; line: 286
3108 complex.jl; cis; line: 287
2 /tmp/axpy.jl; fast2; line: 49
748 /tmp/axpy.jl; fast2; line: 50
748 linalg/blas.jl; axpy!; line: 231
Oddly, the computing time of col differs even though the functions are identical up to this point.
But += is still relatively faster than axpy!.
Some more info now that Julia 0.6 is out. To multiply a vector by a scalar in place, there are at least four options. Following Tim's suggestions, I used BenchmarkTools' @btime macro. It turns out that loop fusion, the most Julian way to write it, is on par with calling BLAS. That's something the Julia developers can be proud of!
using BenchmarkTools
function bmark(N)
a = zeros(N);
@btime $a *= -1.;
@btime $a .*= -1.;
@btime LinAlg.BLAS.scal!($N, -1.0, $a, 1);
@btime scale!($a, -1.);
end
And the results for 10^5 numbers.
julia> bmark(10^5);
78.195 μs (2 allocations: 781.33 KiB)
35.102 μs (0 allocations: 0 bytes)
34.659 μs (0 allocations: 0 bytes)
34.664 μs (0 allocations: 0 bytes)
The profiling backtrace shows that scale! just calls BLAS in the background, so they should give the same best time.
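The same fusion idea applies to the z += a*col update from the question; on Julia 0.6 and later, a fused, in-place sketch (not benchmarked here) would be:
# The dotted operators fuse into a single loop that writes into z in place,
# so no temporary a*col vector is allocated.
z .+= a .* col
# or, equivalently, with the broadcast macro:
# @. z += a * col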
I am submitting a job using qsub that runs parallelized R. My intention is to have the R program running on 4 cores rather than 8. Here are some of my settings in the PBS file:
#PBS -l nodes=1:ppn=4
....
time R --no-save < program1.R > program1.log
I am issuing the command ta job_id and I see that 4 cores are listed. However, the job occupies a large amount of memory (31944900k used vs 32949628k total). If I were to use 8 cores, the job would hang due to the memory limitation.
top - 21:03:53 up 77 days, 11:54, 0 users, load average: 3.99, 3.75, 3.37
Tasks: 207 total, 5 running, 202 sleeping, 0 stopped, 0 zombie
Cpu(s): 30.4%us, 1.6%sy, 0.0%ni, 66.8%id, 0.0%wa, 0.0%hi, 1.2%si, 0.0%st
Mem: 32949628k total, 31944900k used, 1004728k free, 269812k buffers
Swap: 2097136k total, 8360k used, 2088776k free, 6030856k cached
Here is a snapshot when issuing command ta job_id
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1794 x 25 0 6247m 6.0g 1780 R 99.2 19.1 8:14.37 R
1795 x 25 0 6332m 6.1g 1780 R 99.2 19.4 8:14.37 R
1796 x 25 0 6242m 6.0g 1784 R 99.2 19.1 8:14.37 R
1797 x 25 0 6322m 6.1g 1780 R 99.2 19.4 8:14.33 R
1714 x 18 0 65932 1504 1248 S 0.0 0.0 0:00.00 bash
1761 x 18 0 63840 1244 1052 S 0.0 0.0 0:00.00 20016.hpc
1783 x 18 0 133m 7096 1128 S 0.0 0.0 0:00.00 python
1786 x 18 0 137m 46m 2688 S 0.0 0.1 0:02.06 R
How can I prevent other users from using the other 4 cores? I would like to somehow make it appear that my job is using all 8 cores, with 4 of them idling.
Could anyone kindly help me out with this? Can this be solved using PBS?
Many thanks!
"How can I prevent other users from using the other 4 cores? I like to mask somehow that my job is using 8 cores with 4 cores idling."
Maybe a simple way around it is to send a 'sleep' job on the other 4? Seems hackish though! (And a warning: my PBS is rusty!)
Why not do the following: ask PBS for ppn=4 and, additionally, ask for all the memory on the node, i.e.
#PBS -l nodes=1:ppn=4 -l mem=31944900k
This might not be possible on your setup.
I am not sure how R is parallelized, but if it is OpenMP you could definitely ask for 8 cores but set OMP_NUM_THREADS to 4, as sketched below.
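For example, a sketch of the PBS script (assuming the R code really is OpenMP-backed; file names as in the question):
#PBS -l nodes=1:ppn=8
# reserve all 8 cores on the node, but let OpenMP use only 4 threads
export OMP_NUM_THREADS=4
time R --no-save < program1.R > program1.log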