Parallelize two (or more) functions in Julia

I am trying to solve a wave equation problem (related to my PhD) using the finite difference method. For this, I have translated (line by line) a Fortran code (link below): https://github.com/geodynamics/seismic_cpml/blob/master/seismic_CPML_2D_anisotropic.f90
Inside this code, within the time loop, there are four main loops that are independent of each other. In fact, I could arrange them into four functions.
As I have to run this code about a hundred times, it would be nice to speed up the process. With this in mind, I am turning my eyes toward parallelization. See below for an example:
function main()
    ...some common code...
    for time = 1:N
        fun1() # I want this function to run in parallel...
        fun2() # ...this one in parallel with fun1, fun3, fun4
        fun3() # ...this one in parallel with fun1, fun2, fun4
        fun4() # ...this one in parallel with fun1, fun2, fun3
    end
    ...more code here...
    return
end
So,
1) Is it possible to do what I mentioned above?
2) Will this approach speed up my code?
3) Is there a better way to think about this problem?
A minimal working example could look like this:
function fun1(t)
    for i = 1:1000
        for j = 1:1000
            t += (0.5)^t + (0.3)^(t-1);
        end
    end
    return t
end
function fun2(t)
    for i = 1:1000
        for j = 1:1000
            t += (0.5)^t;
        end
    end
    return t
end
function fun3(r)
    for i = 1:1000
        for j = 1:1000
            r = (r + rand())/r;
        end
    end
    return r
end
function main()
    a = 2;
    b = 2.5;
    c = 3.0;
    for i = 1:100
        a = fun1(a);
        b = fun2(b);
        c = fun3(c);
    end
    return;
end
So, as can be seen, none of the three functions above (fun1, fun2 & fun3) depends on any other, so they can surely run in parallel. Can this be achieved? Will it boost my computational speed?
Edited:
Hi @BogumiłKamiński, I have altered the finite-difference equations in order to implement a "loop" (as you suggested) over the inputs and outputs of my functions. If it is not too much trouble, I would like your opinion on the parallelization design of the code:
Key elements
1) I have packed all inputs into 4 tuples: sig_xy_in and sig_xy_cros_in (for the 2 sigma functions) and vel_vx_in and vel_vy_in (for the 2 velocity functions). I then packed the 4 tuples into 2 vectors for "looping" purposes...
2) I packed the 4 functions into 2 vectors for "looping" purposes...
3) I run the first parallel loop and then unpack its output tuple...
4) I run the second parallel loop (for velocities) and then unpack its output tuple...
5) Finally, I pack the outputted elements into the input tuples and continue the time loop until it finishes...
...code
l = Threads.SpinLock()
arg_in_sig = [sig_xy_in,sig_xy_cros_in]; # Inputs tuples x sigma funct
arg_in_vel = [vel_vx_in, vel_vy_in]; # Inputs tuples x velocity funct
func_sig = [sig_xy , sig_xy_cros]; # Vector with two sigma functions
func_vel = [vel_vx , vel_vy]; # Vector with two velocity functions
for it = 1:NSTEP # time steps
#------------------------------------------------------------
# Compute sigma functions
#------------------------------------------------------------
Threads.@threads for j in 1:2 # Start parallel run of the two sigma functions
    Threads.lock(l);
    Threads.unlock(l);
    arg_in_sig[j] = func_sig[j](arg_in_sig[j]);
end
# Unpack tuples for sig_xy and sig_xy_cros
# Unpack tuples for sig_xy
sigxx = arg_in_sig[1][1]; # changed by sig_xy
sigyy = arg_in_sig[1][2]; # changed by sig_xy
m_dvx_dx = arg_in_sig[1][3]; # changed by sig_xy
m_dvy_dy = arg_in_sig[1][4]; # changed by sig_xy
vx = arg_in_sig[1][5]; # unchanged by sig_xy
vy = arg_in_sig[1][6]; # unchanged by sig_xy
delx_1 = arg_in_sig[1][7]; # unchanged by sig_xy
dely_1 = arg_in_sig[1][8]; # unchanged by sig_xy
...more unpacking...
# Unpack tuples for sig_xy_cros
sigxy = arg_in_sig[2][1]; # changed by sig_xy_cros
m_dvy_dx = arg_in_sig[2][2]; # changed by sig_xy_cros
m_dvx_dy = arg_in_sig[2][3]; # changed by sig_xy_cros
vx = arg_in_sig[2][4]; # unchanged by sig_xy_cros
vy = arg_in_sig[2][5]; # unchanged by sig_xy_cros
...more unpacking....
#--------------------------------------------------------
# velocity
#--------------------------------------------------------
Threads.@threads for j in 1:2 # Start parallel run of the two velocity functions
    Threads.lock(l)
    Threads.unlock(l)
    arg_in_vel[j] = func_vel[j](arg_in_vel[j])
end
# Unpack tuples for vel_vx
vx = arg_in_vel[1][1]; # changed by vel_vx
m_dsigxx_dx = arg_in_vel[1][2]; # changed by vel_vx
m_dsigxy_dy = arg_in_vel[1][3]; # changed by vel_vx
sigxx = arg_in_vel[1][4]; # unchanged by vel_vx
sigxy = arg_in_vel[1][5];....
# Unpack tuples for vel_vy
vy = arg_in_vel[2][1]; # changed by vel_vy
m_dsigxy_dx = arg_in_vel[2][2]; # changed by vel_vy
m_dsigyy_dy = arg_in_vel[2][3]; # changed by vel_vy
sigxy = arg_in_vel[2][4]; # unchanged by vel_vy
sigyy = arg_in_vel[2][5]; # unchanged by vel_vy
.....
...more unpacking...
# ensamble new input variables
sig_xy_in = (sigxx,sigyy,
m_dvx_dx,m_dvy_dy,
vx,vy,....);
sig_xy_cros_in = (sigxy,
m_dvy_dx,m_dvx_dy,
vx,vy,....;
vel_vx_in = (vx,....
vel_vy_in = (vy,.....
end #time loop

Here is a simple way to run your code in multithreading mode:
function fun1(t)
    for i = 1:1000
        for j = 1:1000
            t += (0.5)^t + (0.3)^(t-1);
        end
    end
    return t
end
function fun2(t)
    for i = 1:1000
        for j = 1:1000
            t += (0.5)^t;
        end
    end
    return t
end
function fun3(r)
    for i = 1:1000
        for j = 1:1000
            r = (r + rand())/r;
        end
    end
    return r
end
function main()
    l = Threads.SpinLock()
    a = [2.0, 2.5, 3.0]
    f = [fun1, fun2, fun3]
    Threads.@threads for i in 1:3
        for j in 1:4
            Threads.lock(l)
            println((thread=Threads.threadid(), iteration=j))
            Threads.unlock(l)
            a[i] = f[i](a[i])
        end
    end
    return a
end
I have added locking - just as an example of how you can do it (in Julia 1.3 you will not have to do this, as IO is thread-safe there).
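For comparison, on Julia 1.3 or newer the same three functions could also be run as one task each with Threads.@spawn and no lock at all. A minimal sketch (main_spawn is just an illustrative name; it reuses fun1, fun2 and fun3 from above and calls each of them once):
function main_spawn()
    t1 = Threads.@spawn fun1(2.0)   # each @spawn schedules the call as a separate task
    t2 = Threads.@spawn fun2(2.5)
    t3 = Threads.@spawn fun3(3.0)
    return fetch(t1), fetch(t2), fetch(t3)   # fetch waits for each task and returns its result
end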
Also note that rand() shares its state among threads prior to Julia 1.3, so it would not be safe to run these functions in parallel if all of them used rand() (again, in Julia 1.3 it is safe to do so).
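A common workaround on those older versions is to give each thread its own random number generator and pass it to rand explicitly. A sketch (the name fun3_threadsafe and the seeding scheme are just for illustration):
using Random

const RNGS = [MersenneTwister(1234 + i) for i in 1:Threads.nthreads()]

function fun3_threadsafe(r)
    rng = RNGS[Threads.threadid()]   # each thread only ever touches its own RNG
    for i = 1:1000, j = 1:1000
        r = (r + rand(rng))/r
    end
    return r
end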
To run this code, first set the maximum number of threads you want to use, e.g. like this on Windows: set JULIA_NUM_THREADS=4 (on Linux you should export it). Here is an example run of this code (I have reduced the number of iterations in order to shorten the output):
julia> main()
(thread = 1, iteration = 1)
(thread = 3, iteration = 1)
(thread = 2, iteration = 1)
(thread = 3, iteration = 2)
(thread = 3, iteration = 3)
(thread = 3, iteration = 4)
(thread = 2, iteration = 2)
(thread = 1, iteration = 2)
(thread = 2, iteration = 3)
(thread = 2, iteration = 4)
(thread = 1, iteration = 3)
(thread = 1, iteration = 4)
3-element Array{Float64,1}:
21.40311930108456
21.402807510451463
1.219028489573526
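You can check that the setting took effect from within Julia (assuming JULIA_NUM_THREADS=4 was set before the session was started):
julia> Threads.nthreads()
4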
Now one small cautionary note: while it is relatively easy to make code multithreaded in Julia (and in Julia 1.3 it will be even simpler), you have to be careful when you do it, as you have to take care of race conditions.
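As a minimal illustration of such a race condition (a sketch; the function names are just for illustration): both functions below increment a counter 10_000 times from a threaded loop, but only the atomic version reliably returns 10_000 when more than one thread is used.
function racy()
    x = 0
    Threads.@threads for i in 1:10_000
        x += 1                       # unsynchronized read-modify-write: increments can be lost
    end
    return x
end

function safe()
    x = Threads.Atomic{Int}(0)
    Threads.@threads for i in 1:10_000
        Threads.atomic_add!(x, 1)    # atomic increment, safe across threads
    end
    return x[]
end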

Related

Julia UndefVarError on Metaprogramming

I'm trying to write a solver for equations. When I run the code, the x variable appears to be undefined, but it prints out perfectly. What am I missing?
I give the program some numbers, then operations as macros, and it should create an outer-product matrix of the operations applied.
function msu()
    print("Insert how many values: ")
    quantity = parse(Int64, readline())
    values = []
    for i in 1:quantity
        println("x$i")
        num1 = parse(Float64, readline())
        push!(values, num1)
    end
    println(values)
    print("How many operations? ")
    quantity = parse(Int64, readline())
    ops = []
    for i in 1:quantity
        push!(ops, Meta.parse(readline()))
    end
    mat = zeros((quantity, quantity))
    for i in 1:length(mat)
        sum = 0
        for j in 1:length(values)
            # here begins problems, the following prints are for debugging purpose
            print(length(values))
            func = Meta.parse("$(ops[convert(Int64, ceil(j / quantity))]) * $(ops[convert(Int64, j % quantity)])")
            print(func)
            x = values[j]
            println(x)
            sum += eval(func)
        end
        mat[i] = sum
    end
    println(mat)
end
msu()
The original code was in Spanish, if you find any typo it's probably because I skipped a translation.
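One detail that matters for the failing line: eval evaluates an expression in the module's global scope, so a local variable such as x assigned inside a function is not visible to it, even though println(x) works fine. A minimal sketch of the difference (eval_demo is just an illustrative name; interpolating the value into the expression is one possible way around it):
function eval_demo()
    x = 2.0
    try
        eval(:(x + 1))        # eval runs in global scope: UndefVarError if no global x exists
    catch err
        println(err)          # prints the UndefVarError for :x
    end
    return eval(:($x + 1))    # interpolating the *value* of the local x works: returns 3.0
end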

@distributed seems to work, function return is wonky

I'm just learning how to do parallel computing in Julia. I'm using @sync @distributed at the start of a 3x nested for loop to parallelize things (see code at bottom). From the line println(errCmp[row, col]) I can watch all the elements of the array errCmp be printed out. E.g.
From worker 3: 2.351134946074191e9
From worker 4: 2.3500830193505473e9
From worker 5: 2.3502416529551845e9
From worker 2: 2.3509105625656652e9
From worker 3: 2.3508352842971106e9
From worker 4: 2.3497049296121807e9
From worker 5: 2.35048428351797e9
From worker 2: 2.350742582031195e9
From worker 3: 2.350616273660934e9
From worker 4: 2.349709546599313e9
However, when the function returns, errCmp is the array of zeros I pre-allocated at the beginning.
Am I missing some closing term to collect everything?
function optimizeDragCalc(df::DataFrame)
    paramGrid = [cd*AoM for cd = range(1e-3, stop = 0.01, length = 50), AoM = range(2e-4, stop = 0.0015, length = 50)]
    errCmp = zeros(size(paramGrid))
    # totalSize = size(paramGrid, 1) * size(paramGrid, 2) * size(df.time, 1)
    @sync @distributed for row = 1:size(paramGrid, 1)
        for col = 1:size(paramGrid, 2)
            # Run the propagation here
            BC = 1/paramGrid[row, col]
            slns, _ = propWholeTraj(df, BC)
            for time = 1:size(df.time, 1)
                errDF = propError(slns[time], df, time)
                errCmp[row, col] += sum(errDF.totalErr)
            end # time
            # println("row: ", row, " of ", size(paramGrid, 1), " col: ", col, " of ", size(paramGrid, 2))
            println(errCmp[row, col])
        end # col
    end # row
    # plot(heatmap(z = errCmp))
    return errCmp, paramGrid
end
errCmp, paramGrid = @time optimizeDragCalc(df)
You did not provide a minimal working example, but I guess it might be hard to do. So here is my MWE. Let us assume that we want to use Distributed to calculate the mean of each column of an Array:
using Distributed
addprocs(2)
@everywhere using StatsBase
data = rand(1000,2000)
res = zeros(2000)
@sync @distributed for col = 1:size(data)[2]
    res[col] = StatsBase.mean(data[:,col])
    # does not work!
    # ...because each worker updates its own local copy of res, which is never sent back to the master
end
In order to correct the above code you need to provide an aggregator function (I keep the example intentionally simplified - a further optimization is possible).
using Distributed
addprocs(2)
@everywhere using Distributed, StatsBase
data = rand(1000,2000)
@everywhere function t2(d1,d2)
    append!(d1,d2)
    d1
end
res = @sync @distributed (t2) for col = 1:size(data)[2]
    [(myid(), col, StatsBase.mean(data[:,col]))]
end
Now let us see the output. We can see that some of the values have been calculated on worker 2 while others on worker 3:
julia> res
2000-element Array{Tuple{Int64,Int64,Float64},1}:
(2, 1, 0.49703681326230276)
(2, 2, 0.5035341367791002)
(2, 3, 0.5050607022354537)
⋮
(3, 1998, 0.4975699181976122)
(3, 1999, 0.5009498778934444)
(3, 2000, 0.499671315490524)
Further possible improvements/modifications:
use @spawnat to generate values at the remote processes (instead of generating them on the master process and sending them over)
use SharedArrays - this allows data to be automatically shared among workers; from my experience it requires very careful programming (see the sketch after this list)
use ParallelDataTransfer.jl to send data among workers - very easy to use, but not efficient for a huge number of messages
always consider Julia's threading mechanism (in some scenarios it makes life easier - again, it depends on the problem)
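For example, here is a sketch of the SharedArrays variant of the same column-mean computation (single machine only; data is still copied to the workers, but res lives in shared memory, so the writes survive):
using Distributed
addprocs(2)
@everywhere using SharedArrays, StatsBase

data = rand(1000, 2000)
res = SharedArray{Float64}(2000)              # visible to the master and all local workers
@sync @distributed for col = 1:size(data)[2]
    res[col] = StatsBase.mean(data[:, col])   # writes go to shared memory
end
res                                           # now holds the 2000 column means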

Julia pmap performance

I am trying to port some of my R code to Julia;
Basically I have rewritten the following R code in Julia:
library(parallel)
eps_1<-rnorm(1000000)
eps_2<-rnorm(1000000)
large_matrix<-ifelse(cbind(eps_1,eps_2)>0,1,0)
matrix_to_compare = expand.grid(c(0,1),c(0,1))
indices<-seq(1,1000000,4)
large_matrix<-lapply(indices,function(i)(large_matrix[i:(i+3),]))
function_compare<-function(x){
which((rowSums(x==matrix_to_compare)==2) %in% TRUE)
}
> system.time(lapply(large_matrix,function_compare))
user system elapsed
38.812 0.024 38.828
> system.time(mclapply(large_matrix,function_compare,mc.cores=11))
user system elapsed
63.128 1.648 6.108
As one can notice I am getting significant speed-up when going from one core to 11. Now I am trying to do the same in Julia:
#Define cluster:
addprocs(11);
using Distributions;
@everywhere using Iterators;
d = Normal();
eps_1 = rand(d,1000000);
eps_2 = rand(d,1000000);
#Create a large matrix:
large_matrix = hcat(eps_1,eps_2).>=0;
indices = collect(1:4:1000000)
#Split large matrix:
large_matrix = [large_matrix[i:(i+3),:] for i in indices];
#Define the function to apply:
@everywhere function function_split(x)
    matrix_to_compare = transpose(reinterpret(Int,collect(product([0,1],[0,1])),(2,4)));
    matrix_to_compare = matrix_to_compare.>0;
    find(sum(x.==matrix_to_compare,2).==2)
end
@time map(function_split,large_matrix)
@time pmap(function_split,large_matrix)
5.167820 seconds (22.00 M allocations: 2.899 GB, 12.83% gc time)
18.569198 seconds (40.34 M allocations: 2.082 GB, 5.71% gc time)
As one can notice I am not getting any speed up with pmap. Maybe somebody can suggest alternatives.
I think that some of the problem here is that @parallel and pmap don't always handle moving data to and from the workers very well. Thus, they tend to work best in situations where what you are executing doesn't require very much data movement at all. I also suspect that there are probably things that could be done to improve their performance, but I'm not certain on the details.
For situations in which you do need more data moving around, it may be best to stick with options that directly call functions on workers, with those functions then accessing objects within the memory space of those workers. I give one example below, which speeds up your function using multiple workers. It uses perhaps the simplest option, which is @everywhere, but @spawn, remotecall(), etc. are also worth considering, depending on your situation.
addprocs(11);
using Distributions;
@everywhere using Iterators;
d = Normal();
eps_1 = rand(d,1000000);
eps_2 = rand(d,1000000);
#Create a large matrix:
large_matrix = hcat(eps_1,eps_2).>=0;
indices = collect(1:4:1000000);
#Split large matrix:
large_matrix = [large_matrix[i:(i+3),:] for i in indices];
large_matrix = convert(Array{BitArray}, large_matrix);
function sendto(p::Int; args...)
    for (nm, val) in args
        @spawnat(p, eval(Main, Expr(:(=), nm, val)))
    end
end
getfrom(p::Int, nm::Symbol; mod=Main) = fetch(@spawnat(p, getfield(mod, nm)))
@everywhere function function_split(x::BitArray)
    matrix_to_compare = transpose(reinterpret(Int,collect(product([0,1],[0,1])),(2,4)));
    matrix_to_compare = matrix_to_compare.>0;
    find(sum(x.==matrix_to_compare,2).==2)
end
function distribute_data(X::Array, WorkerName::Symbol)
    size_per_worker = floor(Int,size(X,1) / nworkers())
    StartIdx = 1
    EndIdx = size_per_worker
    for (idx, pid) in enumerate(workers())
        if idx == nworkers()
            EndIdx = size(X,1)
        end
        @spawnat(pid, eval(Main, Expr(:(=), WorkerName, X[StartIdx:EndIdx])))
        StartIdx = EndIdx + 1
        EndIdx = EndIdx + size_per_worker - 1
    end
end
distribute_data(large_matrix, :large_matrix)
function parallel_split()
    @everywhere begin
        if myid() != 1
            result = map(function_split, large_matrix);
        end
    end
    results = cell(nworkers())
    for (idx, pid) in enumerate(workers())
        results[idx] = getfrom(pid, :result)
    end
    vcat(results...)
end
## results given after running once to compile
@time a = map(function_split, large_matrix);  ## 6.499737 seconds (22.00 M allocations: 2.899 GB, 13.99% gc time)
@time b = parallel_split();  ## 1.097586 seconds (1.50 M allocations: 64.508 MB, 3.28% gc time)
julia> a == b
true
Note: even with this, the speedup from using multiple processes is not perfect. But this is to be expected, since there is still a moderate amount of data to be returned as a result of your function, and that data has to be moved, which takes time.
P.S. See this post (Julia: How to copy data to another processor in Julia) or this package (https://github.com/ChrisRackauckas/ParallelDataTransfer.jl) for more on the sendto and getfrom functions I used here.

Julia: swap gives errors

I'm using Julia 0.3.4
I'm trying to write an LU decomposition using Gaussian elimination. So I have to swap rows. And here's my problem:
If I'm using a,b = b,a I get an error,
but if I'm using:
function swapRows(row1, row2)
    temp = row1
    row1 = row2
    row2 = temp
end
then everything works just fine.
Am I doing something wrong or it's a bug?
Here's my source code:
function lu_t(A::Matrix)
    # input value: (A), where A is a matrix
    # return value: (L,U), where L,U are matrices
    function swapRows(row1, row2)
        temp = row1
        row1 = row2
        row2 = temp
        return null
    end
    if size(A)[1] != size(A)[2]
        throw(DimException())
    end
    n = size(A)[1]   # matrix dimension
    U = copy(A)      # upper triangular matrix
    L = eye(n)       # lower triangular matrix
    for k = 1:n-1    # direct Gaussian elimination for each column `k`
        (val,id) = findmax(U[k:end,k])   # find max pivot element and its row `id`
        if val == 0                      # check matrix for singularity
            throw(SingularException())
        end
        swapRows(U[k,k:end], U[id,k:end])   # swap row `k` and `id`
        # U[k,k:end],U[id,k:end] = U[id,k:end],U[k,k:end] - error
        for i = k+1:n                    # for each row `i` > `k`
            μ = U[i,k] / U[k,k]          # find elimination coefficient `μ`
            L[i,k] = μ                   # save to an appropriate position in lower triangular matrix `L`
            for j = k:n                  # update each value of the row `i`
                U[i,j] = U[i,j] - μ⋅U[k,j]
            end
        end
    end
    return (L,U)
end
###### main code ######
A = rand(4,4)
@time (L,U) = lu_t(A)
@test_approx_eq(L*U, A)
The swapRows function is a no-op and has no effect whatsoever – all it does is swap around some local variable names. See various discussions of the difference between assignment and mutation:
https://groups.google.com/d/msg/julia-users/oSW5hH8vxAo/llAHRvvFVhMJ
http://julia.readthedocs.org/en/latest/manual/faq/#i-passed-an-argument-x-to-a-function-modified-it-inside-that-function-but-on-the-outside-the-variable-x-is-still-unchanged-why
http://julia.readthedocs.org/en/latest/manual/faq/#why-does-x-y-allocate-memory-when-x-and-y-are-arrays
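In a minimal sketch, the distinction looks like this (reassign and mutate! are just illustrative names):
function reassign(v)
    v = [0, 0, 0]   # rebinds the local name `v`; the caller's array is untouched
    return v
end

function mutate!(v)
    v[1] = 0        # writes into the array the caller passed in
    return v
end

a = [1, 2, 3]
reassign(a); a      # still [1, 2, 3]
mutate!(a); a       # now [0, 2, 3]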
The constant null doesn't mean what you think it does – in Julia v0.3 it's a function that computes the null space of a linear transformation; in Julia v0.4 it still means this but has been deprecated and renamed to nullspace. The "uninteresting" value in Julia is called nothing.
I'm not sure what's wrong with your commented out row swapping code, but this general approach does work:
julia> X = rand(3,4)
3x4 Array{Float64,2}:
0.149066 0.706264 0.983477 0.203822
0.478816 0.0901912 0.810107 0.675179
0.73195 0.756805 0.345936 0.821917
julia> X[1,:], X[2,:] = X[2,:], X[1,:]
(
1x4 Array{Float64,2}:
0.478816 0.0901912 0.810107 0.675179,
1x4 Array{Float64,2}:
0.149066 0.706264 0.983477 0.203822)
julia> X
3x4 Array{Float64,2}:
0.478816 0.0901912 0.810107 0.675179
0.149066 0.706264 0.983477 0.203822
0.73195 0.756805 0.345936 0.821917
Since this creates a pair of temporary arrays that we can't yet eliminate the allocation of, this isn't the most efficient approach. If you want the most efficient code here, looping over the two rows and swapping pairs of scalar values will be faster:
function swapRows!(X, i, j)
    for k = 1:size(X,2)
        X[i,k], X[j,k] = X[j,k], X[i,k]
    end
end
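For a quick check of the in-place version (a small usage sketch):
X = [1.0 2.0 3.0; 4.0 5.0 6.0]
swapRows!(X, 1, 2)
X   # the two rows are now exchanged, with no temporary row arrays allocated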
Note that it is conventional in Julia to name functions that mutate one or more of their arguments with a trailing !. Currently, closures (i.e. inner functions) have some performance issues, so you'll want such a helper function to be defined at the top-level scope instead of inside of another function the way you've got it.
Finally, I assume this is an exercise since Julia ships with carefully tuned generic (i.e. it works for arbitrary numeric types) LU decomposition: http://docs.julialang.org/en/release-0.3/stdlib/linalg/#Base.lu.
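On current Julia versions that built-in factorization lives in the LinearAlgebra standard library; a quick sketch of its use, for comparison with the exercise above (the 0.3 API linked above differs slightly):
using LinearAlgebra

A = rand(4, 4)
F = lu(A)                 # pivoted LU factorization
F.L * F.U ≈ A[F.p, :]     # true: L*U reproduces the row-permuted A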
-
It's quite simple
julia> A = rand(3,4)
3×4 Array{Float64,2}:
0.241426 0.283391 0.201864 0.116797
0.457109 0.138233 0.346372 0.458742
0.0940065 0.358259 0.260923 0.578814
julia> A[[1,2],:] = A[[2,1],:]
2×4 Array{Float64,2}:
0.457109 0.138233 0.346372 0.458742
0.241426 0.283391 0.201864 0.116797
julia> A
3×4 Array{Float64,2}:
0.457109 0.138233 0.346372 0.458742
0.241426 0.283391 0.201864 0.116797
0.0940065 0.358259 0.260923 0.578814

How does the recursion work in heap-sort?

Let's say I have an array A = <1,6,2,7,3,8,4,9,5>
pseudocode for Heapsort:
BUILD-MAX-HEAP(A)
    n = A.heapsize = A.length
    for i = floor(n/2) down to 1
        MAX-HEAPIFY(A,i)

MAX-HEAPIFY(A,i)
    n = A.heap-size
    l = LEFT(i)
    r = RIGHT(i)
    largest = i
    if l <= n and A[l] > A[i]
        largest = l
    if r <= n and A[r] > A[largest]
        largest = r
    if largest != i
        exchange A[i] <-> A[largest]
        MAX-HEAPIFY(A,largest)
I know BUILD-MAX-HEAP will first call MAX-HEAPIFY(A,4), which will exchange 7 and 9; then, after MAX-HEAPIFY(A,3), it will switch 8 and 2. Then it will call MAX-HEAPIFY(A,2), and this is where I get confused. This is how the heap looks when MAX-HEAPIFY(A,2) is called.
The first thing that will happen is that 6 and 7 will be exchanged, then it will call MAX-HEAPIFY(A,4) (because 4 is now largest) and exchange 6 and 9, then it will call MAX-HEAPIFY(A,8), but nothing will happen because you've reached a leaf, so it returns to the function that called it.
MAX-HEAPIFY(A,8) was called by MAX-HEAPIFY(A,4), so it returns to it.
MAX-HEAPIFY(A,4) was called by MAX-HEAPIFY(A,2), so it returns to it.
But now A[2] < A[4] (because 7 < 9), and it is at this point that I wonder how it knows to call MAX-HEAPIFY(A,2) again to exchange 7 and 9. When a recursive function (or subroutine) returns to the one that called it, there is no more code to be executed (since MAX-HEAPIFY only calls MAX-HEAPIFY at the end of the function), so it will return back up the recursion stack, and in my mind it feels like 7 will still be the parent of 9.
Sorry if this is confusing, but can somebody walk me through this to help me understand how exactly this is recursively max-heapifying itself?
The following is the series of steps I get when following your algorithm (note the levels of indent when we recurse at the end of each). Every time we exit the function, we just return to the main program (calling max_heapify with the numbers 4 down to 1). I'm not positive where your interpretation is off, but I'm hoping the following makes it clearer.
for i in (4,3,2,1):
    MAX-HEAPIFY(A,i)

MAX-HEAPIFY(A,4):
    largest=4 # initialized
    l=8
    r=9
    largest=8 # calculated
    swap A[4] and A[8]:
        A = <1,6,2,9,3,8,4,7,5>
    MAX-HEAPIFY(A, 8):
        largest=8 # initialized
        l=16
        r=17
        ...return
    ...return
MAX-HEAPIFY(A,3):
    largest=3 # initialized
    l=6
    r=7
    largest=6 # calculated
    swap A[3] and A[6]:
        A = <1,6,8,9,3,2,4,7,5>
    MAX-HEAPIFY(A, 6):
        largest=6
        l=12
        r=13
        ...return
    ...return
MAX-HEAPIFY(A,2):
    largest=2 # initialized
    l=4
    r=5
    largest=4 # calculated
    swap A[2] and A[4]:
        A = <1,9,8,6,3,2,4,7,5>
    MAX-HEAPIFY(A, 4):
        largest=4 # initialized
        l=8
        r=9
        largest=8
        swap A[4] and A[8]:
            A = <1,9,8,7,3,2,4,6,5>
        MAX-HEAPIFY(A, 8):
            largest=8 # initialized
            l=16
            r=17
            ...return
        ...return
    ...return
MAX-HEAPIFY(A,1):
    largest=1 # initialized
    l=2
    r=3
    largest=2 # calculated
    swap A[1] and A[2]:
        A = <9,1,8,7,3,2,4,6,5>
    MAX-HEAPIFY(A, 2):
        largest=2 # initialized
        l=4
        r=5
        largest=4 # calculated
        swap A[2] and A[4]:
            A = <9,7,8,1,3,2,4,6,5>
        MAX-HEAPIFY(A, 4):
            largest=4 # initialized
            l=8
            r=9
            largest=8 # calculated
            swap A[4] and A[8]:
                A = <9,7,8,6,3,2,4,1,5>
            MAX-HEAPIFY(A, 8):
                largest=8 # initialized
                l=16
                r=17
                ...return
            ...return
        ...return
    ...return

Done!
A = <9,7,8,6,3,2,4,1,5>
I then went so far as to translate your algorithm (almost directly) into Python (note I had to make a few tweaks for Python's 0-based indexing):
def build_max_heap(A):
    for i in range(len(A)//2, 0, -1):
        max_heapify(A, i)

def left(x):
    return 2 * x

def right(x):
    return 2 * x + 1

def max_heapify(A, i):
    n = len(A)
    largest = i
    l = left(i)
    r = right(i)
    if l <= n and A[l-1] > A[i-1]:
        largest = l
    if r <= n and A[r-1] > A[largest-1]:
        largest = r
    if largest != i:
        A[i-1], A[largest-1] = A[largest-1], A[i-1]
        max_heapify(A, largest)

if __name__ == '__main__':
    A = [1,6,2,7,3,8,4,9,5]
    build_max_heap(A)  # modifies in-place
    print(A)
This prints:
[9, 7, 8, 6, 3, 2, 4, 1, 5]
(which agrees with our manual iterations)
...and for one more check, using python's heapq module with its private method _heapify_max:
import heapq
A = [1,6,2,7,3,8,4,9,5]
heapq._heapify_max(A)
print(A)
...prints the same:
[9, 7, 8, 6, 3, 2, 4, 1, 5]
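Since the rest of this page is about Julia, here is a direct port of the same algorithm to Julia as well (a sketch; Julia's 1-based indexing matches the pseudocode, so no index shifting is needed):
left(i) = 2i
right(i) = 2i + 1

function max_heapify!(A, i, n = length(A))
    largest = i
    l, r = left(i), right(i)
    if l <= n && A[l] > A[largest]
        largest = l
    end
    if r <= n && A[r] > A[largest]
        largest = r
    end
    if largest != i
        A[i], A[largest] = A[largest], A[i]   # swap, then sift the displaced value down
        max_heapify!(A, largest, n)
    end
end

function build_max_heap!(A)
    for i in div(length(A), 2):-1:1
        max_heapify!(A, i)
    end
    return A
end

build_max_heap!([1,6,2,7,3,8,4,9,5])   # => [9, 7, 8, 6, 3, 2, 4, 1, 5]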
