I am trying to parallelize a fortran loop via OpenMP. The loop in essence consists of only two commands:
do i=1,LSample
calcSslice(Vpot(:,:,i), Sslice)
rpold = rp
combine_rp_matrices (rpold, Sslice, rp)
end do
the calcSslice subroutine reads Vpot(:,:,i), performs some calculations and stores the results in the matrix Sslice. the combine_rp_matrices uses rpold and Sslice to update rp. rp acts as a running variable and is the desired output of the program. The order in which the Sslice-matrices from different iterations are combined with rp is irrelevant. My first attempt at parallelizing this loop looked like this:
!$OMP PARALLEL DO DEFAULT(SHARED), PRIVATE(Sslice), SCHEDULE(DYNAMIC)
do i=1,LSample
calcSslice(Vpot(:,:,i), Sslice)
!$OMP CRITICAL
rpold = rp
combine_rp_matrices (rpold, Sslice, rp)
!$OMP END CRITICAL
end do
!$OMP END PARALLEL DO
This compiles and runs, but produces wrong results. Using the following code I get correct results but much slower execution (albeit still faster than serialized code):
!$OMP PARALLEL DO DEFAULT(SHARED), PRIVATE(Sslice), SCHEDULE(DYNAMIC)
do i=1,LSample
!$OMP CRITICAL(Crit2)
calcSslice(Vpot(:,:,i), Sslice)
!$OMP END CRITICAL(Crit2)
!$OMP CRITICAL
rpold = rp
combine_rp_matrices (rpold, Sslice, rp)
!$OMP END CRITICAL
end do
!$OMP END PARALLEL DO
So there is apparently some synchronization issue with calcSslice. However, I don't quite understand where this would occur. Vpot is only read from and not written to in calcSslice and Sslice is a threadprivate variable. Any global variables used in calcSslice are also only read from. The variables rpold and rp are declared in the scope of the subroutine which the DO loop is part of, and so cannot be accessed by calcSslice. The variables declared in calcSslice use the following attributes: intent(in), intent(out), target, pointer.
Where does this go wrong?
EDIT: The problem is solved, cause was the initialization of variables in calcSslice during declaration, which implies the save attribute.
My guess would be that calcSslice is not threadsafe. Make sure this subroutine does not access global variables other than read-only and do not use the save attribute (beware the implicit save if you initialize variables during declaration!).
You could use a threadchecker like the one provided by Intel to find race-conditions in your code. If you have no access to such software, I would start with a dummy procedure and then populate the routine incrementaly to see where it fails.
Another thing that puzzles me is the last two lines of the loop body. Every thread backups the whole matrix and then adds his slice. Wouldn't it be better to collect all slices (e.g. by a reduction clause) and then combine that large slice once?
Related
As a novice to Julia, I, like many others, am perplexed by the fact that loops in Julia create their own local scope (but not on the REPL nor within functions). There is much discussion online about this topic, but most of the questions here are about the particulars of this behaviour, such as why doing a=1 inside the loop doesn't affect variable a outside the loop, but a[1]=1 works. I get how it works now, for the most part.
My question is why was Julia implemented with this behaviour. Is there a benefit to this from the prespective of the user? I cannot think of one. Or was it necessary for some technical reason?
I appologise if this has been asked already, but all the questions and answers I've seen so far were about how this works and how to deal with it, but I am curious about WHY Julia was implemented this way.
Firstly, loops in Julia only introduce a new scope of the sort that hides variables existing outside the loop (as per your complaint) if the scope outside the loop is global scope. So, for instance
function foo()
a = 0
# Loop does not hide existing variable `a`, will work just fine
for i = 100
a += i^2
end
return a
end
julia> foo()
10000
in other words
# Anywhere other than global scope
a = 0
for i = 100
a += i^2
end
a == 10000 # TRUE
This is because in Julia, as in many many other languages, global scope may be considered harmful. At the very least, a for loop with global scope would encounter significant performance penalties. For instance, consider the following:
julia> a = 0
0
julia> #time for i=1:100
# Technically this "global" keyword is superfluous since we're running this at the repl, but doesn't hurt to be explicit
global a += rand()^2
end
0.000022 seconds (200 allocations: 3.125 KiB)
julia> function bar()
a = 0
for i=1:100
a += rand()^2
end
return a
end
bar (generic function with 1 method)
julia> #time bar()
0.000002 seconds
33.21364180865362
Note the massive difference in allocations (the bottom version has zero) and the ~10x time difference.
Now, you may have noticed I used a special keyword global there in the global example, but since this was being run in the REPL, that doesn't actually do anything other than make it explicit what is happening.
That brings us to the other significant difference you have noticed: when run in the REPL, for loops appear not to introduce a new scope, even though the REPL is certainly global scope. This is because it turns out to be a huge pain when debugging to have to add a bunch of global qualifiers to code you have copy-pasted from somewhere deeper in your program (say within a function, where loops do not hide outside variables). So for the sake of convenience when debugging, the REPL effectively adds those global keywords for you, making the presumption that if you cared about performance you wouldn't just be pasting raw loops into the REPL, and if you are just pasting raw loops into the REPL, you're probably debugging or something.
In a script, however, it is presumed that you do care about performance, so you will get an error if you try to use a global variable within a loop without explicitly declaring it as such.
The details are substantially more complicated, as the other answer explains in more technically correct terms. Some of this complication, as far as I know, is due to a reversal on the decision of whether or not global variables should or should not be accessible by default within a loop in the REPL that happened around the time of Julia v0.7.
for loops in Julia introduce a so called local (soft) scope, see https://docs.julialang.org/en/v1/manual/variables-and-scoping/#man-scope-table.
The rules for local (soft) scope are (quoting):
If x is not already a local variable and all of the scope constructs containing the assignment are soft scopes (loops, try/catch blocks, or struct blocks), the behavior depends on whether the global variable x is defined:
if global x is undefined, a new local named x is created in the scope of the assignment;
if global x is defined, the assignment is considered ambiguous:
in non-interactive contexts (files, eval), an ambiguity warning is printed and a new local is created;
in interactive contexts (REPL, notebooks), the global variable x is assigned.
So your statement:
why doing a=1 inside the loop doesn't affect variable a outside the loop
is only true in non-interactive contexts if the for loop is not inside a hard local scope (typically if for loop is in a global scope), and the variable you assign to is defined in global scope. However, you will get a warning then.
Now the crucial part of your question is I think:
My question is why was Julia implemented with this behaviour. Is there a benefit to this from the prespective of the user?
The answer is that for loop creates a new binding for a variable that is defined within its scope. To see the consequence consider the following code (I assume that variable x is not defined in enclosing scope so that x is defined in local scope):
julia> v = []
Any[]
julia> for i in 1:2
x = i
push!(v, () -> x)
end
julia> v[1]()
1
julia> v[2]()
2
Whe have created two anonymous functions and all works as you probably expected.
Now let us check what would happen in Python:
>>> v = []
>>> for i in range(1, 3):
... x = i
... v.append(lambda: x)
...
>>> v[0]()
2
>>> v[1]()
2
The result might surprise you. Both anonymous functions return 2. This is a consequence of not creating a local variable with a new binding in each iteration of the loop.
However, if in Julia you were working in REPL and x were defined in global scope you would get:
julia> x = 0
0
julia> v = []
Any[]
julia> for i in 1:2
x = i
push!(v, () -> x)
end
julia> v[1]()
2
julia> v[2]()
2
just like in Python.
The other consideration, as explained in the other answer is performance. But most likely performance critical code is written inside a function anyway, and the discussed performance considerations are only relevant in global scope.
EDIT
This is a design choice of Matlab, quoting from https://research.wmz.ninja/articles/2017/05/closures-in-matlab.html:
When an anonymous function is created, the immediate values of the referenced local variables will be captured. Hence if any changes to the referenced local variables made after the creation of this anonymous function will not affect this anonymous function.
So as you can see in Matlab there is a difference of anonymous function vs. a closure, which does something different:
When a nested function is created, the immediate values of the referenced local variables will not be captured. When the nested function is called, it will use the current values of the referenced local variables.
In Julia there is no such difference as you can see in the examples above.
And quoting the documentation of Matlab https://www.mathworks.com/help/matlab/matlab_prog/anonymous-functions.html:
Because a, b, and c are available at the time you create parabola, the function handle includes those values. The values persist within the function handle even if you clear the variables:
(but I think it is not as explicit as the explanation I linked above)
I have some Julia functions that are several hundreds of lines long that I would like to profile so that I can then work on optimizing the code.
I am aware of the BenchmarkTools package which allows the overall execution time and memory consumption of a function to be measured using #btime or #benchmark. But those functions tell me nothing about where inside the functions the bottlenecks are. So my first step would have to be using some tool to identify which parts of the code are slow.
In Matlab for instance there is a very nice built-in profiler which runs a script/function and then reports the time spent on every line of the code. Similarly in Python there is a module called line_profiler which can produce a line-by-line report showing how much time was spent on every single line of a function.
What I’m looking for is simply a line-by-line report showing the total time spent on each line of code and how many times a particular piece of code was called.
Is there such a functionality in Julia? Either built-in or via some third-party package.
There is a Profiling chapter in Julia docs with all the necessary info.
Also, you can use ProfileView.jl or similar packages for visual exploration of the profiled code.
And, not exactly profiling, but very useful in practice package is TimerOutputs.jl
UPD: Since Julia is a compiling language it makes no sense to measure timing of individual lines, since the actual code that is executed can be very different from what is written in Julia.
For example following julia code
function f()
x = 0
for i in 0:100_000_000
x += i
end
x
end
is lowered to
julia> #code_llvm f()
; # REPL[8]:1 within `f'
define i64 #julia_f_594() {
top:
; # REPL[8]:7 within `f'
ret i64 5000000050000000
}
I.e. there is no loop at all. This is why instead of execution time proxy metric of how often a line appears in the set of all backtraces is used. Of course, it is not the same as the execution time, but it gives a good approximation of where the bottleneck is because lines with long execution time appear in backtraces more often.
OwnTime.jl. Doesn't do call counts though, but it should be easy to add.
I am often confronted with the following situation when I debug my Julia code:
I suspect that a certain variable (often a large matrix) deep inside my code is not what I intended it to be and I want to have a closer look at it. Ideally, I want to have access to it in the REPL so I can play around with it.
What is the best practice to get access to variables several function layers deep without passing them up the chain, i.e. changing the function returns?
Example:
function multiply(u)
v = 2*u
w = subtract(v)
return w
end
function subtract(x)
i = x-5
t = 10
return i-3t
end
multiply(10)
If I run multiply() and suspect that the intermediate variable i is not what I assume it should be, how would I gain access to it in the REPL?
I know that I could just write a test function and test that i has the intended properties right inside subtract(), but sometimes it would just be quicker to use the REPL.
This is the same in any programming language. You can use debugging tools like ASTInterpreter2 (which has good Juno integration) to step through your code and have an interactive REPL in the current environment, or you can use println debugging where you run the code with #show commands in there to print out values.
I have the following code:
mutable struct T
a::Vector{Int}
end
N = 100
len = 10
Ts = [T(fill(i, len)) for i in 1:N]
for i = 1:N
destructive_function!(Ts[i])
end
destructive_function! changes a in each element of Ts.
Is there any way to parallelize the for loop? SharedArray does not seem available for user-defined types.
In this case it would be simpler to use threads rather than processes to parallelize the loop
Threads.#threads for i = 1:N
destructive_function!(Ts[i])
end
Just make sure that you set JULIA_NUM_THREADS environment variable before starting Julia. You can check number of threads running with Threads.nthreads() function.
If you want to use processes rather than threads then probably it is simplest to make destructive_function! return the modified value and then run pmap(destructive_function!, Ts) to collect the modified values in a new array (but this would not modify the Ts in place).
This may be a specific question, but I think it pertains to how memory is handled with these two compilers (Compaq visual Fortran Optimizing Compiler Version 6.5 and minGW). I am trying to get an idea of best practices with using pointers in Fortran 90 (which I must use). Here is an example code, which should work "out of the box" with one warning from a gfortran compiler: "POINTER valued function appears on RHS of assignment", and no warnings from the other compiler.
module vectorField_mod
implicit none
type vecField1D
private
real(8),dimension(:),pointer :: x
logical :: TFx = .false.
end type
contains
subroutine setX(this,x)
implicit none
type(vecField1D),intent(inout) :: this
real(8),dimension(:),target :: x
logical,save :: first_entry = .true.
if (first_entry) nullify(this%x); first_entry = .false.
if (associated(this%x)) deallocate(this%x)
allocate(this%x(size(x)))
this%x = x
this%TFx = .true.
end subroutine
function getX(this) result(res)
implicit none
real(8),dimension(:),pointer :: res
type(vecField1D),intent(in) :: this
logical,save :: first_entry = .true.
if (first_entry) nullify(res); first_entry = .false.
if (associated(res)) deallocate(res)
allocate(res(size(this%x)))
if (this%TFx) then
res = this%x
endif
end function
end module
program test
use vectorField_mod
implicit none
integer,parameter :: Nx = 15000
integer :: i
real(8),dimension(Nx) :: f
type(vecField1D) :: f1
do i=1,10**4
f = i
call setX(f1,f)
f = getX(f1)
call setX(f1,f)
if (mod(i,5000).eq.1) then
write(*,*) 'i = ',i,f(1)
endif
enddo
end program
This program runs in both compilers. However, changing the loop from 10**4 to 10**5 causes a serious memory problem with gfortran.
Using CTR-ALT-DLT, and opening "performance", the physical memory increases rapidly when running in gfortran, and doesn't seem to move for the compaq compiler. I usually cancel before my computer crashes, so I'm not sure of the behavior after it reaches the maximum.
This doesn't seem to be the appropriate way to use pointers (which I need in derived data types). So my question is: how can I safely use pointers while maintaining the same sort of interface and functionality?
p.s. I know that the main program does not seem to do anything constructive, but the point is that I don't think that the loop should be limited by the memory, but rather it should be a function of run-time.
Any help is greatly appreciated.
This code has a few problems, perhaps caused by misunderstandings around the language. These problems have nothing to do with the specific compiler - the code itself is broken.
Conceptually, note that:
There is one and only one instance of a saved variable in a procedure per program in Fortran 90.
The variable representing the function result always starts off undefined each time a function is called.
If you want a pointer in a calling scope to point at the result of a function with a pointer result, then you must use pointer assignment.
If a pointer is allocated, you need to have a matching deallocate.
There is a latent logic error, in that the saved first_entry variables in the getX and setX procedures are conflated with object specific state in the setX procedure and procedure instance specific state in the getX procedure.
The first time setX is ever called the x pointer component of the particular this object will be nullified due to the if statement (there's an issue of poor style there too - be careful having multiple statements after an if statement - it is only the first one that is subject to the conditional!). If setX is then called again with a different this, first_entry will have been set to false and the this object will not be correctly set-up. I suspect you are supposed to be testing this%TFX instead.
Similarly, the first time getX is called the otherwise undefined function result variable res will be nullified. However, in all subsequent calls the function result will not be nullified (the function result starts off undefined each execution of the function) and will then be erroneously used in an associated test and also perhaps erroneously in a deallocate statement. (It is illegal to call associated (or deallocate for that matter) on a pointer with an undefined association status - noting that an undefined association status is not the same thing as dissociated.)
getX returns a pointer result - one that is created by the pointer being allocated. This pointer is then lost because "normal" assignment is used to access the value that results from evaluating the function. Because this pointer is lost there can't be (and so there isn't...) a matching deallocate statement to reverse the pointer allocation. The program therefore leaks memory. What almost certainly should be happening is that the thing that captures the value of the getX function in the main program (f in this case, but f is used for multiple things, so I'll call it f_ptr...) itself should be a pointer, and it should be pointer assigned - f_ptr => getX(f1). After the value of f_ptr has been used in the subsequent setX call and write statement, it can then be explicitly deallocated.
The potential for accidental use of normal assignment when pointer assignment is intended is one of the reasons that use of functions with pointer results is discouraged. If you need to return a pointer - then use a subroutine.
Fortran 95 simplifies management of pointer components by allowing default initialization of those components to NULL. (Note that you are using default initialization in your type definition - so your code isn't Fortran 90 anyway!)
Fortran 2003 (or Fortran 95 + the allocatable TR - which is a language level supported by most maintained compilers) introduces allocatable function results - which remove many of the potential errors that can otherwise be made using pointer functions.
Fortran 95 + allocatable TR support is so ubiquitous these days and the language improvements and fixes made to that point are so useful that (unless you are operating on some sort of obscure platform) limiting the language level to Fortran 90 is frankly ridiculous.