Dear all, I tried to find an answer by googling, but I haven't been able to find one.
I'm using FFTW in an MPI Fortran application and I need to compute forward and backward transforms of a 3D array of tensors, component by component, and, while in Fourier space, compute some complex tensorial quantities.
In order to make the array used by FFTW directly usable, and to avoid spending a lot of time moving data from one array to another, the option that came to my mind was to declare a 5-dimensional array, i.e.:
use, intrinsic :: iso_c_binding
include 'mpif.h'
include 'fftw3-mpi.f03'
integer(C_INTPTR_T), parameter :: FFTDIM=3 !fft dimension
integer(C_INTPTR_T) :: fft_L !x direction
integer(C_INTPTR_T) :: fft_M !y direction
integer(C_INTPTR_T) :: fft_N !z direction
complex(C_DOUBLE_COMPLEX), pointer :: fft_in(:,:,:,:,:), fft_out(:,:,:,:,:)
type(C_PTR) :: fft_plan_fwd, fft_plan_bkw, fft_datapointer
integer(C_INTPTR_T) :: fft_alloc_local, fft_local_n0, fft_local_0_start
call MPI_INIT( mpi_err )
call MPI_COMM_RANK( MPI_COMM_WORLD, mpi_rank, mpi_err )
call MPI_COMM_SIZE( MPI_COMM_WORLD, mpi_size, mpi_err )
call fftw_mpi_init
fft_L=problem_dim(1)
fft_M=problem_dim(2)
fft_N=problem_dim(3)
! CALCULATE LOCAL SIZE OF FFT VARIABLE FOR EACH COMPONENT
fft_alloc_local = fftw_mpi_local_size_3d(fft_N,fft_M,fft_L, MPI_COMM_WORLD, &
fft_local_n0, fft_local_0_start)
! allocate data pointer
fft_datapointer = fftw_alloc_complex(9*int(fft_alloc_local,C_SIZE_T))
! link pointers to the same array
call c_f_pointer(fft_datapointer, fft_in, [ FFTDIM, FFTDIM, fft_L, fft_M, fft_local_n0])
call c_f_pointer(fft_datapointer, fft_out, [ FFTDIM, FFTDIM, fft_L, fft_M, fft_local_n0])
! create plans
fft_plan_fwd = fftw_mpi_plan_dft_3d(fft_N, fft_M, fft_L, & !dimensions
fft_in(1,1,:,:,:), fft_out(1,1,:,:,:), & !input, output
MPI_COMM_WORLD, FFTW_FORWARD, FFTW_MEASURE)
fft_plan_bkw = fftw_mpi_plan_dft_3d(fft_N, fft_M, fft_L, & !dimensions
fft_in(1,1,:,:,:), fft_out(1,1,:,:,:), & !input, output
MPI_COMM_WORLD, FFTW_BACKWARD, FFTW_MEASURE)
Now, if I use this piece of code and the number of processes is a power of two (2, 4, 8, ...), everything works fine, but if I use, for instance, 6 processes, the application gives an error. How could I solve this issue?
Do you have any better strategy than allocating a 5D array, one that avoids moving too much data?
Thanks in advance
Andrea
I found the solution to this problem using the fftw_mpi_plan_many interface.
The code performing this computation follows. It calculates a 3D (LxMxN) complex-to-complex transform of a tensor, component by component (11, 12, ...), using the MPI capabilities. The extent of the third dimension (N) must be divisible by the number of cores used.
program test_fftw
use, intrinsic :: iso_c_binding
implicit none
include 'mpif.h'
include 'fftw3-mpi.f03'
integer(C_INTPTR_T) :: L = 8 ! extent of x data
integer(C_INTPTR_T) :: M = 8 ! extent of y data
integer(C_INTPTR_T) :: N = 192 ! extent of z data
integer(C_INTPTR_T) :: FFT_12_DIM = 3 ! tensor dimension
integer(C_INTPTR_T) :: ll, mm, nn, i, j
complex(C_DOUBLE_COMPLEX) :: fout
! many plan data variables
integer(C_INTPTR_T) :: howmany=9 ! number of elements of the tensor
integer :: rank=3 ! rank of the transform
integer(C_INTPTR_T), dimension(3) :: fft_dims ! array containing data extent
integer(C_INTPTR_T) :: alloc_local_many, fft_local_n0, fft_local_0_start
complex(C_DOUBLE_COMPLEX), pointer :: fft_data(:,:,:,:,:)
type(C_PTR) :: fft_datapointer, plan_many
integer :: ierr, myid, nproc
! Initialize
call mpi_init(ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
call fftw_mpi_init()
! write data dimensions in reversed order
fft_dims(3) = L
fft_dims(2) = M
fft_dims(1) = N
! use of alloc many
alloc_local_many = fftw_mpi_local_size_many(rank, & ! rank of the transform, in this case 3
fft_dims, & ! array containing data dimensions in reversed order
howmany, & ! number of transforms to compute, in this case 3x3=9
FFTW_MPI_DEFAULT_BLOCK, & ! default block size
MPI_COMM_WORLD, & ! mpi communicator
fft_local_n0, & ! local number of slices per core
fft_local_0_start) ! local shift on the last dimension
fft_datapointer = fftw_alloc_complex(alloc_local_many) ! allocate aligned memory for the data
! associate data pointer with allocated memory: note natural order
call c_f_pointer(fft_datapointer, fft_data, [FFT_12_DIM,FFT_12_DIM,L,M, fft_local_n0])
! create the plan for many inplace multidimensional transform
plan_many = fftw_mpi_plan_many_dft( &
rank , fft_dims, howmany, &
FFTW_MPI_DEFAULT_BLOCK, FFTW_MPI_DEFAULT_BLOCK, &
fft_data, fft_data, &
MPI_COMM_WORLD, FFTW_FORWARD, FFTW_ESTIMATE )
! initialize data to some function my_function(i,j)
do nn = 1, fft_local_n0
  do mm = 1, M
    do ll = 1, L
      do i = 1, FFT_12_DIM
        do j = 1, FFT_12_DIM
          fout = ll*mm*nn*i*j
          fft_data(i,j,ll,mm,nn) = fout
        end do
      end do
    end do
  end do
end do
call fftw_mpi_execute_dft(plan_many, fft_data, fft_data)
call fftw_destroy_plan(plan_many)
call fftw_mpi_cleanup()
call fftw_free(fft_datapointer)
call mpi_finalize(ierr)
end program test_fftw
Thanks everyone for the help!
I've written a rudimentary algorithm in Fortran 95 to calculate the gradient of a function (an example of which is prescribed in the code) using central differences augmented with a procedure known as Richardson extrapolation.
function f(n,x)
! The scalar multivariable function to be differentiated
integer :: n
real(kind = kind(1d0)) :: x(n), f
f = x(1)**5.d0 + cos(x(2)) + log(x(3)) - sqrt(x(4))
end function f
!=====!
!=====!
!=====!
program gradient
!==============================================================================!
! Calculates the gradient of the scalar function f at x=0 using a finite !
! difference approximation, with a low order Richardson extrapolation. !
!==============================================================================!
parameter (n = 4, M = 25)
real(kind = kind(1d0)) :: x(n), xhup(n), xhdown(n), d(M), r(M), dfdxi, h0, h, gradf(n)
h0 = 1.d0
x = 3.d0
! Loop through each component of the vector x and calculate the appropriate
! derivative
do i = 1,n
! Reset step size
h = h0
! Carry out M successive central difference approximations of the derivative
do j = 1,M
xhup = x
xhdown = x
xhup(i) = xhup(i) + h
xhdown(i) = xhdown(i) - h
d(j) = ( f(n,xhup) - f(n,xhdown) ) / (2.d0*h)
h = h / 2.d0
end do
r = 0.d0
do k = 3,M
r(k) = ( 64.d0*d(k) - 20.d0*d(k-1) + d(k-2) ) / 45.d0
if ( abs(r(k) - r(k-1)) < 0.0001d0 ) then
dfdxi = r(k)
exit
end if
end do
gradf(i) = dfdxi
end do
! Print out the gradient
write(*,*) " "
write(*,*) " Grad(f(x)) = "
write(*,*) " "
do i = 1,n
write(*,*) gradf(i)
end do
end program gradient
In single precision it runs fine and gives me decent results. But when I try to change to double precision as shown in the code, I get an error when trying to compile claiming that the assignment statement
d(j) = ( f(n,xhup) - f(n,xhdown) ) / (2.d0*h)
is producing a type mismatch real(4)/real(8). I have tried several different declarations of double precision, appended every appropriate double precision constant in the code with d0, and I get the same error every time. I'm a little stumped as to how the function f is possibly producing a single precision number.
The problem is that f is not explicitly declared in your main program; therefore it is implicitly assumed to be of single precision, which is the type real(4) for gfortran.
I completely agree with the comment of High Performance Mark that you really should use implicit none in all your Fortran code, to make sure all objects are explicitly declared. That way, you would have obtained a more appropriate error message about f not being explicitly declared.
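For reference, a minimal sketch of the immediate fix, if f stays an external function rather than being moved into a module, is to declare the result kind of f in the main program and add implicit none so that such omissions become compile-time errors:

program gradient
  implicit none
  integer, parameter :: n = 4, M = 25
  integer :: i, j, k
  real(kind = kind(1d0)) :: x(n), xhup(n), xhdown(n), d(M), r(M), dfdxi, h0, h, gradf(n)
  real(kind = kind(1d0)), external :: f   ! explicit result kind for the external function f
  ! ... the rest of the original program is unchanged ...
end program gradient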
Also, you could consider two more things:
Define your function within a module and import that module in the main program. It is a good practice to define all subroutines/functions within modules only, so that the compiler can make extra checks on number and type of the arguments, when you invoke the function.
You could (again in a module) introduce a constant for the precision and use it everywhere the kind of a real must be specified. Taking the example below, by changing only the line
integer, parameter :: dp = kind(1.0d0)
into
integer, parameter :: dp = kind(1.0)
you would change all your real variables from double to single precision. Also note the _dp suffix for the literal constants instead of the d0 suffix, which would automatically adjust their precision as well.
module accuracy
implicit none
integer, parameter :: dp = kind(1.0d0)
end module accuracy
module myfunc
use accuracy
implicit none
contains
function f(n,x)
integer :: n
real(dp) :: x(n), f
f = 0.5_dp * x(1)**5 + cos(x(2)) + log(x(3)) - sqrt(x(4))
end function f
end module myfunc
program gradient
use myfunc
implicit none
real(dp) :: x(n), xhup(n), xhdown(n), d(M), r(M), dfdxi, h0, h, gradf(n)
:
end program gradient
I have a vector whose length can go up to about a few million or more.
If I say the vector is vec = [a1 a2 ...b1 b2 ... c1 c2 ...d1 d2 ...]
I need to rearrange vec to new_vec = [a1 b1 c1 d1 a2 b2 c2 d2 ...] .
If viewed as a matrix of column vectors, then this could be viewed as a transpose, but I do not have a two dimensional vector. I understand how to do it on a sequential computer. That is very simple.
But I am not sure how to do this on a multiple processor cluster or on a GPU, or even if this would be feasible on parallel machines. Memory seems to be the obvious bottleneck. If there are any algorithms or any architecture specific optimizations that I can use, please let me know.
Edit: More information below.
The code structure is:
subroutine reorder(vec,parameter)
type(param) :: parameter !just a struct holding certain constant parameters
real(kind = 8), dimension(parameter%length), intent(inout) :: vec
real(kind = 8), dimension(parameter%length) :: temp
integer :: i,j,k,q1,q2,q3,nn1,n1,n2,nn2,i1,i2,i3
i1 = parameter%len1 !lengths of sub-vectors in each direction
i2 = parameter%len2 !the multiplication of the 3 gives the
!overall length of vec
i3 = parameter%len3
temp = vec
n1 = i2*i1
n2 = i2*i3
do k = 1, i3
q1 = n1*(k-1)
q2 = i2*(k-1)
do j = 1, i2
q3 = i1*(j-1)
do i = 1, i1
nn1 = q1+q3+i
nn2 = q2+j+n2*(i-1)
vec(nn2) = temp(nn1)
end do
end do
end do
end subroutine reorder
The code therefore aims to reorder the elements of the vector according to this rule. As you can see, as the length of the vector becomes very large, a significant amount of time is spent in this routine.
This routine is called in a main routine multiple times. A cartesian decomposition produces a cartesian 3D arrangement of ranks in the beginning, and each rank calls this subroutine when a reordering of the elements is required before its own next subroutine call. The cartesian communicator subroutine is shown in the skeleton below:
subroutine cartesian_comm(ndim,comm_cart,comm_one_d,coord_cart)
use mpi
implicit none
integer, dimension(:), intent(in) :: ndim
integer, intent(out) :: comm_cart
integer, dimension(:), pointer :: comm_one_d, coord_cart
logical, dimension(size(ndim)) :: period, remain
integer :: dim,code, i, rank
!creating the cartesian communicator
dim = 3
allocate(comm_one_d(dim),coord_cart(dim))
period = .FALSE.
call MPI_CART_CREATE(MPI_COMM_WORLD, dim, ndim, period, .FALSE., comm_cart, code)
call MPI_COMM_RANK(comm_cart, rank, code)
call MPI_CART_COORDS(comm_cart, rank, dim, coord_cart, code)
!Creating sub-communicator for each direction
do i = 1, dim
remain = .FALSE.
remain(i) = .TRUE.
call MPI_CART_SUB(comm_cart, remain, comm_one_d(i), code)
end do
end subroutine cartesian_comm
And it is called in the main function as follows:
Program main
!initialize some stuff and intialize all the required variables
! ndim is the number of processes the program is called
! with "mpirun -np 8 ./exec" would mean ndim is cuberoot of 8,
! and therefore 2 for the 3D case. It is always made sure that
! np is a cube(or square for 2D) while calling the program
call cartesian_comm(ndim,comm_cart,comm_one_d,coord_cart)
do while (t<tend-1D-8) !start time loop
t = t + dt
!do some computations get the vector "vec" for
!each rank separately (different and independent in each rank)
call reorder(vec,parameter) ! all ranks call this subroutine
!do some computations here with the new reordered vec
end do !end time loop
!do other stuff (unrelated to reorder but use the "vec" vector) here
end Program main
I would like to know if there is a better way to do this in a multiprocessor cluster, or how I could proceed in the case of a GPU.
I wrote a recursive program in Fortran to calculate the combinations of npoints in ndim dimensions, as follows. I first wrote this program in MATLAB, where it ran perfectly. But in Fortran, my problem is that after the first iteration it assigns absurd values to the list of points, with no explanation. Could somebody give me a hand?
PROGRAM MAIN
IMPLICIT NONE
INTEGER :: ndim, k, npontos, contador,i,iterate, TEST
integer, dimension(:), allocatable :: pontos
print*, ' '
print*, 'npoints?'
read *, npontos
print*, 'ndim?'
read *, ndim
k=1
contador = 1
open(450,file= 'combination.out',form='formatted',status='unknown')
write(450,100) 'Comb ','stat ',(' pt ',i,' ',i=1,ndim)
write(450,120) ('XXXXXXXXXX ',i=1,ndim+1)
allocate(pontos(ndim))
do i=1,4
pontos(i)=i
end do
TEST = iterate(pontos, ndim, npontos,k,contador)
end program MAIN
recursive integer function iterate(pontos, ndim, npontos, k,contador)
implicit NONE
integer, intent(in) :: ndim, k, npontos
integer,dimension(:) :: pontos
integer contador,inic,i,j,m
if (k.eq.ndim) then
inic=pontos(ndim)
do i = pontos(ndim),npontos
pontos(k)= i
write(*,*) pontos(:)
contador=contador+1
end do
pontos(ndim)= inic + 1
else
inic = pontos (k)
do j = pontos(k),(npontos-ndim+k)
pontos(k)=j
pontos= iterate(pontos, ndim, npontos, k+1,contador)
end do
end if
pontos(k)=inic+1;
if (pontos(k).gt.(npontos-ndim+k+1)) then
do m =k+1,ndim
pontos(m)=pontos(m-1)+1
end do
end if
end function iterate
There are too many issues in that code... I stopped debugging it. This is what I got so far, it's too much for a comment.
This doesn't make sense:
pontos= iterate(pontos, ndim, npontos, k+1,contador)
You are changing pontos inside iterate, and never set a return value within the function.
You need to a) provide a result clause for recursive functions (and actually set the result) or b) convert it to a subroutine. Since you are modifying at least one dummy argument, you should go with b); a minimal skeleton is sketched after these notes.
Since you are using assumed-shape dummy arguments, you need to provide an explicit interface to the function/subroutine, most easily by putting it in a module.
Neither format 100 nor format 120 is specified in your code.
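Putting the first two points together, a minimal skeleton might look like this (a sketch only: the module name iterate_mod is just an example, and the body of your original iterate goes where indicated, with the recursive call turned into a call statement):

module iterate_mod
  implicit none
contains
  recursive subroutine iterate(pontos, ndim, npontos, k, contador)
    integer, intent(in)    :: ndim, npontos, k
    integer, intent(inout) :: pontos(:)
    integer, intent(inout) :: contador
    ! ... body of the original iterate goes here;
    !     the recursive invocation becomes
    !     call iterate(pontos, ndim, npontos, k+1, contador)
  end subroutine iterate
end module iterate_mod

The main program then does use iterate_mod and call iterate(pontos, ndim, npontos, k, contador); because the subroutine lives in a module, the caller automatically sees an explicit interface, which the assumed-shape dummy argument requires.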
I have a program I want to parallelize using MPI. I have not worked with MPI before.
The program calculates the behavior of a large number of objects over time. The data of these objects is stored in arrays, e.g. double precision :: body_x(10000) for the x coordinate.
To calculate the behavior of an object, the information about all other objects is needed, so every thread needs to hold all data but will only update a portion of it. But before the new timestep, every thread needs to get the information from all other threads.
As I understand it, MPI_Allgather could be used for this, but it needs a send buffer and a receive buffer. How can I synchronize an array over different threads if each thread updated a different part of the array? Do I have to send the whole array from each thread to the master into a receive buffer, update the specific part of the master's array, and, after all threads have sent their data, re-broadcast from the master?
This is a pretty basic question, but I'm very new to MPI and all examples I found are pretty simple and do not cover this. Thanks for any help.
Pseudo-Example (assuming Fortran-Style vectors with first index 1):
(Yes, the send/receive would better be done non-blocking; this is for the sake of simplicity.)
if (master) then
readInputFile
end if
MPI_Bcast(numberOfObject)
allocate body_arrays(numberOfObjects)
if (master) then
fill body_arrays ! with the data from the input file
end if
MPI_Bcast(body_arrays)
objectsPerThread = numberOfObjects / threadCount
myStart = threadID * objectsPerThread + 1
myEnd = (threadID + 1) * objectsPerThread
do while (t < t_end)
do i = myStart, myEnd
do stuff for body_arrays(i)
end do
! here is the question
if (.not. master)
MPI_Send(body_arrays, toMaster)
else
do i = 1, threadCount - 1
MPI_Recive(body_arrays_recive, senderID)
body_arrays(senderID*objectsPerThread+1 : (senderId+1)*objectsPerThread) = body_arrays_recive(senderID*objectsPerThread+1 : (senderId+1)*objectsPerThread)
end if
MPI_Bcast(body_arrays)
! ----
t = t + dt
end do
It sounds like you want MPI_Allgather. To avoid needing a separate send buffer, you may be able to use the MPI_IN_PLACE value. That tells MPI to use the same buffer for both send and receive.
See http://mpi-forum.org/docs/mpi-2.2/mpi22-report/node99.htm#Node99
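For the uniform chunks sketched in the question (assuming objectsPerThread divides numberOfObjects evenly), the synchronization of one coordinate array could then look roughly like the following sketch; ierr is just the usual error argument:

! each rank has already updated its own slice body_x(myStart:myEnd);
! MPI_IN_PLACE tells MPI_Allgather to take each rank's contribution
! directly from its block of the receive buffer, so no separate send buffer is needed
call MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL, &
                   body_x, objectsPerThread, MPI_DOUBLE_PRECISION, &
                   MPI_COMM_WORLD, ierr)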
The array chunks from all processes can be combined using a call to MPI_Allgatherv. The following is a complete example in Fortran. It defines an array of size 50. Then each process sets a chunk of that array to some complex number. Finally, the call to MPI_allgatherv pulls all the chunks together. The calculations of the chunk sizes, and some of the parameters that need to be passed to MPI_allgatherv are encapsulated in the mpi_split routine.
program test
use mpi
implicit none
integer, parameter :: idp = 8
integer, parameter :: n_tasks = 11
real(idp), parameter :: zero = 0.0d0
complex(idp), parameter :: czero = cmplx(zero, zero, kind=idp)
integer :: mpi_n_procs, mpi_proc_id, error
integer :: i, i_from, i_to
complex(idp) :: c(-5:5)
real(idp) :: split_size
integer, allocatable :: recvcount(:), displs(:)
call MPI_Init(error)
call MPI_Comm_size(MPI_COMM_WORLD, mpi_n_procs, error)
call MPI_Comm_rank(MPI_COMM_WORLD, mpi_proc_id, error)
allocate(recvcount(mpi_n_procs))
allocate(displs(mpi_n_procs))
i_from = -5
i_to = 5
! each process covers only part of the array
call mpi_split(i_from, i_to, counts=recvcount, displs=displs)
write(*,*) "ID", mpi_proc_id,":", i_from, "..", i_to
if (mpi_proc_id == 0) then
write(*,*) "Counts: ", recvcount
write(*,*) "Displs: ", displs
end if
c(:) = czero
do i = i_from, i_to
c(i) = cmplx(real(i, idp), real(i+1, idp), kind=idp)
end do
call MPI_Allgatherv(c(i_from), i_to-i_from+1, MPI_DOUBLE_COMPLEX, c, &
& recvcount, displs, MPI_DOUBLE_COMPLEX, MPI_COMM_WORLD, &
& error)
if (mpi_proc_id == 0) then
do i = -5, 5
write(*,*) i, ":", c(i)
end do
end if
deallocate(recvcount, displs)
call MPI_Finalize(error)
contains
!! #description: split the range (a,b) into equal chunks, where each chunk is
!! handled by a different MPI process
!! #param: a On input, the lower bound of an array to be processed. On
!! output, the lower index of the chunk that the MPI process
!! `proc_id` should process
!! #param: b On input, the upper bound of an array. On, output the
!! upper index of the chunk that process `proc_id` should
!! process.
!! #param: n_procs The total number of available processes. If not given,
!! this is determined automatically from the MPI environment.
!! #param: proc_id The (zero-based) process ID (`0 <= proc_id < n_procs`). If
!! not given, the ID of the current MPI process
!! #param: counts If given, must be of size `n_procs`. On output, the chunk
!! size for each MPI process
!! #param: displs If given, must be of size `n_procs`. On output, the offset
!! of the first index processed by each MPI process, relative
!! to the input value of `a`
subroutine mpi_split(a, b, n_procs, proc_id, counts, displs)
integer, intent(inout) :: a
integer, intent(inout) :: b
integer, optional, intent(in) :: n_procs
integer, optional, intent(in) :: proc_id
integer, optional, intent(inout) :: counts(:)
integer, optional, intent(inout) :: displs(:)
integer :: mpi_n_procs, n_tasks, mpi_proc_id, error
integer :: aa, bb
real(idp) :: split_size
logical :: mpi_is_initialized
mpi_n_procs = 1
if (present(n_procs)) mpi_n_procs = n_procs
mpi_proc_id = 0
if (present(proc_id)) mpi_proc_id = proc_id
if (.not. present(n_procs)) then
call MPI_Comm_size(MPI_COMM_WORLD, mpi_n_procs, error)
end if
if (.not. present(proc_id)) then
call MPI_Comm_rank(MPI_COMM_WORLD, mpi_proc_id, error)
end if
aa = a
bb = b
n_tasks = bb - aa + 1
split_size = real(n_tasks, idp) / real(max(mpi_n_procs, 1), idp)
a = nint(mpi_proc_id * split_size) + aa
b = min(aa + nint((mpi_proc_id+1) * split_size) - 1, bb)
if (present(counts)) then
do mpi_proc_id = 0, mpi_n_procs-1
counts(mpi_proc_id+1) = max(nint((mpi_proc_id+1) * split_size) &
& - nint((mpi_proc_id) * split_size), 0)
end do
end if
if (present(displs)) then
do mpi_proc_id = 0, mpi_n_procs-1
displs(mpi_proc_id+1) = min(nint(mpi_proc_id * split_size), bb-aa)
end do
end if
end subroutine mpi_split
end program
I have a program in R which calls a couple of Fortran routines, which are OpenMP-enabled. There are two Fortran routines, sub_1 and sub_2. The first one is called twice in an R function, while the second is called once. Both routines are almost identical except for a few minor things. I call the first routine, then the second, then the first again. However, if I have both of them OpenMP-enabled, the function stops doing anything (it doesn't raise an error or stop execution, it just sits there) when it gets to the second use of the first Fortran routine.
If I disable the OpenMP in sub_1, then everything runs fine. If I instead disable the OpenMP in sub_2, then it again hangs in the same fashion on the second use of sub_1. This is odd because it obviously gets through the first use fine.
I thought it might have to do with the threads not closing properly or something (I don't know too much about OpenMP). However, another oddity is that the R function that calls these three routines is itself called four times, and if I only enable OpenMP in sub_2, then this works fine (i.e. the second, third, etc. call to sub_2 doesn't hang). I just have no idea why it would do this! For reference, this is the code for sub_1:
subroutine correlation_dd_rad(s_bins,min_s,end_s,n,pos1,dd,r)
!!! INTENT IN !!!!!!!!
integer :: s_bins !Number of separation bins
integer :: N !Number of objects
real(8) :: pos1(3,N) !Cartesian Positions of particles
real(8) :: min_s !The smallest separation calculated.
real(8) :: end_s !The largest separation calculated.
real(8) :: r(N) !The radii of each particle (ascending)
!!! INTENT OUT !!!!!!!
real(8) :: dd(N,s_bins) !The binned data.
!!! LOCAL !!!!!!!!!!!!
integer :: i,j !Iterators
integer :: bin
real(8) :: d !Distance between particles.
real(8) :: dr,mins,ends
real(8),parameter :: pi = 3.14159653589
integer :: counter
dd(:,:) = 0.d0
dr = (end_s-min_s)/s_bins
!Perform the separation binning
mins = min_s**2
ends = end_s**2
counter = 1000
!$OMP parallel do private(d,bin,j)
do i=1,N
!$omp critical (count_it)
counter = counter - 1
!$omp end critical (count_it)
if(counter==0)then
counter = 1000
write(*,*) "Another Thousand"
end if
do j=i+1,N
if(r(j)-r(i) .GT. end_s)then
exit
end if
d=(pos1(1,j)-pos1(1,i))**2+&
&(pos1(2,j)-pos1(2,i))**2+&
&(pos1(3,j)-pos1(3,i))**2
if(d.LT.ends .AND. d.GT.mins)then
d = Sqrt(d)
bin = Floor((d-min_s)/dr)+1
dd(i,bin) = dd(i,bin)+1.d0
dd(j,bin) = dd(j,bin)+1.d0
end if
end do
end do
!$OMP end parallel do
write(*,*) "done"
end subroutine
Does anyone have any clue why this would happen??
Cheers.
I'll add in the smallest example that I can think of that reproduces the problem (by the way, this must be an R problem: a small example of the same type as I present here, but written in Fortran, works fine). So I have the above code and the following code in Fortran, compiled to the shared object correlate.so:
subroutine correlation_dr_rad(s_bins,min_s,end_s,n,pos1,n2,pos2,dd,r1,r2)
!!! INTENT IN !!!!!!!!
integer :: s_bins !Number of separation bins
integer :: N !Number of objects
integer :: n2
real(8) :: pos1(3,N) !Cartesian Positions of particles
real(8) :: pos2(3,n2) !random particles
real(8) :: end_s !The largest separation calculated.
real(8) :: min_s !The smallest separation
real(8) :: r1(N),r2(N2) !The radii of particles (ascending)
!!! INTENT OUT !!!!!!!
real(8) :: dd(N,s_bins) !The binned data.
!!! LOCAL !!!!!!!!!!!!
integer :: i,j !Iterators
integer :: bin
real(8) :: d !Distance between particles.
real(8) :: dr,mins,ends
real(8),parameter :: pi = 3.14159653589
integer :: counter
dd(:,:) = 0.d0
dr = (end_s-min_s)/s_bins
!Perform the separation binning
mins = min_s**2
ends = end_s**2
write(*,*) "Got just before parallel dr"
counter = 1000
!$OMP parallel do private(d,bin,j)
do i=1,N
!$OMP critical (count)
counter = counter - 1
!$OMP end critical (count)
if(counter==0)then
write(*,*) "Another thousand"
counter = 1000
end if
do j=1,N2
if(r2(j)-r1(i) .GT. end_s)then
exit
end if
d=(pos1(1,j)-pos2(1,i))**2+&
&(pos1(2,j)-pos2(2,i))**2+&
&(pos1(3,j)-pos2(3,i))**2
if(d.GT.mins .AND. d.LT.ends)then
d = Sqrt(d)
bin = Floor((d-min_s)/dr)+1
dd(i,bin) = dd(i,bin)+1.d0
end if
end do
end do
!$OMP end parallel do
write(*,*) "Done"
end subroutine
Then in R, I have the following functions - the first two just wrap the above Fortran code. The third calls them in a similar way to my actual code:
correlate_dd_rad = function(pos,r,min_r,end_r,bins){
#A wrapper for the fortran routine of the same name.
dyn.load('correlate.so')
out = .Fortran('correlation_dd_rad',
s_bins = as.integer(bins),
min_s = as.double(min_r),
end_s = as.double(end_r),
n = as.integer(length(r)),
pos = as.double(t(pos)),
dd = matrix(0,length(r),bins), #The output matrix.
r = as.double(r))
dyn.unload('correlate.so')
return(out$dd)
}
correlate_dr_rad = function(pos1,r1,pos2,r2,min_r,end_r,bins){
#A wrapper for the fortran routine of the same name
N = length(r1)
N2 = length(r2)
dyn.load('correlate.so')
out = .Fortran('correlation_dr_rad',
s_bins = as.integer(bins),
min_s = as.double(min_r),
end_s = as.double(end_r),
n = N,
pos1 = as.double(t(pos1)),
n2 = N2,
pos2 = as.double(t(pos2)),
dr = matrix(0,nrow=N,ncol=bins),
r1 = as.double(r1),
r2 = as.double(r2))
dyn.unload('correlate.so')
return(out$dr)
}
the_calculation = function(){
#Generate some data to use
pos1 = matrix(rnorm(30000),10000,3)
pos2 = matrix(rnorm(30000),10000,3)
#Find the radii
r1 = sqrt(pos1[,1]^2 + pos1[,2]^2+pos1[,3]^2)
r2 = sqrt(pos2[,1]^2 + pos2[,2]^2+pos2[,3]^2)
#usually sort them but it doesn't matter here.
#Now call the functions
print("Calculating the data-data pairs")
dd = correlate_dd_rad(pos=pos1,r=r1,min_r=0.001,end_r=0.8,bins=15)
print("Calculating the data-random pairs")
dr = correlate_dr_rad(pos1,r1,pos2,r2,min_r=0.001,end_r=0.8,bins=15)
print("Calculating the random-random pairs")
rr = correlate_dd_rad(pos=pos2,r=r2,min_r=0.001,end_r=0.8,bins=15)
#Now we would do something with it but I don't care in this example.
print("Done")
}
Running this I get the output:
[1] "Calculating the data-data pairs"
Another Thousand
Another Thousand
Another Thousand
Another Thousand
Another Thousand
Another Thousand
Another Thousand
Another Thousand
Another Thousand
Another Thousand
done
[1] "Calculating the data-random pairs"
Got just before parallel dr
Another thousand
Another thousand
And then it just sits there... Actually, running it a few times has shown that it changes where it hangs each time. Sometimes it gets most of the way through the second call to correlate_dd_rad and others it only gets halfway through the call to correlate_dr_rad.
I am not sure if this will solve your problem, but it is indeed a bug. In subroutine correlation_dd_rad, where you intended to close the parallel region, you actually put a plain comment. To be clearer, the line that reads:
!OMP end parallel do
should be converted to:
!$OMP end parallel do
As side notes:
you don't need to use omp_lib if you don't call the library functions
you can use the atomic construct (see section 2.8.5 of the latest OpenMP specifications) to access a specific storage location atomically, instead of a critical construct; a small sketch follows these notes
always give a name to critical constructs, since (section 2.8.2 of the specifications):
All critical constructs without a name are considered to have the same unspecified name.
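For example, a minimal sketch of the second point, mirroring the counter update in correlation_dd_rad (the later unsynchronized read of counter is left untouched here):

!$omp atomic
counter = counter - 1

The atomic construct protects just this single update of the shared variable, which is typically cheaper than entering and leaving a critical region.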