I'm getting some seemingly odd results when I try to update a float3 variable in an OpenCL kernel. Boiling it down:
float3 vel = float3( 0 ); // a
vel = float3( 0, 1, 0 ); // b
vel = (float3)( 0, 2, 0 ); // c
If I print vel after each call with:
if( get_global_id( 0 ) == 0 )
printf( "[%d]: vel: ( %f, %f, %f )\n", index, vel.x, vel.y, vel.z );
Then I see that a) correctly initializes vel, but b) doesn't do anything, while c) works. Does anyone know why I can't update the variable with a new float3 value as I'm doing in b)? This is how I'm used to doing it in C++ and GLSL. Or is it possibly a driver bug?
Using OpenCL 1.2 on macbook pro running OS X 10.11.5.
Only c) is a correct way of initialising/using vector types. a) and b) are possibly a bug in the Mac implementation (on the 2 GPUs and 1 CPU I tried, they didn't compile).
A few ways of initialising a vector type:
float3 vel = (float3)( 1,1,1 );
float3 vel2 = (float3) 1; // it will be ( 1,1,1 )
float3 vel3 = 1; // it will be ( 1,1,1 )
More about usage of vector types: spec
In C, the comma operator has very specific semantics, as described in detail here (or more briefly here):
(a, b, c) is a sequence of expressions, separated by commas, which evaluates to the last expression c
For your second approach (b), this means that your expression will boil down to float3( 0 ), which gives you the same result as in the first approach (a).
Your third approach (c) is making use of specific syntax introduced by OpenCL C which allows for initialising individual elements of a vector, and so doesn't fall into this trap.
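To make the difference concrete, here is a minimal sketch (a hypothetical kernel, not from the question) contrasting the comma operator with OpenCL C's vector-literal syntax:
kernel void demo(global float3 *out)
{
    float3 a = (float3)(0.0f);              // vector literal: (0, 0, 0)
    float3 c = (float3)(0.0f, 2.0f, 0.0f);  // vector literal: (0, 2, 0)
    float  s = (0.0f, 1.0f, 0.0f);          // comma operator: s == 0.0f, the last expression
    out[0] = a + c + (float3)(s);
}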
Some cases of periodic boundary conditions (PBC) can be imposed very efficiently on integers by simply doing:
myWrappedWithinPeriodicBoundary = myUIntValue & mask
This works when the boundary is the half open range [0, upperBound), where the (exclusive) upperBound is 2^exp so that
mask = (1 << exp) - 1
For example:
let pbcUpperBoundExp = 2 // so the periodic boundary will be [0, 4)
let mask = (1 << pbcUpperBoundExp) - 1
for x in -7 ... 7 { print(x & mask, terminator: " ") }
(in Swift) will print:
1 2 3 0 1 2 3 0 1 2 3 0 1 2 3
Question: Is there any (roughly similar) efficient method for imposing (some cases of) PBCs on floating point-numbers (32 or 64-bit IEEE-754)?
There are several reasonable approaches:
fmod(x,1)
modf(x,&dummy) — has the advantage of knowing its divisor statically, but in my testing comes from libc.so.6 even with -ffast-math
x-floor(x) (suggested by Jens in a comment) — supports negative inputs directly
Manual bit-twiddling direct implementation
Manual bit-twiddling implementation of floor
The first two preserve the sign of their input; you can add 1 if it's negative.
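As a sketch (not the benchmarked code), the three library-based approaches can be wrapped to [0, 1) in C along these lines; the +1 branch is the adjustment for negative inputs mentioned above:
#include <math.h>

double wrap_fmod(double x)  { double r = fmod(x, 1.0); return r < 0.0 ? r + 1.0 : r; }
double wrap_modf(double x)  { double ip, r = modf(x, &ip); return r < 0.0 ? r + 1.0 : r; }
double wrap_floor(double x) { return x - floor(x); }   /* handles negatives directly */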
The two bit manipulations are very similar: you identify which significand bits correspond to the integer portion, and mask them (for the direct implementation) or the rest (to implement floor) off. The direct implementation can be completed either with a floating-point division or with a shift to reassemble the double manually; the former is 28% faster even given hardware CLZ. The floor implementation can immediately reconstitute a double: floor never changes the exponent of its argument unless it returns 0. About 20 lines of C are required.
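A sketch of one way the floor-style bit manipulation could look, assuming IEEE-754 binary64 and memcpy-based type punning (an illustration, not necessarily the exact 20-line version referred to above):
#include <stdint.h>
#include <string.h>

static double floor_bits(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    int e = (int)((bits >> 52) & 0x7FF) - 1023;            /* unbiased exponent */
    if (e < 0)   return (x >= 0.0) ? 0.0 : -1.0;           /* |x| < 1 */
    if (e >= 52) return x;                                 /* no fractional bits */
    uint64_t frac_mask = (UINT64_C(1) << (52 - e)) - 1;
    if ((bits & frac_mask) == 0) return x;                 /* already an integer */
    bits &= ~frac_mask;                                    /* truncate toward zero */
    double t;
    memcpy(&t, &bits, sizeof t);
    return (x >= 0.0) ? t : t - 1.0;                       /* fix up negative inputs */
}

double wrap_floor_bits(double x) { return x - floor_bits(x); }   /* PBC to [0, 1) */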
The following timing is with double and gcc -O3, with timing loops over representative inputs into which the operative code was inlined.
fmod: 41.8 ns
modf: 19.6 ns
floor: 10.6 ns
With -ffast-math:
fmod: 26.2 ns
modf: 30.0 ns
floor: 21.9 ns
Bit manipulation:
direct: 18.0 ns
floor: 20.6 ns
The manual implementations are competitive, but the floor technique is the best. Oddly, two of the three library functions perform better without -ffast-math: that is, as a PLT function call rather than as an inlined builtin function.
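For completeness, the floor-based wrap generalizes from [0, 1) to an arbitrary half-open range [lo, hi) along these lines (a sketch mirroring the wrapped(to:) function in the answer below):
#include <math.h>

double wrap_range(double x, double lo, double hi)
{
    double measure = hi - lo;
    double scaled  = (x - lo) / measure;             /* map [lo, hi) onto [0, 1) */
    double wrapped = (scaled - floor(scaled)) * measure + lo;
    /* As the next answer notes, a tiny negative scaled value can make the
       result land exactly on hi; map it back to lo (periodically equivalent)
       if that matters for your use case. */
    return (wrapped >= hi) ? lo : wrapped;
}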
I'm adding this answer to my own question since it describes the best solution I have found at the time of writing. It's in Swift 4.1 (should be straightforward to translate into C) and it's been tested in various use cases:
extension BinaryFloatingPoint {
    /// Returns the value after restricting it to the periodic boundary
    /// condition [0, 1).
    /// See https://forums.swift.org/t/why-no-fraction-in-floatingpoint/10337
    @_transparent
    func wrappedToUnitRange() -> Self {
        let fract = self - self.rounded(.down)
        // Have to clamp to just below 1 because very small negative values
        // will otherwise return an out of range result of 1.0.
        // Turns out this:
        if fract >= 1.0 { return Self(1).nextDown } else { return fract }
        // is faster than this:
        //return min(fract, Self(1).nextDown)
    }
    @_transparent
    func wrapped(to range: Range<Self>) -> Self {
        let measure = range.upperBound - range.lowerBound
        let recipMeasure = Self(1) / measure
        let scaled = (self - range.lowerBound) * recipMeasure
        return scaled.wrappedToUnitRange() * measure + range.lowerBound
    }
    @_transparent
    func wrappedIteratively(to range: Range<Self>) -> Self {
        var v = self
        let measure = range.upperBound - range.lowerBound
        while v >= range.upperBound { v = v - measure }
        while v < range.lowerBound { v = v + measure }
        return v
    }
}
On my MacBook Pro with a 2 GHz Intel Core i7,
a hundred million (probably inlined) calls to wrapped(to range:) on random (finite) Double values take 0.6 seconds, which is about 166 million calls per second (not multi-threaded). Whether the range is statically known or not, or has bounds or a measure that is a power of two etc., can make some difference, but not as much as one might have thought.
wrappedToUnitRange() takes about 0.2 seconds, meaning 500 million calls per second on my system.
Given the right scenario, wrappedIteratively(to range:) is as fast as wrappedToUnitRange().
The timings have been made by comparing a baseline test (which doesn't wrap the value, but still uses it to compute e.g. a simple xor checksum) to the same test where the value is wrapped. The difference in time between these is the time I have given for the wrapping calls.
I have used the Swift development toolchain from 2018-02-21, compiling with -O -whole-module-optimization -static-stdlib -gnone. Care has been taken to make the tests relevant, i.e. preventing dead code removal, using truly random input of different distributions, etc. Writing the wrapping functions generically, as in this extension on BinaryFloatingPoint, turned out to be optimized into code equivalent to writing separate specialized versions for e.g. Float and Double.
It would be interesting to see someone more skilled than me investigating this further (C or Swift or any other language doesn't matter).
EDIT:
For anyone interested, here are some versions for simd float2:
extension float2 {
    @_transparent
    func wrappedInUnitRange() -> float2 {
        return simd.fract(self)
    }
    @_transparent
    func wrappedToMinusOneToOne() -> float2 {
        let scaled = (self + float2(1, 1)) * float2(0.5, 0.5)
        let scaledFract = scaled - floor(scaled)
        let wrapped = simd_muladd(scaledFract, float2(2, 2), float2(-1, -1))
        // Note that we have to make sure the result is not out of bounds, like
        // simd fract does:
        let oneNextDown = Float(bitPattern:
            0b0_01111110_11111111111111111111111)
        let oneNextDownFloat2 = float2(oneNextDown, oneNextDown)
        return simd.min(wrapped, oneNextDownFloat2)
    }
    @_transparent
    func wrapped(toLowerBound lowerBound: float2,
                 upperBound: float2) -> float2
    {
        let measure = upperBound - lowerBound
        let recipMeasure = simd_precise_recip(measure)
        let scaled = (self - lowerBound) * recipMeasure
        let scaledFract = scaled - floor(scaled)
        // Note that we have to make sure the result is not out of bounds, like
        // simd fract does:
        let wrapped = simd_muladd(scaledFract, measure, lowerBound)
        let maxX = upperBound.x.nextDown // For some reason, this won't be
        let maxY = upperBound.y.nextDown // optimized even when upperBound is
        // statically known, and there is no similar simd function available.
        let maxValue = float2(maxX, maxY)
        return simd.min(wrapped, maxValue)
    }
}
I asked some related simd questions here which might be of interest.
EDIT2:
As can be seen in the above Swift Forums thread:
// Note that tiny negative values like:
let x: Float = -1e-08
// May produce results outside the [0, 1) range:
let wrapped = x - floor(x)
print(wrapped < 1.0) // false
// which may result in out-of-bounds table accesses
// in common usage, so it's probably better to use:
let correctlyWrapped = simd_fract(x)
print(correctlyWrapped < 1.0) // true
I have since updated the code to account for this.
I have a simple kernel in OpenCL that has the following structure:
kernel void simple_select(global double *input, global double *output) {
size_t i = get_global_id(0);
printf("input %d\n", (int)(input[i] != 0.0));
output[i] = select((float)0.0, (float)1.0, (int)(input[i] != 0.0));
//output[i] = select((float)0.0, (float)1.0, 1);
}
Equivalently this can be:
kernel void simple_select(global double *input, global double *output) {
size_t i = get_global_id(0);
printf("input %d\n", (int)(input[i] != 0.0));
output[i] = input[i] != 0.0 ? 1.0 : 0.0;
//output[i] = 1 ? 1.0 : 0.0;
}
When I print to the command line, I see:
input 1
input 1
input 1
But the output array contains all 0.0. However, if I uncomment the last line of the kernel and comment out the second-to-last line (i.e. if I use the scalar 1 in the select statement), then it works as expected and the output array contains all 1.0. So what is the difference between these two lines that leads to two different results?
Here is the answer: it's a quirk in OpenCL. The problem is that true/false values for scalars are 1/0 (as printf has shown you), but true/false values for vectors are -1/0, and the latter is also what select() expects in its last argument (more precisely, it expects the MSB to be set, which means any negative integer).
Though I think the ternary operator on scalars should still work as expected; if it doesn't, I would consider it a bug.
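A sketch of one possible workaround, assuming the implementation really is applying the vector (MSB) rule here: negate the 0/1 comparison result so that "true" becomes -1 before handing it to select().
kernel void simple_select(global double *input, global double *output) {
    size_t i = get_global_id(0);
    // -1 has its MSB set, so it reads as true under both the scalar and the
    // vector rule; 0 stays false either way.
    int cond = -(int)(input[i] != 0.0);
    output[i] = select(0.0f, 1.0f, cond);
}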
I cannot solve a problem in Scilab because it gets stuck because of round-off errors. I get the message
!--error 9999
Error: Round-off error detected, the requested tolerance (or default) cannot be achieved. Try using bigger tolerances.
at line 2 of function scalpol called by :
at line 7 of function gram_schmidt_pol called by :
gram_schmidt_pol(a,-1/2,-1/2)
It's a Gram-Schmidt process where the scalar product is the integral, between -1 and 1, of the product of two functions and a weight.
gram_schmidt_pol is the process written specifically for polynomials, and scalpol is the scalar product described for polynomials.
The a and b are parameters for the weight, which is (1+x)^a*(1-x)^b.
The input is a matrix representing a set of vectors. It works well with the matrix [[1;2;3],[4;5;6],[7;8;9]], but it fails with the above error message on the matrix eye(2,2), and on top of this, I need to run it on eye(9,9)!
I have looked for a "tolerance setting" in the menus; there is one in General->Preferences->Xcos->Simulation, but I believe that is not what I want. I have tried low settings (high tolerance) in it and it hasn't changed anything.
So how can I solve this round-off problem?
Feel free to tell me if my message lacks clarity.
Thank you.
Edit: Code of the functions :
// function that evaluates a polynomial (vector of coefficients) in x
function [y] = pol(p, x)
    y = 0
    for i=1:length(p)
        y = y + p(i)*x^(i-1)
    end
endfunction

// weight function evaluated in x, parametrized by a and b
// (poids = weight in French)
function [y] = poids(x, a, b)
    y = (1-x)^a*(1+x)^b
endfunction

// scalpol computes the scalar product between polynomials p1 and p2
// using integrate, the weight and the pol functions.
function [s] = scalpol(p1, p2, a, b)
    s = integrate('poids(x,a, b)*pol(p1,x)*pol(p2,x)', 'x', -1, 1)
endfunction

// norm associated to scalpol
function [y] = normscalpol(f, a, b)
    y = sqrt(scalpol(f, f, a, b))
endfunction

// finally the Gram-Schmidt process on a family of polynomials
// represented by a matrix
function [o] = gram_schmidt_pol(m, a, b)
    [n,p] = size(m)
    o(1:n) = m(1:n,1)/(normscalpol(m(1:n,1), a, b))
    for k = 2:p
        s = 0
        for i = 1:(k-1)
            s = s + (scalpol(o(1:n,i), m(1:n,k), a, b) / scalpol(o(1:n,i),o(1:n,i), a, b) .* o(1:n,i))
        end
        o(1:n,k) = m(1:n,k) - s
        o(1:n,k) = o(1:n,k) ./ normscalpol(o(1:n,k), a, b)
    end
endfunction
By default, Scilab's integrate routine tries to achieve absolute error at most 1e-8 and relative error at most 1e-14. This is reasonable, but its treatment of relative error does not take into account the issues that occur when the exact value is zero. (See How to calculate relative error when true value is zero?). For this reason, even the simple
integrate('x', 'x', -1, 1)
throws an error (in Scilab 5.5.1).
And this is what happens in the process of running your program: some integrals are zero. There are two solutions:
(A) Give up on the relative error bound, by specifying it as 1:
integrate('...', 'x', -1, 1, 1e-8, 1)
(B) Add some constant to the function being integrated, then subtract its contribution (the constant times the length of the interval) from the result:
integrate('100 + ... ', 'x', -1, 1) - 200
(The latter should work in most cases, though if the integral happens to be exactly -200, you'll have the same problem again)
The above works for gram_schmidt_pol(eye(2,2), -1/2, -1/2), but for something larger, say gram_schmidt_pol(eye(9,9), -1/2, -1/2), it throws the error "The integral is probably divergent, or slowly convergent".
It appears that the adaptive integration routine can't handle functions of the kind you have. A fallback is to use the simple inttrap instead, which just applies the trapezoidal rule. Since the function poids is undefined at x = -1 and x = 1, the endpoints have to be excluded.
function [s] = scalpol(p1, p2, a, b)
    t = -0.9995:0.001:0.9995
    y = poids(t,a, b).*pol(p1,t).*pol(p2,t)
    s = inttrap(t,y)
endfunction
In order for this to work, other related functions must be vectorized (* and ^ changed to .* and .^ where necessary):
function [y] = pol(p, x)
    y = 0
    for i=1:length(p)
        y = y + p(i)*x.^(i-1)
    end
endfunction

function [y] = poids(x, a, b)
    y = (1-x).^a.*(1+x).^b
endfunction
The result is guaranteed to work, though the precision may be a bit lower: you are going to get some numbers like 3D-16 which are actually zeros.
Dear all, I tried to find an answer by googling, but I haven't been able to find one.
I'm using FFTW in an MPI Fortran application and I need to compute the forward and backward transform of a 3D array of tensors, component by component, and while in Fourier space compute some complex tensorial quantities.
In order to make the array used by FFTW useful and not spend a lot of time moving data from one array to another, the option that came to mind was to declare a 5-dimensional array, i.e.:
use, intrinsic :: iso_c_binding
include 'mpif.h'
include 'fftw3-mpi.f03'
integer(C_INTPTR_T), parameter :: FFTDIM=3 !fft dimension
integer(C_INTPTR_T) :: fft_L !x direction
integer(C_INTPTR_T) :: fft_M !y direction
integer(C_INTPTR_T) :: fft_N !z direction
complex(C_DOUBLE_COMPLEX), pointer :: fft_in(:,:,:,:,:), fft_out(:,:,:,:,:)
type(C_PTR) :: fft_plan_fwd, fft_plan_bkw, fft_datapointer
integer(C_INTPTR_T) :: fft_alloc_local, fft_local_n0, fft_local_0_start
call MPI_INIT( mpi_err )
call MPI_COMM_RANK( MPI_COMM_WORLD, mpi_rank, mpi_err )
call MPI_COMM_SIZE( MPI_COMM_WORLD, mpi_size, mpi_err )
call fftw_mpi_init
fft_L=problem_dim(1)
fft_M=problem_dim(2)
fft_N=problem_dim(3)
! CALCULATE LOCAL SIZE OF FFT VARIABLE FOR EACH COMPONENT
fft_alloc_local = fftw_mpi_local_size_3d(fft_N,fft_M,fft_L, MPI_COMM_WORLD, &
fft_local_n0, fft_local_0_start)
! allocate data pointer
fft_datapointer = fftw_alloc_complex(9*int(fft_alloc_local,C_SIZE_T))
! link pointers to the same array
call c_f_pointer(fft_datapointer, fft_in, [ FFTDIM, FFTDIM, fft_L, fft_M, fft_local_n0])
call c_f_pointer(fft_datapointer, fft_out, [ FFTDIM, FFTDIM, fft_L, fft_M, fft_local_n0])
! create plans
fft_plan_fwd = fftw_mpi_plan_dft_3d(fft_N, fft_M, fft_L, & ! dimensions
               fft_in(1,1,:,:,:), fft_out(1,1,:,:,:), & ! input, output
               MPI_COMM_WORLD, FFTW_FORWARD, FFTW_MEASURE)
fft_plan_bkw = fftw_mpi_plan_dft_3d(fft_N, fft_M, fft_L, & ! dimensions
               fft_in(1,1,:,:,:), fft_out(1,1,:,:,:), & ! input, output
               MPI_COMM_WORLD, FFTW_BACKWARD, FFTW_MEASURE)
Now, if I use this piece of code and the number of processors is a multiple of 2 (2, 4, 8, ...), everything works fine, but if I use, for instance, 6, the application gives an error. How could I solve this issue?
Do you have any better strategy that avoids allocating a 5D array and doesn't move too much data around?
Thanks in advance
Andrea
I found the solution to this problem by utilizing the fftw_mpi_plan_many_dft interface.
The code performing this computation follows here. It calculates a 3D (LxMxN) complex-to-complex transform of a tensor, component by component (11, 12, ...), utilizing MPI capabilities. The extent of the third dimension (N) must be divisible by the number of cores used.
program test_fftw
use, intrinsic :: iso_c_binding
implicit none
include 'mpif.h'
include 'fftw3-mpi.f03'
integer(C_INTPTR_T) :: L = 8 ! extent of x data
integer(C_INTPTR_T) :: M = 8 ! extent of y data
integer(C_INTPTR_T) :: N = 192 ! extent of z data
integer(C_INTPTR_T) :: FFT_12_DIM = 3 ! tensor dimension
integer(C_INTPTR_T) :: ll, mm, nn, i, j
complex(C_DOUBLE_COMPLEX) :: fout
! many plan data variables
integer(C_INTPTR_T) :: howmany=9 ! number of elements of the tensor
integer :: rank=3 ! rank of the transform
integer(C_INTPTR_T), dimension(3) :: fft_dims ! array containing data extent
integer(C_INTPTR_T) :: alloc_local_many, fft_local_n0, fft_local_0_start
complex(C_DOUBLE_COMPLEX), pointer :: fft_data(:,:,:,:,:)
type(C_PTR) ::fft_datapointer, plan_many
integer :: ierr, myid, nproc
! Initialize
call mpi_init(ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
call fftw_mpi_init()
! write data dimensions in reversed order
fft_dims(3) = L
fft_dims(2) = M
fft_dims(1) = N
! use of alloc many
alloc_local_many = fftw_mpi_local_size_many(rank, & ! rank of the transform, in this case 3
     fft_dims, & ! array containing the data dimensions in reversed order
     howmany, & ! number of transforms to compute, in this case 3x3=9
     FFTW_MPI_DEFAULT_BLOCK, & ! default block size
     MPI_COMM_WORLD, & ! MPI communicator
     fft_local_n0, & ! local number of slices per core
     fft_local_0_start) ! local shift on the last dimension
fft_datapointer = fftw_alloc_complex(alloc_local_many) ! allocate aligned memory for the data
! associate data pointer with allocated memory: note natural order
call c_f_pointer(fft_datapointer, fft_data, [FFT_12_DIM,FFT_12_DIM,L,M, fft_local_n0])
! create the plan for many inplace multidimensional transform
plan_many = fftw_mpi_plan_many_dft( &
rank , fft_dims, howmany, &
FFTW_MPI_DEFAULT_BLOCK, FFTW_MPI_DEFAULT_BLOCK, &
fft_data, fft_data, &
MPI_COMM_WORLD, FFTW_FORWARD, FFTW_ESTIMATE )
! initialize data to some function my_function(i,j)
do nn = 1, fft_local_n0
   do mm = 1, M
      do ll = 1, L
         do i = 1, FFT_12_DIM
            do j = 1, FFT_12_DIM
               fout = ll*mm*nn*i*j
               fft_data(i,j,ll,mm,nn) = fout
            end do
         end do
      end do
   end do
end do
call fftw_mpi_execute_dft(plan_many, fft_data, fft_data)!
call fftw_destroy_plan(plan_many)
call fftw_mpi_cleanup()
call fftw_free(fft_datapointer)
call mpi_finalize(ierr)
end program test_fftw
Thanks everyone for the help!
I have two p*n arrays, y and ymiss. y contains real numbers and NAs. ymiss contains 1s and 0s, so that if y(i,j)==NA, then ymiss(i,j)==0, and 1 otherwise. I also have a 1*n array ydim which tells how many real numbers there are in y(1:p,n), so ydim has values from 0 to p.
In the R programming language, I can do the following:
if(ydim!=p && ydim!=0)
y(1:ydim(t), t) = y(ymiss(,t), t)
That code arranges all the real numbers of y(,t) at the front, like this:
first there is, for example,
y(,t) = (3,1,NA,6,2,NA)
after the code it's
y(,t) = (3,1,6,2,2,NA)
Now I will only need those first 1:ydim(t) elements, so it doesn't matter what the rest are.
The question is, how can I do something like that in Fortran?
Thanks,
Jouni
The "where statement" and the "merge" intrinsic function are powerful, operating on selected positions in arrays, but they don't move items to the front of an array. With old-fashioned code with explicit indexing (could be packaged into a function) e.g.:
k = 1
do i = 1, n
   if (ymiss (i) == 1) then
      y(k) = y(i)
      k = k + 1
   end if
end do
What you want could be done with array intrinsics using the "pack" intrinsic. Convert ymiss into a logical array: 0 --> .false., 1 --> .true.. Then use code like (tested without the second index):
y(1:ydim(t), t) = pack (y (:,t), ymiss (:,t))
Edit to add example code, showing use of Fortran intrinsics "where", "count" and "pack". "where" alone can't solve the problem, but "pack" can. I used "< -90" as NaN for this example. The step "y (ydim+1:LEN) = -99.0" isn't required by the OP, who doesn't need to use these elements.
program test1
   integer, parameter :: LEN = 6
   real, dimension (1:LEN) :: y = [3.0, 1.0, -99.0, 6.0, 2.0, -99.0 ]
   real, dimension (1:LEN) :: y2
   logical, dimension (1:LEN) :: ymiss
   integer :: ydim

   y2 = y
   write (*, '(/ "The input array:" / 6(F6.1) )' ) y

   where (y < -90.0)
      ymiss = .false.
   elsewhere
      ymiss = .true.
   end where
   ydim = count (ymiss)

   where (ymiss) y2 = y
   write (*, '(/ "Masking with where does not rearrange:" / 6(F6.1) )' ) y2

   y (1:ydim) = pack (y, ymiss)
   y (ydim+1:LEN) = -99.0
   write (*, '(/ "After using pack, and ""erasing"" the end:" / 6(F6.1) )' ) y

   stop
end program test1
Output is:
The input array:
3.0 1.0 -99.0 6.0 2.0 -99.0
Masking with where does not rearrange:
3.0 1.0 -99.0 6.0 2.0 -99.0
After using pack, and "erasing" the end:
3.0 1.0 6.0 2.0 -99.0 -99.0
In Fortran you can't store NA in an array of real numbers; you can only store real numbers. So you'll probably want to replace NAs with some value not likely to be present in your data: huge() might be suitable. 2D arrays are no problem at all for Fortran. You might want to use a 2D array of logicals to replace ymiss rather than a 2D array of 1s and 0s.
There is no simple intrinsic to achieve what you want; you'd need to write a function. However, a more Fortran way of doing things would be to use the array of logicals as a mask for the operations you want to carry out.
So, here's some fragmentary Fortran code, not tested:
! Declarations
real(8), dimension(m,n) :: y, ynew
logical, dimension(m,n) :: ymiss
! Executable
where (ymiss) ynew = func(y) ! here func() is whatever your function is