Loss of precision: Parallel 2D FFT using 1D FFTW and MPI calls

I am trying to match the result of a 2D FFT computed with FFTW's built-in 2D routines against my own 2D FFT built from 1D FFTW calls and MPI communication.
To summarize, I've followed the theory:
1 - FFT in y dimension
2 - transpose the matrix
3 - MPI_Alltoall communication
4 - FFT in x dimension
5 - transpose back
6 - MPI_Alltoall communication
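The steps above can be sketched on a single process, where each MPI_Alltoall exchange reduces to a plain transpose. This is only an illustration in Python with a naive DFT standing in for the 1D FFTW calls; the helper names (`dft`, `dft2_by_1d`, `rms_error`) are made up for this sketch and are not from the original code.

```python
import cmath

def dft(x):
    """Naive O(n^2) 1D DFT; stands in for a 1D FFTW plan."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
            for k in range(n)]

def transpose(m):
    return [list(col) for col in zip(*m)]

def dft2_direct(m):
    """Reference 2D DFT computed straight from the definition."""
    rows, cols = len(m), len(m[0])
    return [[sum(m[r][c] * cmath.exp(-2j * cmath.pi * (u * r / rows + v * c / cols))
                 for r in range(rows) for c in range(cols))
             for v in range(cols)]
            for u in range(rows)]

def dft2_by_1d(m):
    """Row transforms, transpose, row transforms again, transpose back."""
    a = [dft(row) for row in m]   # 1D transforms along the stored rows
    a = transpose(a)              # transpose (plus all-to-all when distributed)
    a = [dft(row) for row in a]   # 1D transforms along the other dimension
    return transpose(a)           # transpose back (plus all-to-all)

def rms_error(a, b):
    diffs = [abs(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return (sum(diffs) / len(diffs)) ** 0.5

# Small 3x4 complex test matrix (arbitrary values chosen for the sketch).
matrix = [[complex(r * 4 + c, r - c) for c in range(4)] for r in range(3)]
print(rms_error(dft2_direct(matrix), dft2_by_1d(matrix)))
```

The printed RMS error should be at rounding level for a matrix this small, which gives a serial baseline to compare the distributed version against.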
I've tried this with a small number of processors (8-12) and it seems to work fine. I check correctness with the RMS error between the 2D FFTW call and my own result. However, as I increase the number of cores and the size of the matrix, I seem to be losing precision, i.e. the RMS check fails because the error is larger than expected (I set the tolerance to 1.0e-10):
Given a matrix of 512x512 and an RMS tolerance of 1.0e-6:
"ERROR: Position 1 0 expected im-part 10936907150.600960 and got 10936907150.600958"
Given a matrix of 2048x2048 and an RMS tolerance of 1.0e-6:
"ERROR:Position 1 0 expected real part -4294967296.000107 and got -4294967295.999999"
Why would I lose precision if all I am using are double types?
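The thread does not state a cause, but the two reported mismatches can be put in perspective. A double carries roughly 15 to 17 significant decimal digits, so for values of magnitude 1e10 the gap between adjacent doubles is already on the order of 1e-6. The Python sketch below (the `report` helper is hypothetical) compares the absolute error, the relative error, and the error measured in units of that gap for the two pairs quoted above.

```python
import math

def report(expected, got):
    """Return absolute error, relative error, and error in units of double spacing."""
    abs_err = abs(expected - got)
    rel_err = abs_err / abs(expected)
    spacing = math.ulp(expected)  # gap between adjacent doubles near `expected`
    return abs_err, rel_err, abs_err / spacing

pairs = [
    (10936907150.600960, 10936907150.600958),   # 512x512 case
    (-4294967296.000107, -4294967295.999999),   # 2048x2048 case
]
for expected, got in pairs:
    abs_err, rel_err, ulps = report(expected, got)
    print(f"abs={abs_err:.3e}  rel={rel_err:.3e}  spacing units={ulps:.1f}")
```

Under this view an absolute tolerance of 1.0e-6 is tighter than what a double can resolve at these magnitudes, so a relative error check may be the more meaningful comparison.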

Related

Weak scaling of MPI program for matrix-vector multiplication

I have written some MPI code that solves systems of equations using the conjugate gradient method. In this method, matrix-vector multiplication takes up most of the time. As a parallelization strategy, I do the multiplication in blocks of rows and then gather the results in the root process. The remaining steps are performed by the root process, which broadcasts the results whenever a matrix-vector multiplication needs to be performed.
The strong scaling curve representing the speedup is fine
But the weak scaling curve representing the efficiency is quite bad
In theory, the blue curve should be close to the red one.
Is this intrinsic to the parallelization strategy or am I doing something wrong?
Details
The measurements are in seconds. The experiments are performed on a cluster where each node has 2 Skylake processors running at 2.3 GHz with 18 cores each, 192 GB of DDR3 RAM and an 800 GB NVMe local drive. Amdahl's prediction is computed with the formula (0.0163 + 0.9837 / p)^-1. Gustafson's prediction is computed with the formula 0.9837 + 0.0163 / p, where p is the number of processors. The experimental values are in both cases obtained by dividing the time spent by a single computation unit by the time spent by p computation units.
For weak scaling, I start with a load per processor of W = 1768^2 matrix entries. The load with p processors is then M^2 = pW matrix cells, so the matrix side is set to M = 1768 \sqrt{p} for p processes. This gives sides of approximately 1768, 2500, 3536, 5000, 7071 and 10000 cells for 1, 2, 4, 8, 16 and 32 processors respectively. I also fix the number of iterations to 500 so that the measurements are not affected by the variability in the data.
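As a rough illustration, the two prediction curves and the matrix sides can be tabulated as below. This Python sketch assumes a serial fraction of 0.0163 and takes the parallel fraction to be its complement; the function names are invented for the example, and the sides come out unrounded, so they differ slightly from the rounded values listed in the question.

```python
import math

SERIAL = 0.0163            # serial fraction used in the question
PARALLEL = 1.0 - SERIAL    # assumed complement of the serial fraction

def amdahl_speedup(p):
    """Strong-scaling speedup prediction: 1 / (s + (1 - s) / p)."""
    return 1.0 / (SERIAL + PARALLEL / p)

def weak_efficiency(p):
    """Weak-scaling efficiency curve in the form used in the question."""
    return PARALLEL + SERIAL / p

def matrix_side(p, base=1768):
    """Matrix side M = base * sqrt(p), keeping the per-process load fixed."""
    return base * math.sqrt(p)

for p in (1, 2, 4, 8, 16, 32):
    print(p, round(amdahl_speedup(p), 2), round(weak_efficiency(p), 4),
          round(matrix_side(p)))
```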

I think your Gustafson formula is wrong. It should be:
S_p = F_s + p F_p
You have a division that should be a multiplication. See for instance https://theartofhpc.com/istc/parallel.html#Gustafson'slaw

Apply Fourier shift theorem to complex signal

I'm trying to apply the Fourier phase shift theorem to a complex signal in R. However, only the magnitude of my signal shifts as I expect. I think it should be possible to apply this theorem to complex signals, so I am probably making an error somewhere. My guess is that the error is in the frequency axis I calculate.
How do I correctly apply the Fourier shift theorem to a complex signal (using R)?
i = complex(0,0,1)
t.in = (1+i)*matrix(c(1,0,0,0,0,0,0,0,0,0))
n.shift = 5
#the output of fft() has the mean / 0 frequency at the first element
#it then increases to the highest frequency, flips to negative frequencies
#and then increases again to the negative frequency closest to 0
N = length(t.in)
if (N%%2){ #odd
  kmin = -(N-1)/2
  kmax = (N-1)/2
} else { #even
  kmin = -N/2
  kmax = N/2-1
  #center frequency negative, is that correct?
}
#create frequency axis for fft() output, no sampling frequency or sample duration needed
k = (kmin:kmax)
kflip = floor(N/2)
k = k[c((kflip+1):N,1:kflip)]
f = 2*pi*k/N
shiftterm = exp( -i*n.shift*f )
T.in = fft(t.in)
T.out = T.in*shiftterm
t.out = fft(T.out, inverse=T)/N
par(mfrow=c(2,2))
plot(Mod(t.in),col="green");
plot(Mod(t.out), col="red");
plot(Arg(t.in),col="green");
plot(Arg(t.out),col="red");
As you can see, the magnitude of the signal is nicely shifted, but the phase is scrambled. I think the negative frequencies are where my error is, but I can't see it.
What am I doing wrong?
The questions about fourier phase shift theorem I could find:
real 2d signal in python
real 2d signal in matlab
real 1d signal in python
math question about what fourier shift does
But these were not about complex signals.

Answer
As Steve suggested in the comments, I checked the phase on the 6th element.
> Arg(t.out)[6]
[1] 0.7853982
> Arg(t.in)[1]
[1] 0.7853982
So the only element that has a magnitude (at least one order of magnitude higher than the EPS) does have the phase that I expected.
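The same check can be reproduced outside R. The following Python sketch uses a naive DFT in place of fft() and builds the frequency axis in the same order as the question (non-negative frequencies first, then negative ones); `dft` and `fourier_shift` are names made up for this example.

```python
import cmath

def dft(x, sign=-1):
    """Naive DFT; sign=-1 is the forward transform, sign=+1 the unnormalised inverse."""
    n = len(x)
    return [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * t * k / n) for t in range(n))
            for k in range(n)]

def fourier_shift(signal, n_shift):
    """Shift a complex signal by n_shift samples via the Fourier shift theorem."""
    n = len(signal)
    # Frequency indices in DFT output order: 0, 1, ..., then the negative ones.
    k = [i if i < (n + 1) // 2 else i - n for i in range(n)]
    spectrum = dft(signal)
    shifted = [s * cmath.exp(-1j * n_shift * 2 * cmath.pi * ki / n)
               for s, ki in zip(spectrum, k)]
    return [v / n for v in dft(shifted, sign=+1)]

t_in = [1 + 1j] + [0j] * 9          # impulse with phase pi/4, as in the question
t_out = fourier_shift(t_in, 5)
print(abs(t_out[5]), cmath.phase(t_out[5]))
```

The sixth element (index 5 here, since Python counts from zero) carries the magnitude and the phase of the original impulse; the remaining elements are at rounding level, which is why their phase looks arbitrary.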
TL;DR The result from the original approach in the question was already correct, we see the Gibbs Phenomenon sliding by.
Just discard low magnitude elements?
If the phase of elements that should be zero ever becomes a problem, I can run t.out[Mod(t.out)<epsfactor*.Machine$double.eps] = 0, where in this case epsfactor has to be 10 to get rid of the '0' magnitude elements.
Adding that line before plotting gives the following result, which is what I expected to get beforehand. However, the 'scrambled' phase might actually be accurate in most cases as I'll explain below.
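For readers not using R, the thresholding step might look like the sketch below in Python. The function name `zero_small` and the sample values are assumptions for illustration; the threshold is epsfactor times machine epsilon, mirroring the R line.

```python
import sys

def zero_small(values, epsfactor=10):
    """Replace elements whose magnitude is below epsfactor * machine epsilon by 0."""
    threshold = epsfactor * sys.float_info.epsilon
    return [0j if abs(v) < threshold else v for v in values]

# Made-up sample: two rounding-level entries around one genuine element.
noisy = [1e-17 + 2e-17j, 1 + 1j, -3e-16j]
print(zero_small(noisy))
```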
The original result really was correct
Just setting low magnitude elements to 0 does not make the phase of the shifted signal more intuitive however. This is a plot where I apply a 4.5 sample shift, the phase is still 'scrambled'.
Applying Fourier shift is equivalent to downsampling a shifted Fourier interpolation
It occurred to me that applying a phase shift of a non-integer number of elements is equivalent to Fourier interpolating the signal and then downsampling the interpolated signal at points between the original elements. Since the vector I used as input is an impulse function, the Fourier interpolated signal is just not well behaved. The signal after applying the Fourier phase shift theorem can then be expected to have exactly the phase that the Fourier interpolated signal has, as seen below.
Gibbs Ringing
It's just at the discontinuities that the phase is not well behaved and that small rounding errors might cause large errors in the reconstructed phase. So this is not really related to low magnitude but to the not well defined Fourier transform of the input vector. This is called Gibbs ringing; I could use low-pass filtering with a Gaussian filter to decrease it.
Questions related to fourier interpolation and phase shift
symbolic approach in R to estimate fourier transform error
non integer signal shift by use of linear interpolation
downsampling complex signal
fourier interpolation application
estimating sub-sample shift between two signals using fourier transforms
estimating sub-sample shift between two signals without interpolation

Seeding square roots on FPGA in VHDL for Fixed Point

I'm attempting to create a fixed-point square root function for a Xilinx FPGA (hence real types are out, and David Bishop's ieee_proposed library is also unsupported for XST synthesis).
I've settled on a Newton-Raphson method to calculate the reciprocal square root (as it involves fewer divisions).
One of the remaining dilemmas I have is how to generate the initial seed. I looked at the Fast Inverse Square Root, but it only appears to work for floating point arithmetic.
My best thought at the moment is to take the length of the input value (i.e. find the index of the most significant non-zero bit), halve it crudely and use that as the power for a seed.
I wrote a short test script to quickly check the accuracy (it's in Matlab, but that's just so I could plot a graph...)
x = 1:2^24;
gen_result = zeros(1,length(x));
seed_vals = zeros(1,length(x));
for i = 1:length(x)
    result = 2^-ceil(log2(x(i))/2); %effectively creates seed value from top bit index
    seed_vals(i) = 1/result; %Store seed value
    for j = 1:6
        result = result*(1.5-0.5*x(i)*result^2); %reciprocal root
    end
    gen_result(i) = 1/result; %single division at the end
end
Unsurprisingly, the seed becomes wildly inaccurate each time the input crosses into the next seed step, and the error grows as the magnitude of the input increases. As a graph this can be seen as:
The red line is the value of the seed and, as can be seen, it increases in powers of 2.
My question is very simple: are there any other simple methods I could use to generate a seed value for fixed-point square roots in VHDL, ideally ones which don't cause ever-increasing amounts of inaccuracy (and hence require more iterations each time the input increases in size)?
Any other incidental advice on how to approach finding fixed-point square roots in VHDL would be gratefully received!

I realize this is an old question but I did end up here and this was kind of useful so I want to add my bit.
Assuming your Xilinx chip has an embedded multiplier, you could consider this approach to help get a better starting seed. The basic premise is to convert the input integer to fixed point with all fraction bits, and then use the embedded multiplier to scale half of your initial seed value by 0.X (which in hindsight is probably what people mean when they say "normalize to the region [0.5..1)", now that I think about it). It's basically piecewise linear interpolation of your existing seed method. The steps below should translate relatively easily to RTL, as they're just bit-shifts, adds, and one unsigned multiply.
1) Begin with your existing seed value (e.g. for x=9e6, you would generate s=4096 as the seed for your first guess with your "crude halving" method)
2) Right-shift the existing seed value by 1 to get the previous seed value (s_half = s >> 1 = 2048)
3) Left-shift the input until the most significant bit is a 1. In the event you are taking the square root of 32-bit ints, x_scale would then be 2304000000 = 0x89544000
4) Slice the upper e.g. 18 bits off of x_scale and multiply by an 18-bit version of s_half (I suggest 18 because I happen to know some Xilinx chips have embedded 18x18 multipliers). For this case, the result, x_scale(31 downto 14) = 140625 = 0x22551.
At least, that's what the multiplier thinks - we're going to use fixed point so that it's actually 0b0.100010010101010001 = 0.53644 instead of 140625.
The result of this multiplication will be s_scale = s_half * x_scale(31 downto 14) = 2048 * 140625 = 288000000, but this output is in 18.18 format (18 integer bits, 18 fraction bits). Take the upper 18 bits, and you get s_scale(35 downto 18) = 1098.
5) Add the upper 18 bits of s_scale to s_half to get your improved seed, in this case s_improved = 1098+2048 = 3146.
Now you can do a few iterations of Newton-Raphson with this seed. For x=9e6, your crude halving approach would give an initial seed of 4096, the fixed-point scale outlined above gives you 3146, and the actual sqrt(9e6) is 3000. This value is half-way between your seed steps, and my napkin math suggests it saved about 3 iterations of Newton-Raphson.
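The five steps above can be modelled with integer shifts and one multiply. The Python sketch below is only a software model for checking the arithmetic, not RTL; the function names and the `width`/`mult_bits` parameters are assumptions for illustration, and `math.log2` stands in for finding the index of the most significant bit.

```python
import math

def crude_seed(x):
    """Power-of-two seed 2**ceil(log2(x)/2), as in the question's test script."""
    return 1 << math.ceil(math.log2(x) / 2)

def improved_seed(x, width=32, mult_bits=18):
    """Piecewise-linear refinement of the crude seed (steps 1 to 5 of the answer)."""
    s_half = crude_seed(x) >> 1                 # step 2: previous seed value
    x_scale = x << (width - x.bit_length())     # step 3: normalise so the MSB is 1
    frac = x_scale >> (width - mult_bits)       # step 4: upper 18 bits, read as 0.X
    s_scale = s_half * frac                     # 18x18 multiply
    return s_half + (s_scale >> mult_bits)      # step 5: add the integer part

def rsqrt_newton(x, seed, iterations):
    """Newton-Raphson on the reciprocal root, with one division at the end."""
    r = 1.0 / seed
    for _ in range(iterations):
        r = r * (1.5 - 0.5 * x * r * r)
    return 1.0 / r

x = 9_000_000
print(crude_seed(x), improved_seed(x), rsqrt_newton(x, improved_seed(x), 6))
```

For x = 9e6 this reproduces the seeds 4096 and 3146 from the worked example. The Newton-Raphson loop uses floating point here purely to check convergence toward 3000.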

Does anyone know how to run 4-dimensional or larger problems using kde in the ks package in R?

I am trying to get the kde for a four-dimensional dataset using the kde function in the ks package, but have not been successful. I am running the following code:
kde(m, h=delta, gridsize = n.grid)
where m is an n x 4 matrix. I have n features in my dataset with 4 different variables. I have tried running this function with an n x 3 matrix and it works great, returning a 3-dimensional array kernel density estimate. When I run it with the four-dimensional data matrix, however, it says I must supply the evaluation points (which is odd, since the documentation says I only need to do that for d > 4).
So I ended up creating a new evaluation point matrix that is n.grid x 4 in size, with n.grid equally spaced points from the original data matrix m. However, when I run this, it returns a 1-dimensional array of estimates instead of a 4-dimensional array.
Does anyone know how to run kde properly for dimensions greater than 3?
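One thing that may be worth checking, independent of the ks API: a full 4-dimensional grid with n.grid points per axis has n.grid^4 evaluation points, not n.grid. The generic Python sketch below (not ks code; all names are invented for illustration) builds such a grid as a flat list of points and shows how a flat vector of per-point estimates could be indexed as a 4-dimensional array. Whether kde returns its estimates in this order is an assumption that would need to be checked against the ks documentation.

```python
from itertools import product

def linspace(lo, hi, n):
    """n equally spaced values from lo to hi inclusive."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]

def full_grid(axes):
    """Cartesian product of per-axis grids; the last axis varies fastest."""
    return [list(point) for point in product(*axes)]

n_grid = 3
axes = [linspace(0.0, 1.0, n_grid) for _ in range(4)]
points = full_grid(axes)
print(len(points))  # n_grid ** 4 rows, each with 4 coordinates

def flat_index(i, j, k, l, n=n_grid):
    """Position of grid cell (i, j, k, l) in the flat list of points."""
    return ((i * n + j) * n + k) * n + l
```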

How can I determine if my convolution is separable?

What makes a convolution kernel separable? How would I be able to tell what those separable parts are, in order to do two 1D convolutions instead of a 2D convolution?
Thanks

If the 2D filter kernel has a rank of 1 then it is separable. You can test this in e.g. Matlab or Octave:
octave-3.2.3:1> sobel = [-1 0 1 ; -2 0 2 ; -1 0 1];
octave-3.2.3:2> rank(sobel)
ans = 1
octave-3.2.3:3>
See also: http://blogs.mathworks.com/steve/2006/11/28/separable-convolution-part-2/ - this covers using SVD (Singular Value Decomposition) to extract the two 1D kernels from a separable 2D kernel.
See also this question on DSP.stackexchange.com: Fast/efficient way to decompose separable integer 2D filter coefficients
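As a small illustration of the rank-1 criterion, the Python sketch below checks whether a kernel equals the outer product of a column and a row, and returns the two 1D kernels if it does. Note that it swaps in a simple pivot-based factorization rather than the SVD approach from the linked post; the function name `separate` and the tolerance are assumptions for this example.

```python
def separate(kernel, tol=1e-12):
    """Return (column, row) with kernel[r][c] == column[r] * row[c], or None."""
    rows, cols = len(kernel), len(kernel[0])
    # Use the largest-magnitude entry as pivot to avoid dividing by a tiny value.
    pr, pc = max(((r, c) for r in range(rows) for c in range(cols)),
                 key=lambda rc: abs(kernel[rc[0]][rc[1]]))
    pivot = kernel[pr][pc]
    if pivot == 0:
        return None  # all-zero kernel: rank 0, nothing to factor
    row = list(kernel[pr])
    column = [kernel[r][pc] / pivot for r in range(rows)]
    # A rank-1 kernel must equal the outer product of column and row everywhere.
    for r in range(rows):
        for c in range(cols):
            if abs(column[r] * row[c] - kernel[r][c]) > tol:
                return None
    return column, row

sobel = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
print(separate(sobel))
print(separate([[1, 0], [0, 1]]))
```

The scaling between the two factors is arbitrary: any constant moved from one to the other gives an equally valid pair of 1D kernels.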

You can also split the matrix into symmetric and skew parts and separate each part, which can be effective for larger 2D convolutions.
