Parallel derivatives of multidimensional real data with FFTW-MPI

I would like to build a 2D MPI-parallel spectral differentiation code.
The following piece of code seems to work fine for the x-derivative, both in serial and in parallel:
alloc_local = fftw_mpi_local_size_2d(N0, N1, MPI_COMM_WORLD, &local_n0, &local_0_start);
// buffers must exist before the plans are created, and a c2c transform
// needs alloc_local *complex* elements per buffer
ptr_u  = fftw_alloc_complex(alloc_local);
ptr_ux = fftw_alloc_complex(alloc_local);
uhat   = fftw_alloc_complex(alloc_local);
uhat_x = fftw_alloc_complex(alloc_local);
fplan_u = fftw_mpi_plan_dft_2d(N0, N1, ptr_u, uhat, MPI_COMM_WORLD, FFTW_FORWARD, FFTW_ESTIMATE);
bplan_x = fftw_mpi_plan_dft_2d(N0, N1, uhat_x, ptr_ux, MPI_COMM_WORLD, FFTW_BACKWARD, FFTW_ESTIMATE);
fftw_execute(fplan_u);
// first renormalize the transform...
for (int j = 0; j < local_n0; j++)
    for (int i = 0; i < N1; i++)
        uhat[j*N1+i] /= (double)(N1*local_n0);
// then compute the x-derivative
for (int j = 0; j < local_n0; j++)
    for (int i = 0; i < N1/2; i++)
        uhat_x[j*N1+i] = I*(double)i*uhat[j*N1+i]/(double)N0;
for (int j = 0; j < local_n0; j++)
    for (int i = N1/2; i < N1; i++)
        uhat_x[j*N1+i] = -I*(double)(N1-i)*uhat[j*N1+i]/(double)N0;
fftw_execute(bplan_x);
However, the following code for the y-derivative only works in serial, not in parallel.
bplan_y = fftw_mpi_plan_dft_2d(N0, N1, uhat_y, ptr_uy, MPI_COMM_WORLD, FFTW_BACKWARD, FFTW_ESTIMATE);
fftw_execute(fplan_u);
for (int j = 0; j < local_n0; j++)
    for (int i = 0; i < N1; i++)
        uhat[j*N1+i] /= (double)(N1*local_n0);
if (size == 1) { // in the serial case do this
    for (int j = 0; j < local_n0/2; j++)
        for (int i = 0; i < N1; i++)
            uhat_y[j*N1+i] = I*(double)j*uhat[j*N1+i]/(double)N0;
    for (int j = local_n0/2; j < local_n0; j++)
        for (int i = 0; i < N1; i++)
            uhat_y[j*N1+i] = -I*(double)(N1-j)*uhat[j*N1+i]/(double)N0;
} else { // in the parallel case instead do this
    for (int j = 0; j < local_n0; j++)
        for (int i = 0; i < N1; i++)
            if (rank < (size/2))
                uhat_y[j*N1+i] = I*(double)(j+local_n0*rank)*uhat[j*N1+i]/(double)N0;
            else
                uhat_y[j*N1+i] = -I*(double)(N1-(j+local_n0*rank))*uhat[j*N1+i]/(double)N0;
}
fftw_execute(bplan_y);
where the conditional expression if (rank < (size/2)) is used for dealiasing purposes (because I am doing FFTs of real data). The dealiasing appears to be coded correctly in serial for the y-derivative, and also in parallel for the x-derivative.
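For reference, the wavenumber mapping that the x-derivative loops implement is the standard DFT convention (my notation):

$$\widehat{\partial_x u}_{j,i} = \mathrm{i}\,k_i\,\hat{u}_{j,i}, \qquad k_i = \begin{cases} i, & 0 \le i < N_1/2,\\ i - N_1, & N_1/2 \le i < N_1,\end{cases}$$

so the factor -I*(N1-i) in the second loop is just I*(i-N1), i.e. the negative wavenumbers.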
I know that FFTW provides a series of functions specific to this case (fftw_mpi_plan_dft_r2c and ..._c2r), but I would like to use the complex-to-complex routines anyway, because in the future I might be working with complex data.
Of course, using fftw_mpi_plan_transpose to transpose the data, take an x-derivative and then transpose back works in parallel, but I would like to avoid this, because in the future I plan to solve diffusive-like problems implicitly, e.g.
uhat[i*N0+j] = fhat[i*N0+j]/(pow((double)i,2)+pow((double)j,2));
and this would not be possible with the transpose plans.
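(For context: that line solves a Poisson-type problem mode by mode, $\hat{u}_{k,l} = \hat{f}_{k,l}/(k^2+l^2)$ up to sign conventions, so the code has to know the global wavenumber pair $(k,l)$ of every locally stored mode.)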
Edit: strangely enough, if I run the above code with 3 MPI processes, e.g.:
my_rank: 0 alloc_local: 87552 local_n0: 171 local_0_start: 0
my_rank: 1 alloc_local: 87552 local_n0: 171 local_0_start: 171
my_rank: 2 alloc_local: 87040 local_n0: 170 local_0_start: 342
I get results that are still wrong, but look much closer to the correct solution than what I obtain with 2, 4, or 8 processes.
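For what it's worth, the only rank-dependent piece of the y-loops should be the conversion from the local row index j to the global row index local_0_start + j (the value that fftw_mpi_local_size_2d already returns). Below is a minimal, untested sketch of what I would expect a rank-independent version to look like, assuming the sign split belongs at the global index N0/2 and that the y-wavenumber should be mapped with N0 rather than N1 (in my runs N0 == N1, so the serial code cannot tell the two apart):
// sketch (untested): use the global row index everywhere, so the same
// loop works for any number of processes
for (int j = 0; j < local_n0; j++) {
    ptrdiff_t jg = local_0_start + j;   // global row index of local row j
    double ky = (jg < N0/2) ? (double)jg : (double)jg - (double)N0;
    for (int i = 0; i < N1; i++)
        uhat_y[j*N1+i] = I*ky*uhat[j*N1+i]/(double)N0;
}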

Related

Loop for minimum spanning tree does not work

We are three friends trying to solve the minimum spanning tree with conflicts problem using R. We read files in .txt format that contain, for example,
"1 2 5
2 4 6" etc., which indicates that from node 1 to node 2 there is an edge with weight 5, and
"1 2 2 4" etc., which indicates a conflict relationship between the edges 1-2 and 2-4. Next, we have to form an n x n conflict matrix in which we store 0 if there is no conflict between two edges and 1 if there is. For this purpose, we wrote a three-level for loop:
for(i in 1:dim(edges_read)[1]){
  for(k in 1:dim(edges_read)[1]){
    for(t in 1:dim(conflicts)[1]){
      if(all(conflicts[t,] == c(edges_read[i,1], edges_read[i,2],
                                edges_read[k,1], edges_read[k,2]))){
        conflictmatrix[i,k] <- 1
      }
    }
  }
}
However, R never finishes, and these for loops take a very long time. How can we solve this? Thanks for any further assistance.
As you have discovered, for() loops are not fast in R. There are faster approaches, but it's hard to provide examples without data. Please use something like dput(edges_read) and dput(conflicts) to provide a small example of the data.
As one example, you could implement the for loops in the Rcpp package for speed improvement. Based on the code in your question, you could re-implement the 3-loop code sort of like this:
Rcpp::cppFunction('NumericMatrix MSTC_nxn_Cpp(NumericMatrix edges_read, NumericMatrix conflicts){
  int n = edges_read.nrow();          // number of edges: the output is n x n
  int m = conflicts.nrow();           // number of conflict pairs
  NumericMatrix conflictmatrix(n, n); // the output matrix
  for(int i = 0; i < n; i++){         // your i loop
    for(int k = 0; k < n; k++){       // your k loop
      double te = edges_read(i, 0);   // same as edges_read[i,1]
      double tf = edges_read(i, 1);   // same as edges_read[i,2]
      double tg = edges_read(k, 0);   // same as edges_read[k,1]
      double th = edges_read(k, 1);   // same as edges_read[k,2]
      NumericVector w = NumericVector::create(te, tf, tg, th);
      for(int t = 0; t < m; t++){     // your t loop
        NumericVector v = conflicts(t, _); // same as conflicts[t,]
        bool same = true;             // emulates all(conflicts[t,] == c(...))
        for(int p = 0; p < 4; p++){
          if(v[p] != w[p]){ same = false; break; }
        }
        if(same){ conflictmatrix(i, k) = 1; }
      }
    }
  }
  return conflictmatrix;              // your output
}')
#Then run the function
MSTC_nxn_Cpp(edges_read, conflicts)

How to find a pair of numbers in a list given a specific range?

The problem is as follows:
Given an array of N numbers, find two numbers in the array whose range (max - min) is exactly K.
for example:
input:
5 3
25 9 1 6 8
output:
9 6
So far, what I've tried is first sorting the array and then finding two matching numbers using a nested loop. However, because this is essentially a brute-force method, I don't think it is as efficient as other possible approaches.
import java.util.*;

public class Main {
    public static void main(String[] args) {
        Scanner sc = new Scanner(System.in);
        int n = sc.nextInt(), k = sc.nextInt();
        int[] arr = new int[n];
        for(int i = 0; i < n; i++) {
            arr[i] = sc.nextInt();
        }
        Arrays.sort(arr);
        int a = 0, b = 0; // initialized so the code compiles even if no pair is found
        for(int i = 0; i < n; i++) {
            for(int j = i; j < n; j++) {
                if(Math.max(arr[i], arr[j]) - Math.min(arr[i], arr[j]) == k) {
                    a = arr[i];
                    b = arr[j];
                }
            }
        }
        System.out.println(a + " " + b);
    }
}
It would be much appreciated if the solution were given in code (any language).
Here is code in Python 3 that solves your problem. This should be easy to understand, even if you do not know Python.
This routine uses your idea of sorting the array, but I use two variables, left and right (which define two places in the array), and each makes just one pass through the array. So, other than the sort, the time complexity of my code is O(N); the sort makes the entire routine O(N log N). This is better than your code, which is O(N^2).

I never use the input value of N, since Python can easily handle the actual size of the array. I add a sentinel value to the end of the array to make the inner loops simpler and quicker. This involves another pass through the array to calculate the sentinel value, but it adds little to the running time. It is possible to reduce the number of array accesses at the cost of a few more lines of code--I'll leave that to you.

I added input prompts to aid my testing--you can remove those to make my results closer to what you seem to want. My code prints the larger of the two numbers first, then the smaller, which matches your sample output. But you may have wanted the order of the two numbers to match their order in the original, un-sorted array--if that is the case, I'll let you handle that as well (I see multiple ways to do it).
# Get input
N, K = [int(s) for s in input('Input N and K: ').split()]
arr = [int(s) for s in input('Input the array: ').split()]

arr.sort()
sentinel = max(arr) + K + 2  # larger than any valid value: stops both scans
arr.append(sentinel)
left = right = 0
while arr[right] < sentinel:
    # Move the right index until the difference is too large
    while arr[right] - arr[left] < K:
        right += 1
    # Move the left index until the difference is too small
    while arr[right] - arr[left] > K:
        left += 1
    # Check if we are done
    if arr[right] - arr[left] == K:
        print(arr[right], arr[left])
        break
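With the sample input from the question, a run looks like this (prompts included):
Input N and K: 5 3
Input the array: 25 9 1 6 8
9 6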

Do "if conditions" affect the performance of kernel execution in OpenCL?

I am running a filter on an image, performing a vertical pass followed by a horizontal pass. The function for this task is the same for both passes; only the argument values change. I call the function in a loop. To vectorize the operations in that function, I had to write separate function calls for the two passes, so the loop is now separate for the horizontal and vertical passes. An "if condition" was added because of this change, and I noticed that even though the computations are vectorized, the kernel takes longer to execute. I have run the code several times, and the average time taken by the vectorized code is higher than by the original code. Is it because of the "if condition" plugged into the code?
Original code:
global int* a;
for(int i = 0; i < 4; i++)
{
    filter(a + i, b, c);
}
Modified code:
global int* a;
if(offset == 1)
    for(int i = 0; i < 4; i++)
    {
        filter_vertical(a + i, b, c);
    }
else
    filter_horizontal(a, b, c);
Did you mean offset == 1?
if(offset = 1)
assigns 1 to offset, which is "extra latency" per thread and slower than the original. But apart from that, an "if" changes performance up or down depending on the pattern of the branch being "taken" or "not taken" across neighboring work-items. On GPU SIMD architectures, work-items execute in lockstep within a wavefront; when a work-item takes a different branch option than its neighbor pipeline, bubbles are inserted into the parallel SIMD pipelines. Those empty slots become occupation opportunities for other wavefronts' threads, but if they cannot be filled either, performance drops.
For more performance, change
for(int i = 0; i < 4; i++)
{
    filter_vertical(a + i, b, c);
}
to
filter_vertical(a    , b, c);
filter_vertical(a + 1, b, c);
filter_vertical(a + 2, b, c);
filter_vertical(a + 3, b, c);
This needs more instruction cache, but fewer branches, no loop-counter overhead, and fewer cycles.
If you can group the offset == 1 cases together, it will be faster, provided the memory access pattern doesn't suffer from it.
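One way to get that grouping is to move the branch out of the kernel entirely and enqueue two specialized kernels from the host, so every work-item in a given dispatch takes the same path. A rough sketch (assuming filter_vertical and filter_horizontal are device-side functions as in your code, and that a, b, c are __global int* buffers -- adjust the signatures to your real types):
// sketch: two specialized kernels instead of one branching kernel,
// so all work-items in a dispatch follow the same code path
__kernel void vertical_pass(__global int* a, __global int* b, __global int* c)
{
    filter_vertical(a    , b, c);   // unrolled, as above
    filter_vertical(a + 1, b, c);
    filter_vertical(a + 2, b, c);
    filter_vertical(a + 3, b, c);
}

__kernel void horizontal_pass(__global int* a, __global int* b, __global int* c)
{
    filter_horizontal(a, b, c);
}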

QTableView.setColumnHidden() alternative

Let's suppose I have a table with 30 columns and want to use QSqlTableModel/QTableView to show only 5 of them. Is there some other way besides 25 calls to the setColumnHidden() function?
model = QSqlTableModel(self)
model.setTable("table")
...
view = QTableView()
view.setModel(model)
...
#insane:
view.setColumnHidden(0, True)
view.setColumnHidden(4, True)
view.setColumnHidden(6, True)
view.setColumnHidden(7, True)
view.setColumnHidden(9, True)
view.setColumnHidden(10, True)
view.setColumnHidden(11, True)
...
view.setColumnHidden(29, True)
And what if the DBA adds some new columns that I don't want the user to see? Do I make changes to all the installed apps, adding new view.setColumnHidden(n, True) rows? Not very practical.
Maybe there is some Qt function like view.setColumnsShown([1,2,3,5,8]) that I'm not aware of?
You could define your own setColumnsShown() function:
def setColumnsShown(view, showcols):
    # hide every column that is not in showcols
    allcols = set(range(view.model().columnCount()))
    for col in allcols.difference(showcols):
        view.setColumnHidden(col, True)
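Called with the example from the question, this would be setColumnsShown(view, [1, 2, 3, 5, 8]).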
To handle the case where new columns may be added, you could connect the columnsInserted() signal of QSqlTableModel to a handler function that re-calls setColumnsShown.
I don't think such a function exists, but if you want to do this automatically you can write your own function or code snippet that works the way you want. Unfortunately I am not familiar with Qt+Python, but in C++ it can be done with the code below. It is just a couple of loops, so you should be able to write the same code with Python syntax. I also added comments to show exactly how this code works.
QList<int> list;                               // the list of columns to be shown
list << 1 << 2;                                // write the column numbers into the list
int c = ui->tableView->model()->columnCount(); // get the column count
for (int i = 0; i < c; ++i)
{
    ui->tableView->setColumnHidden(i, true);   // hide all columns
}
for (int i = 0; i < list.length(); ++i)
{
    if (list.at(i) < c)
        ui->tableView->setColumnHidden(list.at(i), false); // show the columns we want
}

issue with OpenCL stencil code

I have a problem with a 4-point stencil OpenCL code. The code runs, but the final 2D values are not symmetric as expected.
I suspect a problem with how values are updated in the kernel. Here is the kernel code:
// kernel code
const char *source ="__kernel void line_compute(const double diagx, const double diagy,\
const double weightx, const double weighty, const int size_x,\
__global double* tab_new, __global double* r)\
{ int iy = get_global_id(0)+1;\
int ix = get_global_id(1)+1;\
double new_value, cell, cell_n, cell_s, cell_w, cell_e;\
double rk;\
cell_s = tab_new[(iy+1)*(size_x+2)+ix];\
cell_n = tab_new[(iy-1)*(size_x+2)+ix];\
cell_e = tab_new[iy*(size_x+2)+(ix+1)];\
cell_w = tab_new[iy*(size_x+2)+(ix-1)];\
cell = tab_new[iy*(size_x+2)+ix];\
new_value = weighty *( cell_n + cell_s + cell*diagy)+\
weightx *( cell_e + cell_w + cell*diagx);\
rk = cell - new_value;\
r[iy*(size_x+2)+ix] = rk *rk;\
barrier(CLK_GLOBAL_MEM_FENCE);\
tab_new[iy*(size_x+2)+ix] = new_value;\
}";
cell_s, cell_n, cell_e and cell_w represent the 4 neighbour values of the 2D stencil. I compute new_value and write it back after a barrier(CLK_GLOBAL_MEM_FENCE).
However, it seems there are conflicts between different work-items. How can I fix this?
The barrier(CLK_GLOBAL_MEM_FENCE) you use will not synchronize all work-items as intended; it only synchronizes accesses within a single work-group.
Usually not all work-groups execute at the same time, because they are scheduled onto a small number of physical cores, so global synchronization is not possible within a kernel.
The solution is to write the output to a different buffer.
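A sketch of what that could look like (my naming: the kernel reads from tab_old and writes to tab_new, and the host swaps the two buffers between iterations):
__kernel void line_compute(const double diagx, const double diagy,
                           const double weightx, const double weighty,
                           const int size_x,
                           __global const double* tab_old, // read-only input
                           __global double* tab_new,       // output, never read here
                           __global double* r)
{
    int iy = get_global_id(0) + 1;
    int ix = get_global_id(1) + 1;
    // all reads come from tab_old, so no work-item can observe a
    // partially updated grid and no barrier is needed
    double cell_s = tab_old[(iy+1)*(size_x+2)+ix];
    double cell_n = tab_old[(iy-1)*(size_x+2)+ix];
    double cell_e = tab_old[iy*(size_x+2)+(ix+1)];
    double cell_w = tab_old[iy*(size_x+2)+(ix-1)];
    double cell   = tab_old[iy*(size_x+2)+ix];
    double new_value = weighty*(cell_n + cell_s + cell*diagy)
                     + weightx*(cell_e + cell_w + cell*diagx);
    double rk = cell - new_value;
    r[iy*(size_x+2)+ix] = rk*rk;
    tab_new[iy*(size_x+2)+ix] = new_value;
}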
