Border exchange in cartesian topology mpi


I would like to perform a border exchange in my MPI program.
I have a structure that looks like this:
cell** local_petri_A;
local_petri_A = calloc(p_local_petri_x_dim,sizeof(*local_petri_A));
for(int i = 0; i < p_local_petri_x_dim ; i ++){
local_petri_A[i] = calloc(p_local_petri_y_dim,sizeof(**local_petri_A));
}
where cell is defined as:
typedef struct {
int color;
int strength;
} cell;
I would like to have an exchange scheme like the one in this picture:
So I put my program in a Cartesian topology and first defined MPI types to perform the exchange:
void create_types(){
////////////////////////////////
////////////////////////////////
// cell type
const int nitems=2;
int blocklengths[2] = {1,1};
MPI_Datatype types[2] = {MPI_INT, MPI_INT};
MPI_Aint offsets[2];
offsets[0] = offsetof(cell, color);
offsets[1] = offsetof(cell, strength);
MPI_Type_create_struct(nitems, blocklengths, offsets, types, &mpi_cell_t);
MPI_Type_commit(&mpi_cell_t);
////////////////////////////////
///////////////////////////////
MPI_Type_vector ( x_inside , 1 , 1 , mpi_cell_t , & border_row_t );
MPI_Type_commit ( & border_row_t );
/*we put the stride to x_dim to get only one column*/
MPI_Type_vector ( y_inside , 1 , p_local_petri_x_dim , MPI_DOUBLE , & border_col_t );
MPI_Type_commit ( & border_col_t );
}
and then finally tried to perform the exchange to the south and north:
/*send to the north receive from the south */
MPI_Sendrecv ( & local_petri_A[0][1] , 1 , border_row_t , p_north , TAG_EXCHANGE ,& local_petri_A [0][ p_local_petri_y_dim -1] , 1 , border_row_t , p_south , TAG_EXCHANGE ,cart_comm , MPI_STATUS_IGNORE );
/*send to the south receive from the north */
MPI_Sendrecv ( & local_petri_A[0][ p_local_petri_y_dim -2] , 1 , border_row_t , p_south , TAG_EXCHANGE ,& local_petri_A [0][0] , 1 , border_row_t , p_north , TAG_EXCHANGE ,cart_comm , MPI_STATUS_IGNORE );
NB: here x_inside and y_inside are the "inside" dimensions of the array (without the ghost cells), and p_local_petri_x_dim and p_local_petri_y_dim are the dimensions of the full array.
Then I get this error:
Is there something that I have done wrong?
Thank you in advance for your help.

The issue is in the way you allocate your 2D array.
You allocate an array of arrays, so two consecutive rows are unlikely to sit in consecutive memory. As a consequence, your derived datatype (ddt) for a column does not match the layout of your 2D array.
You can refer to the question "MPI_Bcast a dynamic 2d array" to see how to allocate your array correctly.
As a side note, Fortran does not have this kind of problem, so if it is an option, it would make your life easier.
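To illustrate the allocation point, here is a minimal sketch of one common way to get a contiguous 2D array in C while keeping the grid[i][j] syntax. The names alloc_grid and free_grid are hypothetical helpers, not part of the question's code, and no MPI calls are involved here.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    int color;
    int strength;
} cell;

/* Allocate an nx-by-ny grid whose cells all live in ONE contiguous block.
 * grid[i] points at row i inside that block, so grid[i][j] still works. */
cell **alloc_grid(int nx, int ny)
{
    assert(nx > 0 && ny > 0);
    cell *block = calloc((size_t)nx * (size_t)ny, sizeof *block);
    cell **grid = malloc((size_t)nx * sizeof *grid);
    if (block == NULL || grid == NULL) {
        free(block);
        free(grid);
        return NULL;
    }
    for (int i = 0; i < nx; i++)
        grid[i] = block + (size_t)i * (size_t)ny;
    return grid;
}

/* Release a grid created by alloc_grid. */
void free_grid(cell **grid)
{
    if (grid != NULL) {
        free(grid[0]); /* the contiguous block of cells */
        free(grid);    /* the table of row pointers */
    }
}
```

With this layout, element [i][j] sits at offset i*ny + j from &grid[0][0], so a slice with fixed j has a constant stride of ny cells, which is the kind of regular pattern a vector datatype can describe.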

Related

Recursive Fibonacci ARM Assembly

Edit: I have removed my code as I do not want to get caught for cheating on my assignment. I will repost the code once my assignment has been submitted. I apologize for posting it on Stack Overflow, I just had nowhere else to go for help. Please respect my edit to remove the code. I have tried deleting it, but it will not let me as I need to request it.
[MIPS code I was trying to follow][1]
[C Code I was trying to follow][2]
I am trying to convert recursive Fibonacci code into ARM assembly, but I am running into issues. When running my ARM assembly, the final value of the sum is 5 when it should be 2. It seems as though my code loops, but maybe one too many times. Any help would be much appreciated as I am new to this.

This is what your code is doing, with a test run below. This simply isn't the usual recursive Fibonacci.
#include <stdio.h>
void f ( int );
int R2 = 0;
int main () {
for ( int i = 0; i < 10; i++ ) {
R2 = 0;
f ( i );
printf ( "f ( %d ) = %d\n", i, R2 );
}
}
void f ( int n ) {
if ( n == 0 ) { R2 += 0; return; }
if ( n == 1 ) { R2 += 1; return; }
f ( n-1 );
f ( n-2 );
R2 += n-1;
}
f ( 0 ) = 0
f ( 1 ) = 1
f ( 2 ) = 2
f ( 3 ) = 5
f ( 4 ) = 10
f ( 5 ) = 19
f ( 6 ) = 34
f ( 7 ) = 59
f ( 8 ) = 100
f ( 9 ) = 167
Either you started with a broken Fibonacci algorithm, or you changed it substantially when going to assembly. I don't know how this can be fixed, except by following a working algorithm.

Note that in the C code the only addition is in fib(n-1) + fib(n-2). In particular, the special cases just do return 0; and return 1; respectively. Thus your "else add 0/1 to sum" lines are wrong. You should replace those additions with moves.
Also, you do MOV R1, R0 //copy fib(n-1), which is incorrect because fib(n-1) has been returned in R2, not R0. That should be MOV R1, R2.
With these changes the code works, even if it is slightly non-standard.
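For comparison, this is a sketch of the conventional recursive definition in C. The C code linked in the question is not reproduced on this page, so this is an assumption about its shape rather than a copy of it.

```c
/* Conventional recursive Fibonacci: the base cases return constants,
 * and the only addition combines the two recursive results. */
unsigned fib(unsigned n)
{
    if (n == 0)
        return 0;
    if (n == 1)
        return 1;
    return fib(n - 1) + fib(n - 2);
}
```

With this definition fib(3) is 2, which matches the value the question expects, whereas the reconstructed f(3) above yields 5.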

Is there an Intel SIMD comparison function that returns 0 or 1 instead of 0 or 0xFFFFFFFF?

I'm currently using the Intel SIMD function _mm_cmplt_ps( V1, V2 ).
The function returns a vector containing the result of each component test, based on whether each V1 component is less than the corresponding V2 component. For example:
XMVECTOR Result;
Result.x = (V1.x < V2.x) ? 0xFFFFFFFF : 0;
Result.y = (V1.y < V2.y) ? 0xFFFFFFFF : 0;
Result.z = (V1.z < V2.z) ? 0xFFFFFFFF : 0;
Result.w = (V1.w < V2.w) ? 0xFFFFFFFF : 0;
return Result;
However, is there a function like this that returns 1 or 0 instead? I am looking for a function that uses SIMD directly, with no workarounds, because it is supposed to be optimized and vectorized.

You can write that function yourself. It’s only 2 instructions:
// 1.0 for lanes where a < b, zero otherwise
inline __m128 compareLessThan_01( __m128 a, __m128 b )
{
const __m128 cmp = _mm_cmplt_ps( a, b );
return _mm_and_ps( cmp, _mm_set1_ps( 1.0f ) );
}
Here’s a more generic version which returns either of the 2 values. It requires SSE 4.1, which is almost universally available by now (97.94% of users). If you have to support SSE2-only hardware, emulate it with _mm_and_ps, _mm_andnot_ps, and _mm_or_ps.
// y for lanes where a < b, x otherwise
inline __m128 compareLessThan_xy( __m128 a, __m128 b, float x, float y )
{
const __m128 cmp = _mm_cmplt_ps( a, b );
return _mm_blendv_ps( _mm_set1_ps( x ), _mm_set1_ps( y ), cmp );
}
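The SSE2-only emulation mentioned above could look roughly like the following sketch. The function name compare_less_than_xy_sse is hypothetical; the intrinsics are the ones named in the answer.

```c
#include <xmmintrin.h> /* _mm_cmplt_ps, _mm_and_ps, _mm_andnot_ps, _mm_or_ps */

/* y for lanes where a < b, x otherwise; a stand-in for _mm_blendv_ps
 * that does not need SSE 4.1. */
static inline __m128 compare_less_than_xy_sse(__m128 a, __m128 b, float x, float y)
{
    const __m128 cmp = _mm_cmplt_ps(a, b);                    /* all-ones where a < b */
    const __m128 take_y = _mm_and_ps(cmp, _mm_set1_ps(y));    /* y in passing lanes   */
    const __m128 take_x = _mm_andnot_ps(cmp, _mm_set1_ps(x)); /* x in failing lanes   */
    return _mm_or_ps(take_y, take_x);
}
```

Note that _mm_andnot_ps negates its first argument, so the mask goes first and the value second.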

The DirectXMath no-intrinsics version of _mm_cmplt_ps is actually:
XMVECTORU32 Control = { { {
(V1.vector4_f32[0] < V2.vector4_f32[0]) ? 0xFFFFFFFF : 0,
(V1.vector4_f32[1] < V2.vector4_f32[1]) ? 0xFFFFFFFF : 0,
(V1.vector4_f32[2] < V2.vector4_f32[2]) ? 0xFFFFFFFF : 0,
(V1.vector4_f32[3] < V2.vector4_f32[3]) ? 0xFFFFFFFF : 0
} } };
return Control.v;
XMVECTOR is the same as __m128, which is 4 floats, so it needs the XMVECTORU32 alias to make sure it is writing integers.
I use _mm_movemask_ps for the "Control Register" version of DirectXMath functions. It just collects the top-most bit of each SIMD value.
int result = _mm_movemask_ps(_mm_cmplt_ps( V1, V2 ));
The lower nibble of result will contain the bit pattern: a 1 bit for each value that passes the test, and a 0 bit for each value that fails it. This can be used to reconstruct 1 vs. 0.
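As a sketch of that reconstruction, the helper below (a hypothetical name, not a DirectXMath function) expands the mask into four integer flags, with bit i of the mask corresponding to lane i.

```c
#include <xmmintrin.h> /* _mm_cmplt_ps, _mm_movemask_ps */

/* Writes 1 into out[i] where lane i of v1 is less than lane i of v2, else 0. */
static inline void compare_less_than_flags(__m128 v1, __m128 v2, int out[4])
{
    const int mask = _mm_movemask_ps(_mm_cmplt_ps(v1, v2));
    for (int i = 0; i < 4; i++)
        out[i] = (mask >> i) & 1;
}
```

This leaves the SIMD domain, so it is more suitable for branching on the result than for further vector arithmetic.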

Maximum no. of nodes reachable from a given source in a Graph

I have a directed graph in which each node has exactly one edge, to one other node. I have to find the node from which the maximum number of nodes can be reached.
I tried to do it using a DFS, storing the information in the array sum[], but I get segmentation faults.
The graph is represented as an adjacency list of pair< int, int >. First is the destination, and second is the weight. In this problem weight = 0.
My dfs implementation:
int sum[V]; // declared globally, initially set to 0
bool visited[V]; // declared globally, initially set to false
int dfs( int s ){
visited[s]= true;
int t= 0;
for( int i= 0; i< AdjList.size(); ++i ){
pii v= AdjList[s][i];
if( visited[v.first] )
return sum[v.first];
t+= 1 + dfs( v.first );
}
return sum[s]= t;
}
Inside main():
int maxi= -1; // maximum no. of nodes that can be reached
for( int i= 1; i<= V; ++i ){ // V is total no. of Vertices
int cc;
if( !visited[i] )
cc= g.dfs( i ) ;
if( cc > maxi ){
maxi= cc;
v= i;
}
}
And the graph is :
1 2 /* 1---->2 */
2 1 /* 2---->1 */
5 3 /* 5---->3 */
3 4 /* 3---->4 */
4 5 /* 4---->5 */
What is the problem in my DFS implementation?

You exit your dfs as soon as you find any node that was already reached, but I have the impression that you should run through all adjacent nodes. In your dfs function, change the if statement inside the for loop.
Instead of:
if(visited[v.first] )
return sum[v.first];
t+= 1 + dfs( v.first );
use:
if(!visited[v.first] ) {
t+= dfs( v.first );
}
and initialize t with 1 (not 0). This way you will find the size of the connected component. Because you are not interested in the node from which you started, you have to decrease the final result by one.
There is one more assumption that I made: your graph is undirected. If it is directed and you are only interested in solving the problem (not in the complexity), then just clear the visited and sum arrays after each single run of dfs in the main function.
EDIT
One more error in your code. Change:
for( int i= 0; i< AdjList.size(); ++i ){
into:
for( int i= 0; i< AdjList[s].size(); ++i ){
You should be able to trace the segfault by yourself. Use gdb, it's a really useful tool.
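For the directed case, where every node has exactly one outgoing edge, the "clear visited after each run" idea can be sketched in plain C as below. This is an illustration, not the asker's C++ code. It assumes the graph is stored as a successor array next[] with nodes numbered 1..n, and the names count_reachable, best_start and example_next are hypothetical. The count includes the start node.

```c
#include <assert.h>
#include <stdlib.h>

/* Sample graph from the question: 1->2, 2->1, 5->3, 3->4, 4->5.
 * Index 0 is unused so that nodes can be numbered 1..5. */
static const int example_next[6] = { 0, 2, 1, 4, 5, 3 };

/* Number of distinct nodes visited when walking from start (start included).
 * A fresh visited array is used for every start node. Returns -1 on
 * allocation failure. */
int count_reachable(const int *next, int n, int start)
{
    assert(start >= 1 && start <= n);
    char *visited = calloc((size_t)n + 1, 1);
    if (visited == NULL)
        return -1;
    int count = 0;
    for (int v = start; !visited[v]; v = next[v]) {
        visited[v] = 1;
        count++;
    }
    free(visited);
    return count;
}

/* First node (lowest number) from which the most nodes are visited. */
int best_start(const int *next, int n)
{
    int best = -1, best_count = -1;
    for (int s = 1; s <= n; s++) {
        int c = count_reachable(next, n, s);
        if (c > best_count) {
            best_count = c;
            best = s;
        }
    }
    return best;
}
```

This costs O(n) per start node, so O(n^2) overall, which matches the "not about complexity" caveat in the answer.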

OpenCL pass by reference different address space

Short story:
I have a function that passes its output variables by reference
void acum( float2 dW, float4 dFE, float2 *W, float4 *FE )
which is supposed to increment the variables *W and *FE by dW and dFE if some conditions are fulfilled.
I want to make this function general: the output variables can be either local or global.
acum( dW, dFE, &W , &FE ); // __local acum
acum( W, FE, &Wout[idOut], &FEout[idOut] ); // __global acum
When I try to compile it, I get the error
error: illegal implicit conversion between two pointers with different address spaces
Is it possible to make this work somehow? I was wondering whether I could use a macro instead of a function (but I'm not very familiar with macros in C).
Another possibility would probably be:
to use a function which returns a struct{Wnew, FEnew}
or (float8)(Wnew,FEnew,0,0), but I don't like it because it makes the code more confusing.
Naturally, I also don't want to just copy the body of "acum" all over the place (like manually inlining it :) )
Background (not necessary to read):
I'm trying to program some thermodynamic sampling using OpenCL. Because the statistical weight W = exp(-E/kT) can easily overflow a float (2^64) at low temperature, I made a composite datatype W = float2(W_digits, W_exponent) and defined functions such as "acum" to manipulate these numbers.
I try to minimize the number of global memory operations, so I let the work-items go over Vsurf rather than FEout, because I expect that only a few points in Vsurf will have considerable statistical weight, so the accumulation into FEout would be called only a few times for each work-item, while iterating over FEout would require sizeof(FEout)*sizeof(Vsurf) memory operations.
The whole code is here (any recommendations on how to make it more efficient are welcome):
// ===== function :: FF_vdW - Lenard-Jones Van Der Waals forcefield
float4 FF_vdW ( float3 R ){
const float C6 = 1.0;
const float C12 = 1.0;
float ir2 = 1.0/ dot( R, R );
float ir6 = ir2*ir2*ir2;
float ir12 = ir6*ir6;
float E6 = C6*ir6;
float E12 = C12*ir12;
return (float4)(
( 6*E6 - 12*E12 ) * ir2 * R
, E12 - E6
);}
// ===== function :: FF_spring - harmonic forcefield
float4 FF_spring( float3 R){
const float3 k = (float3)( 1.0, 1.0, 1.0 );
float3 F = k*R;
return (float4)( F,
0.5*dot(F,R)
);}
// ===== function :: EtoW - compute statistical weight
float2 EtoW( float EkT ){
float Wexp = floor( EkT);
return (float2)( exp(EkT - Wexp)
, Wexp
); }
// ===== procedure : addExpInplace -- acumulate F,E with statistical weight dW
void acum( float2 dW, float4 dFE, float2 *W, float4 *FE )
{
float dExp = dW.y - (*W).y; // log(dW)-log(W)
if(dExp>-22){ // e^22 = 2^32 , single_float = 2^+64
float fac = exp(dExp);
if (dExp<0){ // log(dW)<log(W)
dW.x *= fac;
(*FE) += dFE*dW.x;
(*W ).x += dW.x;
}else{ // log(dW)>log(W)
(*FE) = dFE + fac*(*FE);
(*W ).x = dW.x + fac*(*W).x;
(*W ).y = dW.y;
}
}
}
// ===== __kernel :: sampler
__kernel void sampler(
__global float * Vsurf, // in : surface potential (including vdW) // may be faster to precomputed endpoints positions like float8
__global float4 * FEout, // out : Fx,Fy,Fy, E
__global float2 * Wout, // out : W_digits, W_exponent
int3 nV ,
float3 dV ,
int3 nOut ,
int3 iOut0 , // shift of Fout in respect to Vsurf
int3 nCopy , // number of copies of
int3 nSample , // dimension of sampling in each dimension around R0 +nSample,-nSample
float3 RXe0 , // postion Xe relative to Tip
float EcutSurf ,
float EcutTip ,
float logWcut , // accumulate only when log(W) > logWcut
float kT // maximal energy which should be sampled
) {
int id = get_global_id(0); // loop over potential grid points
int idx = id/nV.x;
int3 iV = (int3)( idx/nV.y
, idx%nV.y
, id %nV.x );
float V = Vsurf[id];
float3 RXe = dV*iV;
if (V<EcutSurf){
// loop over tip position
for (int iz=0;iz<nOut.z;iz++ ){
for (int iy=0;iy<nOut.y;iy++ ){
for (int ix=0;ix<nOut.x;ix++ ){
int3 iTip = (int3)( iz, iy, ix );
float3 Rtip = dV*iTip;
float4 FE = 0;
float2 W = 0;
// loop over images of potential
for (int ix=0;ix<nCopy.x;ix++ ){
for (int iy=0;iy<nCopy.y;iy++ ){
float3 dR = RXe - Rtip;
float4 dFE = FF_vdW( dR );
dFE += FF_spring( dR - RXe0 );
dFE.w += V;
if( dFE.w < EcutTip ){
float2 dW = EtoW( - FE.w / kT );
acum( dW, dFE, &W , &FE ); // __local acum
}
}
}
if( W.y > logWcut ){ // accumulate force
int idOut = iOut0.x + iOut0.y*nOut.x + iOut0.z*nOut.x*nOut.y;
acum( W, FE, &Wout[idOut], &FEout[idOut] ); // __global acum
}
}}}}
}
I'm using PyOpenCL on Ubuntu 12.04 64-bit, but I think that has nothing to do with the problem.

OK, here's what's happening, from the OpenCL man pages:
http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/global.html
"If the type of an object is qualified by an address space name, the object is allocated in the specified address name; otherwise, the object is allocated in the generic address space"
...
"The generic address space name for arguments to a function in a program, or local variables of a function is __private. All function arguments shall be in the __private address space. "
So the arguments of your acum( ... ) function, including what its unqualified pointers W and FE point to, are in the __private address space.
So the compiler is correctly saying that
acum( ..&Wout[idOut], &FEout[idOut] )
is being called with &Wout[idOut] and &FEout[idOut] in the __global address space, when the function arguments must be in the __private address space.
The solution is to convert between global and private:
Create two private temporary variables to receive the results.
Call acum( ... ) with these variables.
Assign the temporary private values to the global values after you have called acum( ... ).
The code will look a bit messy.
Remember that on a GPU you have many address spaces, and you can't magically jump between them by casting. You have to move data explicitly between address spaces by assignment.
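A rough sketch of the temporary-variable pattern follows. Since acum increments its outputs, the temporary is first loaded from global memory. The function add_private is a hypothetical, much simplified stand-in for acum, and add_global is a hypothetical wrapper. The shim at the top is only there so the sketch can also be compiled and checked as plain C; a real OpenCL C compiler defines __OPENCL_VERSION__ and provides __global and float2 itself.

```c
/* Shim for plain C builds only (assumption: not needed under OpenCL C). */
#ifndef __OPENCL_VERSION__
#define __global
typedef struct { float x, y; } float2;
#endif

/* Hypothetical stand-in for acum(): updates *W, which lives in private memory. */
void add_private(float2 dW, float2 *W)
{
    (*W).x += dW.x;
    (*W).y += dW.y;
}

/* Wrapper for __global destinations: copy in, update privately, copy out. */
void add_global(float2 dW, __global float2 *Wout)
{
    float2 tmp = *Wout;    /* __global -> __private */
    add_private(dW, &tmp); /* same code path as the private case */
    *Wout = tmp;           /* __private -> __global */
}
```

In the kernel from the question, the second call would then go through a wrapper of this kind, while the first call keeps using the private version directly.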

Why does `clEnqueueReadBuffer` break while debugging openCL host code?

__kernel void kmeans_kernel(__global float* data, int points,
__global float* centroids, int clusters,
int dimensions)
{
//extern __shared__ float storage_space[];
__local float storage_space[];
__local int iterations;
__local float *means;
__local float *index;
__local float *mindist;
__local float *s_data;
iterations = points / ( get_global_size(0)) + 1;
if( get_local_id(0) == 0 ){
s_data[get_local_id(0)] = data[get_local_id(0)];
means=&storage_space[0];
index=&storage_space[get_local_size(0)];
mindist=&storage_space[2*get_local_size(0)];
}
//data = &data[blockDim.x * blockIdx.x + threadIdx.x];
data = &data[get_global_id(0)];
while( iterations )
{
mindist[get_local_id(0)] = 3.402823466e+38F;
index[get_local_id(0)] = 0;
for( short j = 0; j < clusters; j++ )
{
if( get_local_id(0) <= dimensions )
//means[get_local_id(0)] = centroids[get_local_id(0)+j*c_pitch];
means[get_local_id(0)] = centroids[get_local_id(0)+j];
//__syncthreads();
barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);
if( !(data[get_local_id(0)] - s_data[get_local_id(0)] > points - 1) )
{
float dist = distance_gpu_transpose( means, data, dimensions);
if( dist < mindist[get_local_id(0)] )
{
mindist[get_local_id(0)] = dist;
index[get_local_id(0)] = j;
}
}
//__syncthreads();
barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);
}
if( !(data[get_local_id(0)] - s_data[get_local_id(0)] > points - 1) )
data[0] = index[get_local_id(0)];
data += (get_global_size(0));
if( get_local_id(0) == 0 )
--iterations;
//__syncthreads();
barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);
}
}
I am using an AMD processor with an ATI 3200 graphics card. The card does not support OpenCL, but the rest of the code runs fine on the CPU itself.
The problem with my code is quite complicated for me this time. After the kernel has executed, I am unable to read the device memory variable using clEnqueueReadBuffer. While debugging, it breaks at this point and says,
Unhandled exception at 0x10001098 in CL_kmeans.exe: 0xC000001D: Illegal Instruction.
When I press break here, it gives
No symbols are loaded for any call stack frame. The source code cannot be displayed.
What might be the problem here? Please suggest a solution.
My kernel code is given above, and the statement I use to read the data is,
ret = clEnqueueReadBuffer(command_queue, gpu_data, CL_TRUE, 0,sizeof( float ) * instances->cols* 1 , instances->data, 0, NULL, NULL);
How can I check the possible errors here?
