I am a new user of MVAPICH2, and I ran into trouble when getting started with it.
First, I think I installed it successfully with:
./configure --disable-fortran --enable-cuda
make -j 4
make install
There were no errors.
But when I attempted to run the cpi example in the examples directory, I ran into the problem described below.
I can connect to gpu-cluster-1 and gpu-cluster-4 through ssh without a password.
I ran the cpi example separately on gpu-cluster-1 and on gpu-cluster-4 using mpirun_rsh, and it worked fine, like this:
run#gpu-cluster-1:~/mvapich2-2.1rc1/examples$ mpirun_rsh -ssh -np 2 gpu-cluster-1 gpu-cluster-1 ./cpi
Process 0 of 2 is on gpu-cluster-1
Process 1 of 2 is on gpu-cluster-1
pi is approximately 3.1415926544231318, Error is 0.0000000008333387
wall clock time = 0.000089
run#gpu-cluster-4:~/mvapich2-2.1rc1/examples$ mpirun_rsh -ssh -np 2 gpu-cluster-4 gpu-cluster-4 ./cpi
Process 0 of 2 is on gpu-cluster-4
Process 1 of 2 is on gpu-cluster-4
pi is approximately 3.1415926544231318, Error is 0.0000000008333387
wall clock time = 0.000134
I ran the cpi example on both gpu-cluster-1 and gpu-cluster-4 using mpiexec, and it worked fine, like this:
run#gpu-cluster-1:~/mvapich2-2.1rc1/examples$ mpiexec -np 2 -f hostfile ./cpi
Process 0 of 2 is on gpu-cluster-1
Process 1 of 2 is on gpu-cluster-4
pi is approximately 3.1415926544231318, Error is 0.0000000008333387
wall clock time = 0.000352
The content of the hostfile is "gpu-cluster-1\ngpu-cluster-4".
But when I ran the cpi example with mpirun_rsh on both gpu-cluster-1 and gpu-cluster-4, a problem came up:
run#gpu-cluster-1:~/mvapich2-2.1rc1/examples$ mpirun_rsh -ssh -np 2 -hostfile hostfile ./cpi
Process 1 of 2 is on gpu-cluster-4
-----------------It got stuck here and did not continue------------------------
After a long time, I pressed Ctrl+C, and it printed this:
^C[gpu-cluster-1:mpirun_rsh][signal_processor] Caught signal 2, killing job
run#gpu-cluster-1:~/mvapich2-2.1rc1/examples$ [gpu-cluster-4:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 6. MPI process died?
[gpu-cluster-4:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 6. MPI process died?
[gpu-cluster-4:mpispawn_1][handle_mt_peer] Error while reading PMI socket. MPI process died?
[gpu-cluster-4:mpispawn_1][report_error] connect() failed: Connection refused (111)
I have been confused by this for a long time; could you give me some help to resolve this problem?
Here is the code of the cpi example:
#include "mpi.h"
#include <stdio.h>
#include <math.h>
double f(double);
double f(double a)
{
return (4.0 / (1.0 + a*a));
}
int main(int argc,char *argv[])
{
int n, myid, numprocs, i;
double PI25DT = 3.141592653589793238462643;
double mypi, pi, h, sum, x;
double startwtime = 0.0, endwtime;
int namelen;
char processor_name[MPI_MAX_PROCESSOR_NAME];
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
MPI_Comm_rank(MPI_COMM_WORLD,&myid);
MPI_Get_processor_name(processor_name,&namelen);
fprintf(stdout,"Process %d of %d is on %s\n",
myid, numprocs, processor_name);
fflush(stdout);
n = 10000; /* default # of rectangles */
if (myid == 0)
startwtime = MPI_Wtime();
MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
h = 1.0 / (double) n;
sum = 0.0;
/* A slightly better approach starts from large i and works back */
for (i = myid + 1; i <= n; i += numprocs)
{
x = h * ((double)i - 0.5);
sum += f(x);
}
mypi = h * sum;
MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
if (myid == 0) {
endwtime = MPI_Wtime();
printf("pi is approximately %.16f, Error is %.16f\n",
pi, fabs(pi - PI25DT));
printf("wall clock time = %f\n", endwtime-startwtime);
fflush(stdout);
}
MPI_Finalize();
return 0;
}
I'm new to CUDA.
I want to copy and sum values in device_vector in the following ways. Are there more efficient ways (or functions provided by thrust) to implement these?
thrust::device_vector<int> device_vectorA(5);
thrust::device_vector<int> device_vectorB(20);
1. Copy device_vectorA 4 times into device_vectorB in the following way:
for (size_t i = 0; i < 4; i++)
{
    size_t offset_sta = i * 5;
    thrust::copy(device_vectorA.begin(), device_vectorA.end(), device_vectorB.begin() + offset_sta);
}
2. Sum every 5 values in device_vectorB and store the results in a new device_vector (size 4):
// Example
device_vectorB = 1 2 3 4 5 | 1 2 3 4 5 | 1 2 3 4 5 | 1 2 3 4 5
device_vectorC = 15 15 15 15
thrust::device_vector<int> device_vectorC(4);
for (size_t i = 0; i < 4; i++)
{
    size_t offset_sta = i * 5;
    size_t offset_end = (i + 1) * 5;   // exclusive end of the i-th group of 5
    device_vectorC[i] = thrust::reduce(device_vectorB.begin() + offset_sta, device_vectorB.begin() + offset_end, 0);
}
Are there more efficient ways (or functions provided by thrust) to implement these?
P.S. 1 and 2 are separate instances. For simplicity, these two instances just use the same vectors to illustrate.
Step 1 can be done with a single thrust::copy operation using a permutation iterator that uses a transform iterator working on a counting iterator to generate the copy indices "on the fly".
Step 2 is a partitioned reduction, using thrust::reduce_by_key. We can again use a transform iterator working on a counting iterator to create the flags array "on the fly".
Here is an example:
$ cat t2124.cu
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/copy.h>
#include <thrust/reduce.h>
#include <thrust/sequence.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <iostream>
using namespace thrust::placeholders;
const int As = 5;
const int Cs = 4;
const int Bs = As*Cs;
int main(){
thrust::device_vector<int> A(As);
thrust::device_vector<int> B(Bs);
thrust::device_vector<int> C(Cs);
thrust::sequence(A.begin(), A.end(), 1); // fill A with 1,2,3,4,5
thrust::copy_n(thrust::make_permutation_iterator(A.begin(), thrust::make_transform_iterator(thrust::counting_iterator<int>(0), _1%A.size())), B.size(), B.begin()); // step 1
auto my_flags_iterator = thrust::make_transform_iterator(thrust::counting_iterator<int>(0), _1/A.size());
thrust::reduce_by_key(my_flags_iterator, my_flags_iterator+B.size(), B.begin(), thrust::make_discard_iterator(), C.begin()); // step 2
thrust::host_vector<int> Ch = C;
thrust::copy_n(Ch.begin(), Ch.size(), std::ostream_iterator<int>(std::cout, ","));
std::cout << std::endl;
}
$ nvcc -o t2124 t2124.cu
$ compute-sanitizer ./t2124
========= COMPUTE-SANITIZER
15,15,15,15,
========= ERROR SUMMARY: 0 errors
$
If we wanted to, even the device vector A could be dispensed with; it could be created "on the fly" using a counting iterator. But presumably your inputs are not actually 1,2,3,4,5.
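To illustrate that last remark (assuming, for the sake of the sketch only, that the inputs really are 1,2,3,4,5), B could be filled straight from a transform iterator on a counting iterator, with no device vector A at all:
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/copy.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <iostream>
#include <iterator>
using namespace thrust::placeholders;
const int As = 5;
const int Cs = 4;
const int Bs = As*Cs;
int main(){
  thrust::device_vector<int> B(Bs);
  // each element of B is generated directly from its index: (i % As) + 1, i.e. 1,2,3,4,5 repeated Cs times
  thrust::copy_n(thrust::make_transform_iterator(thrust::counting_iterator<int>(0), (_1 % As) + 1), B.size(), B.begin());
  thrust::host_vector<int> Bh = B;
  thrust::copy_n(Bh.begin(), Bh.size(), std::ostream_iterator<int>(std::cout, ","));
  std::cout << std::endl;
}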
I'm trying to synchronize my processes at the beginning of the execution via MPI_Barrier, but my program blocks on it. Nevertheless, I can see that all the processes reach that line, because the instruction right before it prints to the screen.
int num_processes, packet_size, partner_rank;
double start_comm_time, end_comm_time, comm_time;
comm_time = 0;
if(argc==3) {
if (sscanf (argv[1], "%i", &num_processes) != 1) {
fprintf(stderr, "error - parameter 1 not an integer");
} else;
if (sscanf (argv[2], "%i", &packet_size) != 1) {
fprintf(stderr, "error - parameter 2 not an integer");
} else;
}
else {
printf("\n Usage: broadcast $count $packet_size");
return 0;
}
char buf_send[packet_size], buf_recv[packet_size];
buf_send[0] = 0;
// Initialize
MPI_Init(NULL, NULL);
int world_rank;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
int world_size;
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
printf("\n Comm size %d \n", world_size);
printf("\n Process %d before barrier \n", world_rank);
// Time MPI_Bcast
MPI_Barrier(MPI_COMM_WORLD);
printf("\n Process %d after barrier \n", world_rank);
start_comm_time = MPI_Wtime();
MPI_Bcast(buf_send, packet_size, MPI_CHAR, 0, MPI_COMM_WORLD);
printf("\n Process %d before second barrier \n", world_rank);
MPI_Barrier(MPI_COMM_WORLD);
end_comm_time = MPI_Wtime();
This is what I get in the printout:
Comm size 6
Process 0 before barrier
Comm size 6
Process 2 before barrier
Comm size 6
Process 3 before barrier
Comm size 6
Process 1 before barrier
Comm size 6
Process 4 before barrier
Comm size 6
Process 5 before barrier
I removed the timing and MPI_Bcast parts of your program (because it is incomplete), and it runs well on my machine, without getting blocked. So there could be something wrong with other parts of your code.
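For reference, a minimal, self-contained version of such a stripped-down test (a sketch, not your exact program; the argument parsing, timing and MPI_Bcast parts are left out) might look like this:
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    std::printf("\n Comm size %d \n", world_size);
    std::printf("\n Process %d before barrier \n", world_rank);

    MPI_Barrier(MPI_COMM_WORLD);   // every rank must reach this call

    std::printf("\n Process %d after barrier \n", world_rank);

    MPI_Finalize();
    return 0;
}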
Looking at your whole program, I don't think your code has any problems. The problem is probably your MPI environment. I also ran it on my machine using mpirun -n 6 ./xxx 6 1, and this is what I get:
Comm size 6
Process 0 before barrier
Comm size 6
Process 5 before barrier
Comm size 6
Process 2 before barrier
Comm size 6
Comm size 6
Process 1 before barrier
Process 3 before barrier
Comm size 6
Process 4 before barrier
Process 0 after barrier
Process 0 before second barrier
Process 1 after barrier
Process 1 before second barrier
Process 4 after barrier
Process 4 before second barrier
Process 5 after barrier
Process 5 before second barrier
Process 2 after barrier
Process 2 before second barrier
Process 3 after barrier
Process 3 before second barrier
I am new to OpenCL and I want to parallelise this prime sieve (Sieve of Atkin); the C++ code is here: https://www.geeksforgeeks.org/sieve-of-atkin/
Somehow I don't get good results out of it; after comparing, the CPU version is actually much faster. I tried to use an NDRange kernel to avoid writing the nested loops and hopefully increase performance, but when I pass a higher limit to the function, the GPU driver stops responding and the program crashes. Maybe my NDRange configuration is not right; could anyone help with it? I probably don't understand NDRange properly. Here is the info about my GPU:
CL_DEVICE_NAME: GeForce GT 740M
CL_DEVICE_VENDOR: NVIDIA Corporation
CL_DRIVER_VERSION: 397.31
CL_DEVICE_TYPE: CL_DEVICE_TYPE_GPU
CL_DEVICE_MAX_COMPUTE_UNITS: 2
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_ITEM_SIZES: 1024 / 1024 / 64
CL_DEVICE_MAX_WORK_GROUP_SIZE: 1024
CL_DEVICE_MAX_CLOCK_FREQUENCY: 1032 MHz
CL_DEVICE_ADDRESS_BITS: 32
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 512 MByte
CL_DEVICE_GLOBAL_MEM_SIZE: 2048 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT: no
CL_DEVICE_LOCAL_MEM_TYPE: local
CL_DEVICE_LOCAL_MEM_SIZE: 48 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte
CL_DEVICE_QUEUE_PROPERTIES:
-CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE
CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_IMAGE_SUPPORT: 1
CL_DEVICE_MAX_READ_IMAGE_ARGS: 256
CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 16
Here is my NDRange code:
queue.enqueueNDRangeKernel(add, cl::NDRange(1,1), cl::NDRange((limit * limit) -1, (limit * limit) -1 ), cl::NullRange,NULL, &event);
and my kernel code:
__kernel void sieveofAktin(const int limit, __global bool* sieve)
{
int x = get_global_id(0);
int y = get_global_id(1);
//printf("%d \n", x);
int n = (4 * x * x) + (y * y);
if (n <= limit && (n % 12 == 1 || n % 12 == 5))
sieve[n] ^= true;
n = (3 * x * x) + (y * y);
if (n <= limit && n % 12 == 7)
sieve[n] ^= true;
n = (3 * x * x) - (y * y);
if (x > y && n <= limit && n % 12 == 11)
sieve[n] ^= true;
for (int r = 5; r * r < limit; r++) {
if (sieve[r]) {
for (int i = r * r; i < limit; i += r * r)
sieve[i] = false;
}
}
}
You have lots of branching in that code, and I suspect that's what may be killing your performance on GPUs. Look at chapter 6 of the NVIDIA OpenCL Best Practices Guide for details on why this hurts performance.
I'm not sure how possible it is without looking closely at the algorithm, but ideally you want to rewrite the code to use as little branching as possible. Alternatively, you could look at other algorithms entirely.
As for the lock-up, I'd need to see more of your host code to know what is happening, but it's possible you're exceeding various limits of your platform/device. Are you checking for errors on every OpenCL function you call?
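For example, a sketch of that kind of checking with the C++ wrapper, reusing your queue, add and event objects (globalX and globalY are placeholders for whatever global size you end up using, not names from your code):
cl_int err = queue.enqueueNDRangeKernel(add, cl::NDRange(1, 1),
                                        cl::NDRange(globalX, globalY),
                                        cl::NullRange, NULL, &event);
if (err != CL_SUCCESS)
    std::cerr << "enqueueNDRangeKernel failed with error " << err << std::endl;
err = queue.finish();   // wait for the kernel so any runtime error surfaces here
if (err != CL_SUCCESS)
    std::cerr << "queue.finish() failed with error " << err << std::endl;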
Regardless of how good or bad your algorithm or implementation is - the driver should always respond. Non-response is quite possibly a bug. File a bug report at http://developer.nvidia.com/ .
I'm trying to compute a fractal picture in parallel with MPI.
I've divided my program into 4 parts:
Balance the number of rows treated by each rank
Perform the calculation on each row attributed to the rank
Send the number of rows and the rows to rank 0
Process the data on rank 0 (for the test, just print the ints)
Steps 1 and 2 are working, but when I try to send the rows to rank 0 the program stops and blocks. I know that MPI_Send can block, but there is no reason for that here.
Here are the first 2 steps:
Step 1:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
/* Include the MPI library for function calls */
#include <mpi.h>
/* Define tags for each MPI_Send()/MPI_Recv() pair so distinct messages can be
* sent */
#define OTHER_N_ROWS_TAG 0
#define OTHER_PIXELS_TAG 1
int main(int argc, char **argv) {
const int nRows = 513;
const int nCols = 513;
const int middleRow = 0.5 * (nRows - 1);
const int middleCol = 0.5 * (nCols - 1);
const double step = 0.00625;
const int depth = 100;
int pixels[nRows][nCols];
int row;
int col;
double xCoord;
double yCoord;
int i;
double x;
double y;
double tmp;
int myRank;
int nRanks;
int evenSplit;
int nRanksWith1Extra;
int myRow0;
int myNRows;
int rank;
int otherNRows;
int otherPixels[nRows][nCols];
/* Each rank sets up MPI */
MPI_Init(&argc, &argv);
/* Each rank determines its ID and the total number of ranks */
MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
MPI_Comm_size(MPI_COMM_WORLD, &nRanks);
printf("My rank is %d \n",myRank);
evenSplit = nRows / nRanks;
nRanksWith1Extra = nRows % nRanks;
/*Each rank determine the number of rows that he will have to perform (well balanced)*/
if (myRank < nRanksWith1Extra) {
myNRows = evenSplit + 1;
myRow0 = myRank * (evenSplit + 1);
}
else {
myNRows = evenSplit;
myRow0 = (nRanksWith1Extra * (evenSplit + 1)) +
((myRank - nRanksWith1Extra) * evenSplit);
}
/*__________________________________________________________________________________*/
Step 2:
/*_____________________PERFORM CALCUL ON EACH PIXEL________________________________ */
for (row = myRow0; row < myRow0 + myNRows; row++) {
/* Each rank loops over the columns in the given row */
for (col = 0; col < nCols; col++) {
/* Each rank sets the (x,y) coordinate for the pixel in the given row and
* column */
xCoord = (col - middleCol) * step;
yCoord = (row - middleRow) * step;
/* Each rank calculates the number of iterations for the pixel in the
* given row and column */
i = 0;
x = 0;
y = 0;
while ((x*x + y*y < 4) && (i < depth)) {
tmp = x*x - y*y + xCoord;
y = 2*x*y + yCoord;
x = tmp;
i++;
}
/* Each rank stores the number of iterations for the pixel in the given
* row and column. The initial row is subtracted from the current row
* so the array starts at 0 */
pixels[row - myRow0][col] = i;
}
//printf("one row performed by %d \n",myRank);
}
printf("work done by %d \n",myRank);
/*_________________________________________________________________________________*/
Step 3:
/*__________________________SEND DATA TO RANK 0____________________________________*/
/* Each rank (including Rank 0) sends its number of rows to Rank 0 so Rank 0
* can tell how many pixels to receive */
MPI_Send(&myNRows, 1, MPI_INT, 0, OTHER_N_ROWS_TAG, MPI_COMM_WORLD);
printf("test \n");
/* Each rank (including Rank 0) sends its pixels array to Rank 0 so Rank 0
* can print it */
MPI_Send(&pixels, sizeof(int)*myNRows * nCols, MPI_BYTE, 0, OTHER_PIXELS_TAG,
MPI_COMM_WORLD);
printf("enter ranking 0 \n");
/*_________________________________________________________________________________*/
Step 4:
/*________________________TREAT EACH ROW IN RANK 0_________________________________*/
/* Only Rank 0 prints so the output is in order */
if (myRank == 0) {
/* Rank 0 loops over each rank so it can receive that rank's messages */
for (rank = 0; rank < nRanks; rank++){
/* Rank 0 receives the number of rows from the given rank so it knows how
* many pixels to receive in the next message */
MPI_Recv(&otherNRows, 1, MPI_INT, rank, OTHER_N_ROWS_TAG,
MPI_COMM_WORLD, MPI_STATUS_IGNORE);
/* Rank 0 receives the pixels array from each of the other ranks
* (including itself) so it can print the number of iterations for each
* pixel */
MPI_Recv(&otherPixels, otherNRows * nCols, MPI_INT, rank,
OTHER_PIXELS_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
/* Rank 0 loops over the rows for the given rank */
for (row = 0; row < otherNRows; row++) {
/* Rank 0 loops over the columns within the given row */
for (col = 0; col < nCols; col++) {
/* Rank 0 prints the value of the pixel at the given row and column
* followed by a comma */
printf("%d,", otherPixels[row][col]);
}
/* In between rows, Rank 0 prints a newline character */
printf("\n");
}
}
}
/* All processes clean up the MPI environment */
MPI_Finalize();
return 0;
}
I would like to understand why it blocks; could you explain it to me?
I'm a new user of MPI and I would like to learn it, not just end up with a program that happens to work.
Thank you in advance.
MPI_Send is, by definition of the standard, a blocking operation.
Note that blocking means:
it does not return until the message data and envelope have been safely stored away so that the sender is free to modify the send buffer. The message might be copied directly into the matching receive buffer, or it might be copied into a temporary system buffer.
Trying to have a rank send messages to itself with MPI_Send and MPI_Recv is a deadlock.
The idiomatic pattern for your situation is to use the appropriate collective communication operations MPI_Gather and MPI_Gatherv.
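For illustration, here is a minimal, self-contained sketch of that pattern (the row split mirrors yours, but names such as recvCounts and displs are illustrative, not from your code):
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char **argv) {
    const int nRows = 513, nCols = 513;
    MPI_Init(&argc, &argv);
    int myRank, nRanks;
    MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
    MPI_Comm_size(MPI_COMM_WORLD, &nRanks);

    // Same balanced split as in the question
    const int evenSplit = nRows / nRanks;
    const int extra = nRows % nRanks;
    const int myNRows = evenSplit + (myRank < extra ? 1 : 0);

    // Each rank computes its own block of rows (dummy values here)
    std::vector<int> myPixels(myNRows * nCols, myRank);

    // Rank 0 knows the split rule, so it can compute counts and displacements
    std::vector<int> recvCounts, displs, allPixels;
    if (myRank == 0) {
        recvCounts.resize(nRanks);
        displs.resize(nRanks);
        allPixels.resize(nRows * nCols);
        int offset = 0;
        for (int r = 0; r < nRanks; ++r) {
            recvCounts[r] = (evenSplit + (r < extra ? 1 : 0)) * nCols;
            displs[r] = offset;
            offset += recvCounts[r];
        }
    }

    // One collective replaces the Send/Recv loop and cannot self-deadlock
    MPI_Gatherv(myPixels.data(), myNRows * nCols, MPI_INT,
                allPixels.data(), recvCounts.data(), displs.data(), MPI_INT,
                0, MPI_COMM_WORLD);

    if (myRank == 0)
        std::printf("rank 0 gathered %d x %d pixels\n", nRows, nCols);

    MPI_Finalize();
    return 0;
}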
Using blocking send/recv constructs when sending to rank 0 itself might cause a deadlock.
From the MPI 3.0 standard, Section 3.2.4:
Source = destination is allowed, that is, a process can send a message to itself. (However, it is unsafe to do so with the blocking send and receive operations described above,
since this may lead to deadlock. See Section 3.5.)
Possible solutions:
Use non-blocking send/recv constructs when sending/receiving to/from rank 0 itself. For more information, take a look at the MPI_Isend, MPI_Irecv and MPI_Wait routines (a small sketch of this option follows the list).
Eliminate communication with rank 0 itself. Since you are rank 0, you already know how many pixels you have to compute.
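Here is a minimal, self-contained sketch of the first option (variable names are illustrative, not taken from your code): rank 0 posts the receive for its own message before the blocking send, so the self-send always has a matching receive.
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int myRank, nRanks;
    MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
    MPI_Comm_size(MPI_COMM_WORLD, &nRanks);

    int myNRows = 10 + myRank;   // stand-in for each rank's row count
    MPI_Request selfReq;
    int selfNRows = -1;

    if (myRank == 0)             // the receive from self is posted *before* the send
        MPI_Irecv(&selfNRows, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &selfReq);

    MPI_Send(&myNRows, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);

    if (myRank == 0) {
        MPI_Wait(&selfReq, MPI_STATUS_IGNORE);
        std::printf("rank 0's own count: %d\n", selfNRows);
        for (int r = 1; r < nRanks; ++r) {   // the other ranks are received as before
            int otherNRows = 0;
            MPI_Recv(&otherNRows, 1, MPI_INT, r, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            std::printf("rank %d's count: %d\n", r, otherNRows);
        }
    }
    MPI_Finalize();
    return 0;
}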
As explained in a previous answer, MPI_Send() might block.
From a theoretical MPI point of view, your application is incorrect because of a potential deadlock (rank 0 MPI_Send() to itself when no receive is posted).
From a very pragmatic point of view, MPI_Send() generally returns immediately when a small message is sent (such as myNRows), but blocks until a matching receive is posted when a large message is sent (such as pixels). Please keep in mind that:
what counts as small or large depends at least on both the MPI library and the interconnect being used;
it is incorrect from an MPI point of view to assume that MPI_Send() will return immediately for small messages.
If you really want to make sure your application is deadlock free, you can simply replace MPI_Send() with MPI_Ssend().
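In step 3 of your code, for example, that would just mean using the same buffers and tags with the synchronous call (a sketch, not a full rewrite):
/* MPI_Ssend() only completes once a matching receive has been posted, so a
 * communication pattern that can deadlock will now deadlock reproducibly,
 * regardless of message size. */
MPI_Ssend(&myNRows, 1, MPI_INT, 0, OTHER_N_ROWS_TAG, MPI_COMM_WORLD);
MPI_Ssend(&pixels, sizeof(int) * myNRows * nCols, MPI_BYTE, 0,
          OTHER_PIXELS_TAG, MPI_COMM_WORLD);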
Back to your question, there are several options here:
revamp your app so rank 0 does not communicate with itself (all the info is already available, so no communication is needed);
post an MPI_Irecv() before MPI_Send(), and replace MPI_Recv(source=0) with MPI_Wait();
revamp your app so rank 0 does not call MPI_Send() nor MPI_Recv(source=0), but MPI_Sendrecv() instead. This is my recommended option, since you only have to make a small change to the communication pattern (the computation pattern is kept untouched), which is more elegant imho.
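As a minimal, self-contained sketch of that last option (the integer here is just a stand-in for your row count): MPI_Sendrecv posts the send and the receive in one call, so rank 0 can safely exchange a message with itself.
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int myRank;
    MPI_Comm_rank(MPI_COMM_WORLD, &myRank);

    int myNRows = 42;      // stand-in for the per-rank row count
    int received = -1;
    if (myRank == 0) {
        // Send to self and receive from self in the same call: no deadlock.
        MPI_Sendrecv(&myNRows, 1, MPI_INT, 0, 0,
                     &received, 1, MPI_INT, 0, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("rank 0 received %d from itself\n", received);
    }
    MPI_Finalize();
    return 0;
}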
I am trying to solve the grid unique paths problem. The problem involves finding the number of possible unique paths in a 2D grid starting from top left (0,0) to the bottom right (say A,B). One can only move right or down. Here is my initial attempt:
#include <stdio.h>
int count=0;
void uniquePathsRecur(int r, int c, int A, int B){
if(r==A-1 & c==B-1){
count++;
return;
}
if(r<A-1){
return uniquePathsRecur(r++,c,A,B);
}
if(c<B-1){
return uniquePathsRecur(r,c++,A,B);
}
}
int uniquePaths(int A, int B) {
if(B==1 | A==1){
return 1;
}
uniquePathsRecur(0,0,A,B);
return count;
}
int main(){
printf("%d", uniquePaths(5,3));
return 0;
}
I end up getting segmentation fault: 11 with my code. I tried to debug it in lldb and I get the following:
(lldb) target create "a.out"
Current executable set to 'a.out' (x86_64).
(lldb) r
Process 12171 launched: '<path to process>/a.out' (x86_64)
Process 12171 stopped
* thread #1: tid = 0x531b2e, 0x0000000100000e38 a.out`uniquePathsRecur + 8, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x7fff5f3ffffc)
frame #0: 0x0000000100000e38 a.out`uniquePathsRecur + 8
a.out`uniquePathsRecur:
-> 0x100000e38 <+8>: movl %edi, -0x4(%rbp)
0x100000e3b <+11>: movl %esi, -0x8(%rbp)
0x100000e3e <+14>: movl %edx, -0xc(%rbp)
0x100000e41 <+17>: movl %ecx, -0x10(%rbp)
(lldb)
What is wrong with the above code?
I don't know what the problem with your code is, but you can solve the problem without using recursion.
Method 1: We can solve this problem with simple math. The requirement is that you can only move either down or right at any point, so it takes exactly (m + n) steps from S to D, and n out of those (m + n) steps go down. Thus, the answer is C(m + n, n).
Method 2: Let us solve the problem the computer-science way. This is a typical dynamic programming problem. Assume the robot is standing at (i, j). How did it arrive there? It could have moved down from (i - 1, j) or right from (i, j - 1). So the number of paths to (i, j) equals the sum of the paths to (i - 1, j) and the paths to (i, j - 1). We can use another array to store the number of paths to every node, using the equations below:
paths(i, j) = 1                                  // i == 0 or j == 0
paths(i, j) = paths(i - 1, j) + paths(i, j - 1)  // i != 0 and j != 0
However, on further thought, you will find that you don't actually need a 2D array to record all the values, since when the robot is at row i you only need the paths for row (i - 1). So the equations become:
paths(j) = 1                        // j == 0, for any i
paths(j) = paths(j - 1) + paths(j)  // j != 0, for any i
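Here is a short sketch of Method 2's one-dimensional recurrence (written from the description above, independently of your code; m and n are the numbers of rows and columns):
#include <cstdio>
#include <vector>

int uniquePathsDP(int m, int n) {
    std::vector<long long> paths(n, 1);        // paths(j) = 1 for the first row
    for (int i = 1; i < m; ++i)
        for (int j = 1; j < n; ++j)
            paths[j] += paths[j - 1];          // paths(j) = paths(j) + paths(j - 1)
    return static_cast<int>(paths[n - 1]);
}

int main() {
    std::printf("%d\n", uniquePathsDP(5, 3));  // prints 15 for a 5x3 grid, i.e. C(4 + 2, 2)
    return 0;
}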
For more information, please see here: https://algorithm.pingzhang.io/DynamicProgramming/unique_path.html