I want to do matrix-vector multiplication with MPI. The code compiles but does not run. Can anyone please help me solve the problem? Thank you in advance.
#include "mpi.h"
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <string.h>
#include <time.h>
#define DIM 500
int main(int argc, char *argv[])
{
    int i, j, n=10000;
    int nlocal; /* Number of locally stored rows of A */
    double *fb; /* Will point to a buffer that stores the entire vector b */
    double a[DIM * DIM], b[DIM], x[DIM];
    int npes, myrank;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    /* Get information about the communicator */
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &npes);
    /* Allocate the memory that will store the entire vector b */
    fb = (double*)malloc(npes * sizeof(double));
    nlocal = n / npes;
    /* Gather the entire vector b on each processor using MPI's ALLGATHER operation */
    MPI_Allgather(b, nlocal, MPI_DOUBLE, fb, nlocal, MPI_DOUBLE, MPI_COMM_WORLD);
    /* Perform the matrix-vector multiplication involving the locally stored submatrix */
    for (i = 0; i < nlocal; i++) {
        x[i] = 0.0;
        for (j = 0; j < n; j++)
            x[i] += a[i * n + j] * fb[j];
    }
} //end main
Please help me out in running the code. Thanks.
The issue may come from fb = (double*)malloc(npes * sizeof(double)); which should be fb = (double*)malloc(n * sizeof(double));. Indeed, npes is the number of processes and n is the total length of the vector.
Moreover, the array a has 500x500 = 250000 elements, which is only enough to store 25 rows if n=10000. Are you using 400 processes? If you are using fewer than 400 processes, a[i * n + j] reads past the end of the array. That is undefined behavior and can cause a segmentation fault.
Lastly, a is a large array, and since it is declared as double a[500*500], it is allocated on the stack. Read: Segmentation fault on large array sizes. The best way to go is to use malloc() for a as well, with the appropriate size (here nlocal*n).
double *a=malloc(nlocal*n*sizeof(double));
if(a==NULL){fprintf(stderr,"process %d : malloc failed\n",myrank);exit(1);}
n=10000 is rather large. Consider using computed numbers such as nlocal*n for the size of the array a, not default sizes such as DIM. That way, you will be able to debug your code on smaller n and memory will not be wasted.
The same comments apply to b and x allocated as double b[500] and double x[500] while much larger arrays are needed if n=10000. Once again, consider using malloc() with the appropriate number, not a defined value DIM=500 !
double *b=malloc(n*sizeof(double));
if(b==NULL){fprintf(stderr,"process %d : malloc failed\n",myrank);exit(1);}
double *x=malloc(nlocal*sizeof(double));
if(x==NULL){fprintf(stderr,"process %d : malloc failed\n",myrank);exit(1);}
A memory checker such as Valgrind can detect this kind of memory-management problem. Try it on your program using a single process!
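The sizing advice can be tried out without MPI at all. Here is a minimal serial sketch of one process's share of the work: it allocates the local row block with a computed size (nlocal*n) and runs the same loop as in the question. The names local_matvec and alloc_local_rows are my own, not from the original code, and the full vector fb is assumed to be already gathered.

```c
#include <stdlib.h>

/* Hypothetical helper: allocates the local row block with a computed size
 * (nlocal rows of n doubles) instead of a fixed DIM. Returns NULL on failure. */
double *alloc_local_rows(int nlocal, int n)
{
    return malloc((size_t)nlocal * (size_t)n * sizeof(double));
}

/* Hypothetical helper: multiplies the nlocal x n row block `a` (row-major)
 * by the full vector `fb` of length n and writes nlocal results into `x`.
 * Returns 0 on success, -1 if any pointer is NULL. */
int local_matvec(const double *a, const double *fb, double *x,
                 int nlocal, int n)
{
    if (a == NULL || fb == NULL || x == NULL)
        return -1;
    for (int i = 0; i < nlocal; i++) {
        x[i] = 0.0;
        for (int j = 0; j < n; j++)
            x[i] += a[i * n + j] * fb[j];
    }
    return 0;
}
```

Debugging with a small n (say 4) and a single process first makes out-of-bounds indexing much easier to spot before scaling up to n=10000.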
I have been searching about this for a long time, but I am not able to understand what this question means.
Write a program in any language to determine how your computer handles graceful
I understand that an overflow condition is something like this:
If an integer can store a maximum value of x and we assign it x+1, the value wraps around to the lowest value the integer can hold. I understand that underflow is just the reverse.
What does it mean from a high-performance scientific computing / linear algebra point of view?
I have read this link, but I think it describes the same underflow/overflow behavior that I mentioned above. What does graceful underflow stand for?
Okay, the link posted by StoneBird was particularly helpful. Here I have created a program in C that demonstrates it.
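#include <stdio.h>
#include <math.h>
#include <string.h>

int main(int argc, char **argv)
{
    unsigned int bits, s, e, m;
    float temp;
    float min = pow(2, -129);   /* below the smallest normal float (2^-126) */

    temp = min;                 /* the value whose bit pattern is inspected */
    memcpy(&bits, &temp, sizeof bits);  /* copy the bits without pointer aliasing */
    s = bits >> 31;                 /* 1 sign bit */
    e = (bits & 0x7f800000) >> 23;  /* 8 exponent bits */
    m = bits & 0x007fffff;          /* 23 mantissa bits */
    printf("sign = %x\n", s);
    printf("exponent = %x\n", e);
    printf("mantissa = %x\n", m);
    return 0;
}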
Here the min variable is used to change the final number. I used min=pow(2,-129), pow(2,-128) and pow(2,-130) to see the results, and saw the denormal numbers appear. This wiki page explains it all.
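To see graceful (gradual) underflow without decoding bits by hand, here is a minimal sketch, assuming IEEE 754 single precision with round-to-nearest and no flush-to-zero mode. The function name count_subnormal_steps is my own. It keeps halving FLT_MIN and counts how many results are still classified as subnormal before the value finally becomes zero.

```c
#include <float.h>
#include <math.h>

/* Hypothetical helper: halves FLT_MIN repeatedly and counts how many of the
 * results are subnormal (nonzero but below the smallest normal float).
 * Assumes IEEE 754 floats; `volatile` forces each result to be rounded
 * to float precision before it is classified. */
int count_subnormal_steps(void)
{
    volatile float x = FLT_MIN;   /* smallest positive normal float, 2^-126 */
    int steps = 0;

    for (;;) {
        x = x / 2.0f;
        if (fpclassify(x) != FP_SUBNORMAL)
            break;                /* reached zero (or flushed to zero) */
        steps++;
    }
    return steps;
}
```

With gradual underflow I would expect 23 steps, one per mantissa bit, since precision is lost one bit at a time. A platform that flushes subnormals to zero would report 0 instead, which is the abrupt behavior that graceful underflow avoids.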
I'm new to OpenCL and in order to get a better grasp of a few concepts I contrived a simple example of a geometric progression as follows (emphasis on contrived):
An array of N values and N coefficients (whose values could be
anything, but in the example they all are the same) are allocated.
M steps are performed in sequence where each value in the values array
is multiplied by its corresponding coefficient in the coefficients
array and assigned as the new value in the values array. Each step needs to fully complete before the next step can begin. I know this part is a bit contrived, but this is a requirement I want to enforce to help my understanding of OpenCL.
I'm only interested in the values in the values array after the final step has completed.
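For reference, the steps described above can be written as a plain serial loop on the host. This is only a sketch for checking GPU results against, with a function name (run_steps) of my own choosing; it is not part of the OpenCL program.

```c
#include <stddef.h>

/* Hypothetical host-side reference: performs `steps` sequential passes,
 * each multiplying every value by its corresponding coefficient in place.
 * A pass finishes completely before the next one starts. */
void run_steps(float *values, const float *coeffs, size_t n, size_t steps)
{
    for (size_t s = 0; s < steps; s++) {
        for (size_t i = 0; i < n; i++)
            values[i] *= coeffs[i];
    }
}
```

Running this with a small N and M and comparing against the buffer read back from the device is a cheap way to confirm that the kernel enqueue order is doing what is intended.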
Here is the very simple OpenCL kernel (MultiplyVectors.cl):
__kernel void MultiplyVectors (__global float4* x, __global float4* y, __global float4* result)
{
    int i = get_global_id(0);
    result[i] = x[i] * y[i];
}
And here is the host program (main.cpp):
#include <CL/cl.hpp>
#include <vector>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
int main ()
{
    auto context = cl::Context (CL_DEVICE_TYPE_GPU);

    // Read the kernel source into a null-terminated buffer
    auto *sourceFile = fopen("MultiplyVectors.cl", "r");
    if (sourceFile == nullptr)
    {
        perror("Couldn't open the source file");
        return 1;
    }
    fseek(sourceFile, 0, SEEK_END);
    const auto sourceSize = ftell(sourceFile);
    rewind(sourceFile); // go back to the start, otherwise fread reads nothing
    auto *sourceBuffer = new char [sourceSize + 1];
    sourceBuffer[sourceSize] = '\0';
    fread(sourceBuffer, sizeof(char), sourceSize, sourceFile);
    fclose(sourceFile);
    auto program = cl::Program (context, cl::Program::Sources {std::make_pair (sourceBuffer, sourceSize + 1)});
    delete[] sourceBuffer;

    const auto devices = context.getInfo<CL_CONTEXT_DEVICES> ();
    program.build (devices);
    auto kernel = cl::Kernel (program, "MultiplyVectors");

    const size_t vectorSize = 1024;
    float coeffs[vectorSize] {};
    for (size_t i = 0; i < vectorSize; ++i)
        coeffs[i] = 1.000001;
    auto coeffsBuffer = cl::Buffer (context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof (coeffs), coeffs);

    float values[vectorSize] {};
    for (size_t i = 0; i < vectorSize; ++i)
        values[i] = static_cast<float> (i);
    auto valuesBuffer = cl::Buffer (context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, sizeof (values), values);

    // values is both an input and the output, so each step updates it in place
    kernel.setArg (0, coeffsBuffer);
    kernel.setArg (1, valuesBuffer);
    kernel.setArg (2, valuesBuffer);

    auto commandQueue = cl::CommandQueue (context, devices[0]);
    for (size_t i = 0; i < 1000000; ++i)
    {
        commandQueue.enqueueNDRangeKernel (kernel, cl::NDRange (0), cl::NDRange (vectorSize / 4), cl::NullRange);
    }
    printf ("All kernels enqueued. Waiting to read buffer after last kernel...");
    commandQueue.enqueueReadBuffer (valuesBuffer, CL_TRUE, 0, sizeof (values), values);
    return 0;
}
What I'm basically asking is for advice on how to best optimize this OpenCL program to run on a GPU. I have the following questions based on my limited OpenCL experience to get the conversation going:
Could I be handling the buffers better? I'd like to minimize any
unnecessary ferrying of data between the host and the GPU.
What's the optimal work group configuration (in general at least, I
know this can vary by GPU)? I'm not actually sharing any data
between work items and it doesn't seem like I'd benefit from work
groups much here, but just in case.
Should I be allocating and loading anything into local memory for a
work group (if that would at all makes sense)?
I'm currently enqueuing one kernel for each step, which will create a
work item for each 4 floats to take advantage of a hypothetical GPU with a SIMD
width of 128 bits. I'm attempting to enqueue all of this
asynchronously (although I'm noticing the Nvidia implementation I have
seems to block each enqueue until the kernel is complete) at once
and then wait on the final one to complete. Is there an altogether better
approach to this that I'm missing?
Is there a design that would allow for only one call to
enqueueNDRangeKernel (instead of one call per step) while
maintaining the ability for each step to be efficiently processed in
Obviously I know that the example problem I'm solving can be done in much better ways, but I wanted to have as simple of an example as possible that illustrated a vector of values being operated on in a series of steps where each step has to be completed fully before the next. Any help and pointers on how to best go about this would be greatly appreciated.
I am working on OpenCL code for sparse matrix operations and I find that it works when the code including the kernel is executed once or twice. But every few runs the answer is slightly off. Here is the very simple kernel I am using:
__kernel void dsmv( int N, __global int * IA,
                    __global int * JA, __global float * A,
                    __global float * X, __global float * Y){
    int IBGN, ICOL, IEND, ii;
    ICOL = get_global_id(0);
    if(ICOL < N)
    {
        /* JA holds 1-based column start offsets; convert to 0-based bounds */
        IBGN = JA[ICOL]-1;
        IEND = JA[ICOL+1]-1-1;
        for (ii = IBGN; ii <= IEND; ii++)
        {
            Y[IA[ii]-1] += A[ii]*X[ICOL];
        }
    }
}
I can also post the Fortran code that uses this kernel. I am using FortranCL.
What could cause the multiplication to give different answers from run to run?
This line looks suspicious:
Y[IA[ii]-1] += A[ii]*X[ICOL];
It seems that two work items may increment the same memory location, so there is a potential race condition here, and since += is not an atomic operation this is a problem.
Unfortunately you can't use the built-in atomic_add instead because it doesn't support floats. atomic_cmpxchg doesn't take floats either, but it works on 32-bit integers, so you can reinterpret the float's bits and use it in a compare-and-swap loop to implement a floating-point atomic add - or just look at this existing implementation of an atomic add for floats.
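To illustrate the compare-and-swap idea outside of OpenCL, here is a minimal host-side sketch in C11 using <stdatomic.h>. It is not the OpenCL implementation referred to above; the helper names are my own, and it assumes float is 32 bits wide. The float lives in memory as raw 32-bit integer bits, and the update is retried until no other thread changed the value in between.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: stores a float as raw bits in an atomic 32-bit cell. */
void store_float(_Atomic uint32_t *cell, float value)
{
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);
    atomic_store(cell, bits);
}

/* Hypothetical helper: reads the float back from the atomic cell. */
float load_float(_Atomic uint32_t *cell)
{
    uint32_t bits = atomic_load(cell);
    float value;
    memcpy(&value, &bits, sizeof value);
    return value;
}

/* Hypothetical helper: atomically adds `value` to the float held in *cell
 * using a compare-and-swap retry loop. */
void atomic_add_float(_Atomic uint32_t *cell, float value)
{
    uint32_t old_bits = atomic_load(cell);
    uint32_t new_bits;
    float old_val, new_val;

    do {
        memcpy(&old_val, &old_bits, sizeof old_val);
        new_val = old_val + value;
        memcpy(&new_bits, &new_val, sizeof new_bits);
        /* On failure, old_bits is refreshed with the current contents. */
    } while (!atomic_compare_exchange_weak(cell, &old_bits, new_bits));
}
```

Note that even with an atomic add the result may still differ slightly from run to run, because floating-point addition is not associative and the order in which work items contribute is not fixed.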
I have a problem using a buffer of bytes in global memory to store integers of various sizes (8 bits, 16 bits, 32 bits, 64 bits).
If I store an integer at a pointer value that is not a multiple of 4 bytes (for instance because I just stored an 8-bit integer), the address is rounded down, erasing the previous data.
__global__ void kernel(char* pointer)
In this example code, using any of : (pointer), (pointer+1), (pointer+2), (pointer+3) the integer is stored at (pointer), considering pointer is a multiple of 4.
Is CUDA memory organized in 32-bit blocks at the hardware level?
Is there any way to make this work ?
The word size alignment is non-negotiable in CUDA. However, if you're willing to take the performance hit for some reason, you could pack your data into char * and then just write your own custom storage function, e.g.
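__inline__ __device__ void Assign(int val, char * arr, int len)
{
    /* Store the low `len` bytes of val, least significant byte first */
    for (int idx = 0; idx < len; idx++)
        arr[idx] = (char)((val >> (idx << 3)) & 0xFF);
}
__inline__ __device__ int Get(char * arr, int len)
{
    int val = 0;
    /* Reassemble the value from `len` bytes, least significant byte first */
    for (int idx = 0; idx < len; idx++)
        val |= ((int)(unsigned char)arr[idx]) << (idx << 3);
    return val;
}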
Hope that helps!
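The same byte-by-byte idea can be checked on the host in plain C before moving it into a kernel. This is a minimal sketch with helper names of my own (store_le, load_le), assuming little-endian byte order within the buffer and values of at most 8 bytes. Because every access is a single byte, no alignment requirement applies.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes the low `len` bytes of `val` into `buf`,
 * least significant byte first. `len` must be between 1 and 8. */
void store_le(unsigned char *buf, uint64_t val, size_t len)
{
    for (size_t i = 0; i < len; i++)
        buf[i] = (unsigned char)((val >> (8 * i)) & 0xFF);
}

/* Hypothetical helper: reads `len` bytes from `buf` and reassembles them
 * into an integer, least significant byte first. */
uint64_t load_le(const unsigned char *buf, size_t len)
{
    uint64_t val = 0;
    for (size_t i = 0; i < len; i++)
        val |= (uint64_t)buf[i] << (8 * i);
    return val;
}
```

For example, an 8-bit value can be stored at offset 0 and a 32-bit value right after it at offset 1 without either overwriting the other.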
My first thought was that MPI_Scatter and the send-buffer allocation should go inside an if(proc_id == 0) clause, because the data should be scattered only once and each process needs only a portion of the data in the send-buffer. However, it didn't work correctly.
It appears that send-buffer allocation and MPI_Scatter must be executed by all processes for the application to work correctly.
So I wonder, what is the philosophy behind MPI_Scatter, given that all processes then have access to the send-buffer?
Any help would be appreciated.
Code I wrote like this:
if (proc_id == 0) {
    int * data = (int *)malloc(size*sizeof(int) * proc_size * recv_size);
    for (int i = 0; i < proc_size * recv_size; i++) data[i] = i;
    ierr = MPI_Scatter(&(data[0]), recv_size, MPI_INT, &recievedata, recv_size, MPI_INT, 0, MPI_COMM_WORLD);
}
I thought that was enough for the root process to scatter the data, and that the other processes only needed to receive it. So I put MPI_Scatter, along with the send buffer definition and allocation, inside the if(proc_id == 0) statement. There was no compile or runtime error or warning, but the receive buffers of the other processes didn't receive their corresponding parts of the data.
Your question isn't very clear, and would be a lot easier to understand if you showed some code that you were having trouble with. Here's what I think you're asking -- and I'm only guessing this because this is an error I've seen people new to MPI in C make.
If you have some code like this:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
int main(int argc, char **argv) {
    int proc_id, size, ierr;
    int *data;
    int recievedata;

    ierr = MPI_Init(&argc, &argv);
    ierr|= MPI_Comm_size(MPI_COMM_WORLD,&size);
    ierr|= MPI_Comm_rank(MPI_COMM_WORLD,&proc_id);

    if (proc_id == 0) {
        data = (int *)malloc(size*sizeof(int));
        for (int i=0; i<size; i++) data[i] = i;
    }

    ierr = MPI_Scatter(&(data[0]), 1, MPI_INT,
           &recievedata, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("Rank %d recieved <%d>\n", proc_id, recievedata);

    if (proc_id == 0) free(data);

    ierr = MPI_Finalize();
    return 0;
}
why doesn't it work, and why do you get a segmentation fault? Of course the other processes don't have access to data; that's the whole point.
The answer is that in the non-root processes, the sendbuf argument (the first argument to MPI_Scatter()) isn't used. So the non-root processes don't need access to data. But you still can't go around dereferencing a pointer that you haven't defined. So you need to make sure all the C code is valid. But data can be NULL or completely undefined on all the other processes; you just have to make sure you're not accidentally dereferencing it. So this works just fine, for instance:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
int main(int argc, char **argv) {
    int proc_id, size, ierr;
    int *data;
    int recievedata;

    ierr = MPI_Init(&argc, &argv);
    ierr|= MPI_Comm_size(MPI_COMM_WORLD,&size);
    ierr|= MPI_Comm_rank(MPI_COMM_WORLD,&proc_id);

    if (proc_id == 0) {
        data = (int *)malloc(size*sizeof(int));
        for (int i=0; i<size; i++) data[i] = i;
    } else {
        data = NULL;
    }

    ierr = MPI_Scatter(data, 1, MPI_INT,
           &recievedata, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("Rank %d recieved <%d>\n", proc_id, recievedata);

    if (proc_id == 0) free(data);

    ierr = MPI_Finalize();
    return 0;
}
If you're using "multidimensional arrays" in C, and say scattering a row of a matrix, then you have to jump through an extra hoop or two to make this work, but it's still pretty easy.
Note that in the above code, all processes call Scatter, both the sender and the receivers. (Actually, the sender is also a receiver.)
In the message passing paradigm, both the sender and the receiver have to cooperate to send data. In principle, these tasks could be on different computers, housed perhaps in different buildings -- nothing is shared between them. So there's no way for Task 1 to just "put" data into some part of Task 2's memory. (Note that MPI2 has "one sided messages", but even that requires a significant degree of coordination between sender and receiver, as a window has to be put aside to push data into or pull data out of.)
The classic example of this is send/receive pairs; it's not enough that (say) process 0 sends data to process 3, process 3 also has to receive data.
The MPI_Scatter function contains both send and receive logic. The root process (specified here as 0) sends out the data, and all the receivers receive; everyone participating has to call the routine. Scatter is an example of an MPI Collective Operation, where all tasks in the communicator have to call the same routine. Broadcast, barrier, reduction operations, and gather operations are other examples.
If you have only process 0 call the scatter operation, your program will hang, waiting forever for the other tasks to participate.
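If it helps to picture which part of the root's buffer each rank ends up with, here is a small sketch in plain C. It is not an MPI call and does no communication; the helper name copy_scatter_chunk is my own. It only mimics the data layout of a scatter with equal chunk sizes: rank r receives the recv_count elements starting at index r * recv_count of the root's send buffer.

```c
#include <string.h>

/* Hypothetical helper, NOT part of MPI: copies the chunk of `sendbuf` that a
 * scatter with equal chunk sizes would deliver to `rank` into `recvbuf`. */
void copy_scatter_chunk(const int *sendbuf, int recv_count, int rank,
                        int *recvbuf)
{
    memcpy(recvbuf,
           sendbuf + (size_t)rank * (size_t)recv_count,
           (size_t)recv_count * sizeof(int));
}
```

In a real MPI program only the root owns sendbuf, and the copy above is replaced by messages that every rank takes part in by calling MPI_Scatter.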