I wish to calculate the speedup of my MPI application against the number of parallel processes/nodes.
The application mostly performs computations on huge matrices in parallel.
I can measure the elapsed time using MPI_Wtime(), something like this:
double start = MPI_Wtime();
....
double end = MPI_Wtime();
double elapsed = end - start;
But how can I measure this against the degree of parallelization?
The usual definition of speedup is time on 1 process divided by time on p processes.
If you wish to present the performance of your code, it's good to pick a range of p from 1 up to the largest count you can run, and to plot the results on a speedup vs. p plot.
Note that, strictly speaking, speedup should compare the time on p processes against the best available sequential code, not just your parallel code run on a single process. This may sound like a pedantic point, but in some areas parallel codes are pretty awful in the sequential case. In the sparse matrix world, for example, you can find parallel codes that are 10-50x slower than the best sequential code.
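To collect T(p), here is a minimal sketch (the compute() placeholder and the choice of taking the maximum across ranks are assumptions, not part of the question): synchronize the ranks, time the region of interest, and report the slowest rank's time; speedup is then T(1)/T(p) across runs with different p.

/* Sketch: report T(p) as the slowest rank's elapsed time. */
#include <mpi.h>
#include <stdio.h>

void compute(void) { /* hypothetical placeholder for the matrix work */ }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    MPI_Barrier(MPI_COMM_WORLD);            /* start all ranks together */
    double start = MPI_Wtime();
    compute();
    double elapsed = MPI_Wtime() - start;

    /* the run is only as fast as its slowest rank */
    double t_p;
    MPI_Reduce(&elapsed, &t_p, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("p = %d, T(p) = %g s\n", p, t_p);  /* speedup = T(1)/T(p) */
    MPI_Finalize();
    return 0;
}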
I have a computation loop with adaptive time stepping, and I need to store the results at each iteration. In other words, I do not know the vector size before the computation, so I cannot preallocate a vector to store the data. Right now, I build the vectors using the push! function:
function comp_loop(model_time)
    clock = [0.0]
    data = [0.0]
    run_time = 0.0
    while run_time < model_time
        # Calculate new (adaptive) timestep
        timestep = rand(Float64)
        run_time += timestep
        # Grow the vectors dynamically
        push!(clock, run_time)
        push!(data, timestep)
    end
    return clock, data
end
Is there a more efficient way to go about this? Again, I know the optimal choice is to preallocate the vector, but I do not have that luxury. Buffers are theoretically not an option either, as I don't know how large to make them. I'm looking for something more "optimal" in Julia (i.e. perhaps some advanced facility the language provides).
Theoretically, you can use a linked list, such as the one from DataStructures.jl, to get O(1) appends, and then optionally write it out to a Vector afterwards (probably in reverse order, though).
In practice, push!ing to a Vector is often efficient enough: Vectors use a doubling strategy to manage their dynamic size, which gives amortized constant-time appends and the advantage of contiguous memory access.
So you could try the linked list, but be sure to benchmark whether it's worth the effort.
Now, the above is about time complexity. When you care about allocation, the argument is quite similar: with a Vector, you end up holding memory proportional to the next power of two above your actual requirement. Most often, that can be considered amortized, too.
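For reference, a minimal sketch of the linked-list variant using DataStructures.jl's functional list (nil/cons); the loop body mirrors the asker's adaptive-step loop above:

using DataStructures

function comp_loop_list(model_time)
    lst = nil(Float64)              # empty linked list of Float64
    run_time = 0.0
    while run_time < model_time
        timestep = rand(Float64)
        run_time += timestep
        lst = cons(run_time, lst)   # O(1) prepend
    end
    # cons prepends, so the list is newest-first; reverse for chronological order
    return reverse!(collect(Float64, lst))
end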
I am doing a learning exercise. I want to count the number of primes in a range from 0 to N. Which MPI function can I use to distribute ranges of numbers to the processes? In other words, each process should count the primes within a sub-range of the main range.
You could simply use a for loop and MPI_Send on the root (and MPI_Recv on receivers) to send to each process the number at which it should start and how many numbers it should check.
Another, even better, possibility is to send N to each process with MPI_Bcast (on root and receivers) and let each process compute which numbers it should check from its own rank (something like start = N / MPI_Comm_size * MPI_Comm_rank and length = N / MPI_Comm_size, with some adequate rounding, etc.).
You can probably optimize load balancing even more but you should get it working first.
At the end you should call MPI_Reduce with a sum.
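A minimal sketch of the broadcast variant (the trial-division is_prime and the simple chunking are illustrative choices, not the only ones):

#include <mpi.h>
#include <stdio.h>

static int is_prime(long n) {
    if (n < 2) return 0;
    for (long d = 2; d * d <= n; d++)
        if (n % d == 0) return 0;
    return 1;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    long N = 0;
    if (rank == 0) N = 1000000;            /* example upper bound, known on root */
    MPI_Bcast(&N, 1, MPI_LONG, 0, MPI_COMM_WORLD);

    /* Split [0, N) into size chunks; the last rank takes the remainder. */
    long chunk = N / size;
    long start = chunk * rank;
    long end   = (rank == size - 1) ? N : start + chunk;

    long local = 0, total = 0;
    for (long n = start; n < end; n++)
        local += is_prime(n);

    MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("primes below %ld: %ld\n", N, total);
    MPI_Finalize();
    return 0;
}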
This is on a Cray supercomputer using the MPICH2 library. Each node has 32 CPUs.
I have a single float on N different MPI ranks, where each of these ranks is on a different node. I need to perform a reduction operation on this group of floats. I would like to know whether an MPI_Reduce is faster than MPI_Gather with the reduction calculated on the root, for any value of N. Please assume that the reduction done on the root rank will be done using a good parallel reduction algorithm that can utilize N threads.
If it isn't faster for any value of N, would it tend to be true for smaller N, like 16, or larger N?
If it is true, why? (For example, will MPI_Reduce use a tree communication pattern that tends to hide the reduction operation's time in the approach it uses to communicate with the next level of the tree?)
Assume that MPI_Reduce is always faster than MPI_Gather + local reduce.
Even if there were a value of N for which reduction was slower than gather, an MPI implementation could easily implement reduction for that case in terms of gather + local reduce.
MPI_Reduce has only advantages over MPI_Gather + local reduce:
MPI_Reduce is the more high-level operation giving the implementation more opportunity to optimize.
MPI_Reduce needs to allocate much less memory.
MPI_Reduce needs to communicate less data (if using a tree), or less data over the same link (if using direct all-to-one).
MPI_Reduce can distribute the computation across more resources (e.g. using a tree communication pattern).
That said:
Never assume anything about performance. Measure.
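In that spirit, a minimal benchmark sketch (assumptions: one double per rank, root 0; repetitions and warm-up omitted for brevity):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double x = (double)rank, sum = 0.0;
    double *buf = NULL;
    if (rank == 0) buf = malloc(size * sizeof(double));

    /* Time MPI_Reduce. */
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    MPI_Reduce(&x, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    double t_reduce = MPI_Wtime() - t0;

    /* Time MPI_Gather followed by a local reduction on the root. */
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    MPI_Gather(&x, 1, MPI_DOUBLE, buf, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    if (rank == 0) {
        sum = 0.0;
        for (int i = 0; i < size; i++) sum += buf[i];
    }
    double t_gather = MPI_Wtime() - t0;

    if (rank == 0) {
        printf("reduce: %g s, gather + local reduce: %g s\n", t_reduce, t_gather);
        free(buf);
    }
    MPI_Finalize();
    return 0;
}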
I have heard that writing for loops in R is particularly slow. I have the following code, which needs to run through 122,000 rows, each with 513 columns, transforming them using the fft() function:
for (i in 2:100000) {
  Data1[i, 2:513] <- fft(as.numeric(Data1[i, 2:513]), inverse = TRUE) / 512
}
I have tried to do this for 1000 cycles and it took a few minutes... is there a way to do this loop faster? Maybe by not using a loop, or by doing it in C?
mvfft (documented on the fft help page) was designed to do this all at once. It's hard to imagine how you could do it any faster: less than three seconds (on an older Xeon workstation) for a dataset of exactly your size.
n.row <- 122e3                           # number of series (rows)
X <- matrix(rnorm(n.row * 512), n.row)
system.time(
  Y <- mvfft(t(X), inverse = TRUE) / 512 # transform every column at once
)
user system elapsed
2.34 0.39 2.75
Note that the discrete FFT in this case has complex values.
FFTs are fast. Typically they can be computed in less time than it takes to read the data from an ASCII file (because the character-to-numeric conversions involved in the read take more time than the calculations in the FFT). Your limiting resources are therefore I/O throughput and RAM. But 122,000 vectors of 512 complex values occupy "only" about a gigabyte, so you should be OK.
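Hypothetically, applied to the asker's layout (rows are the series; column 1 is assumed to be an identifier), and remembering that the result is complex, so it must go into a new object rather than back into a numeric data frame:

# Sketch: transpose so each series is a column, transform, transpose back.
spec <- t(mvfft(t(as.matrix(Data1[, 2:513])), inverse = TRUE) / 512)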
I'm taking my first steps in OpenCL (and CUDA) for my internship. All nice and well, I now have working OpenCL code, but the computation times are way too high, I think. My guess is that I'm doing too much I/O, but I don't know where that could be.
The main code is here: http://pastebin.com/i4A6kPfn, and the kernel here: http://pastebin.com/Wefrqifh. I start measuring time after segmentPunten(segmentArray, begin, eind) has returned, and I stop measuring after the last clEnqueueReadBuffer.
Computation time on an NVIDIA GT440 is 38.6 seconds, on a GT555M 35.5 seconds, on an Athlon II X4 5.6 seconds, and on an Intel P8600 6 seconds.
Can someone explain this to me? Why are the computation times so high, and what solutions are there?
What it is supposed to do (short version): calculate how much noise load is produced by a passing airplane.
Long version: there are several Observer Points (OPs), which are the points at which the sound from a passing airplane is measured. The flight path is segmented into 10,000 segments; this is done in the function segmentPunten. The double for loop in the main gives the OPs their coordinates. There are two kernels. The first one calculates the distance from a single OP to a single segment; this is saved in the array "afstanden". The second kernel calculates the sound load at an OP from all the segments.
Just eyeballing your kernel, I see this:
kernel void SEL(global const float *afstanden, global double *totaalSEL,
const int aantalSegmenten)
{
// ...
for(i = 0; i < aantalSegmenten; i++) {
double distance = afstanden[threadID * aantalSegmenten + i];
// ...
}
// ...
}
It looks like aantalSegmenten is being set to 1000, and you have a loop in each kernel that accesses global memory 1000 times. Without crawling through the code, I'm guessing that many of these accesses overlap when considering your computation as a whole. Is this the case? Will two work items access the same global memory locations? If so, you could see a potentially huge win on the GPU by rewriting your algorithm to partition the work such that each global memory location is read only once and saved in local memory; after that, every work item in the work group that needs that location can read it quickly. A sketch of this staging pattern follows.
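A minimal sketch of that pattern, assuming every work item in a group consumes the same block of segment data (segData, out, and the accumulation are hypothetical placeholders for the real computation):

// Stage shared data in local memory: one global read per value per group.
kernel void SEL_staged(global const float *segData, global double *out,
                       const int aantalSegmenten, local float *tile)
{
    int lid = get_local_id(0);
    int lsz = get_local_size(0);
    double acc = 0.0;

    for (int base = 0; base < aantalSegmenten; base += lsz) {
        // each work item loads one element of the shared tile
        if (base + lid < aantalSegmenten)
            tile[lid] = segData[base + lid];
        barrier(CLK_LOCAL_MEM_FENCE);

        int n = min(lsz, aantalSegmenten - base);
        for (int j = 0; j < n; j++)
            acc += tile[j];                 // fast reads from local memory
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    out[get_global_id(0)] = acc;
}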
As an aside, the CL specification allows you to omit the leading __ from CL keywords like global and kernel. I don't think many newcomers to CL realize that.
Before optimizing further, you should first get an understanding of what is taking all that time. Is it the kernel compiles, data transfer, or actual kernel execution?
As mentioned above, you can get rid of the kernel compiles by caching the results. I believe some OpenCL implementations (Apple's, at least) already do this automatically. With others, you may need to do the caching manually. Here are instructions for the caching.
If the performance bottleneck is the kernel itself, you can probably get a major speed-up by organizing the 'afstanden' array lookups differently. Currently, when a block of threads performs a read from memory, the addresses are spread out through memory, which is a real killer for GPU performance. You'd ideally want to index the array with something like afstanden[ndx * NUM_THREADS + threadID], which would make accesses from a work group load a contiguous block of memory. This is much faster than the current, essentially random, memory lookups.
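A sketch of that reindexing (numThreads and the summation body are placeholders; the point is only the coalesced access pattern, which assumes 'afstanden' is stored transposed):

kernel void SEL(global const float *afstanden, global double *totaalSEL,
                const int aantalSegmenten, const int numThreads)
{
    int threadID = get_global_id(0);
    double sum = 0.0;
    for (int i = 0; i < aantalSegmenten; i++) {
        // consecutive work items now read consecutive addresses (coalesced)
        double distance = afstanden[i * numThreads + threadID];
        sum += distance;    // placeholder for the real SEL computation
    }
    totaalSEL[threadID] = sum;
}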
First of all, you are measuring not the computation time but the whole kernel read-in/compile/execute mumbo-jumbo. To make a fair comparison, measure the computation time from the first "non-static" part of your program (for example, from the first clSetKernelArg call to the last clEnqueueReadBuffer).
If the execution time is still too high, you can use some kind of profiler (such as the Visual Profiler from NVIDIA), and read the OpenCL Best Practices Guide, which is included in the CUDA Toolkit documentation.
As to the raw kernel execution time: consider (and measure) whether you really need double precision for your calculation, because double-precision computations are artificially slowed down on consumer-grade NVIDIA cards.