I am doing a learning exercise. I want to calculate the number of primes in the range from 0 to N. What MPI function can I use to distribute ranges of numbers to the processes? In other words, each process should count the primes within a sub-range of the main range.
You could simply use a for loop and MPI_Send on the root (and MPI_Recv on receivers) to send to each process the number at which it should start and how many numbers it should check.
Another possibility, even better, is to send N to each process with MPI_Bcast (on both root and receivers) and let each process compute which numbers it should check from its own rank (something like start = N / size * rank and length = N / size, where size and rank come from MPI_Comm_size and MPI_Comm_rank, plus some adequate rounding etc.).
You can probably optimize load balancing even more but you should get it working first.
At the end you should call MPI_Reduce with MPI_SUM to combine the per-process counts on the root.
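For instance, a minimal C sketch of that bcast/reduce pattern (the trial-division is_prime and the even split of the range are just placeholders, and instead of careful rounding the last rank simply absorbs the remainder):

#include <mpi.h>
#include <stdio.h>

/* Naive trial-division primality test, good enough for a learning exercise. */
static int is_prime(long n) {
    if (n < 2) return 0;
    for (long d = 2; d * d <= n; ++d)
        if (n % d == 0) return 0;
    return 1;
}

int main(int argc, char **argv) {
    int rank, size;
    long N = 1000000;                /* upper bound of the range; the root decides it */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Everyone learns N, then derives its own sub-range from its rank. */
    MPI_Bcast(&N, 1, MPI_LONG, 0, MPI_COMM_WORLD);
    long chunk = N / size;
    long start = chunk * rank;
    long end   = (rank == size - 1) ? N : start + chunk;  /* last rank takes the remainder */

    long local_count = 0, total = 0;
    for (long i = start; i < end; ++i)
        local_count += is_prime(i);

    /* Sum the per-process counts on the root. */
    MPI_Reduce(&local_count, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("primes below %ld: %ld\n", N, total);

    MPI_Finalize();
    return 0;
}

Run it with something like mpirun -np 4 ./primes; the load balancing is deliberately naive, as discussed above.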
I have a computation loop with adaptive time stepping and I need to store the results at each iteration. In other words, I do not know the vector size before the computation, so I cannot preallocate a vector to store the data. Right now, I build the vector using the push! function:
function comp_loop(model_time)
    clock = [0.0]
    data = [0.0]
    run_time = 0.0
    while run_time < model_time
        # Calculate new timestep (placeholder for the adaptive step)
        timestep = rand(Float64)
        run_time += timestep
        # Grow the result vectors
        push!(clock, run_time)
        push!(data, timestep)
    end
    return clock, data
end
Is there a more efficient way to go about this? Again, I know the optimal choice is to preallocate the vector, but I do not have that luxury. Buffers are theoretically not an option either, as I don't know how large to make them. I'm looking for a more "optimal" way to implement this in Julia (i.e., maybe some advanced feature available in the language).
Theoretically, you can use a linked list such as the one from DataStructures.jl to get O(1) appending. And then optionally write that out to a Vector afterwards (probably in reverse order, though).
In practice, push!ing to a Vector is often efficient enough -- Vectors use a doubling strategy to manage their dynamic size, which leads to amortized constant time and the advantage of contiguous memory access.
So you could try the linked list, but be sure to benchmark whether it's worth the effort.
Now, the above is about time complexity. When you care about allocation, the argument is quite similar; with a vector, you are going to end up with memory proportional to the next power of two after your actual requirements. Most often, that can be considered amortized, too.
The consensus on set.seed in R is that it effectively generates a long sequence of pseudo-random numbers, pre-determined by the seed. Then the first call you make to this sequence (with the first non-deterministic function you use) takes the first batch from that sequence, the second call takes the next batch, and so forth.
I am wondering what the limits to this are. Specifically, what happens when you get to the end of that long sequence? Let's say, after setting a seed, you then sample from the first 100 integers repeatedly. Would there come a point where you start generating the same samples (in the same order) as you were seeing at the beginning? How long would this take? (Does it depend on the seed?) If not, how would reaching the 'end' of the sequence and presumably circling back to the beginning manifest?
The ?RNGkind help page in R gives more details on the default random number generator, the "Mersenne Twister" algorithm:
"Mersenne-Twister": From Matsumoto and Nishimura (1998); code
updated in 2002. A twisted GFSR with period 2^19937 - 1 and
equidistribution in 623 consecutive dimensions (over the
whole period). The ‘seed’ is a 624-dimensional set of 32-bit
integers plus a current position in that set.
As stated there, the "period" (the length of time it takes to get back to the beginning and start repeating values) is 2^19937 - 1, or approximately 10^(19937/log2(10)) ≈ 10^6001.
If the size of your "batches" happened to line up exactly with the period, then you would indeed start getting the same batches again.
I'm not sure how many pseudorandom samples R uses to pick a sample of size 1 from a set. Ideally it would be only 1 (so your "batch size" would be 1), but it might be more depending on the generality/complexity of the sampling algorithm.
I know that runif() translates more or less directly from the PRNG, so a sequence of runif() calls would indeed repeat exactly.
For each of n queries I am given a number x and I have to print its factorial modulo 1000000007.
def fact_eff(n, d):
    if n in d:
        return d[n]
    else:
        ans = n * fact_eff(n - 1, d)
        d[n] = ans
        return ans

d = {0: 1}
n = int(input())
while n != 0:
    x = int(input())
    print(fact_eff(x, d) % 1000000007)
    n = n - 1
The problem is that x can be as large as 100000, and I get a runtime error for values greater than about 3000 because the maximum recursion depth is exceeded. Am I missing something with the modulus operator?
Why would you use recursion in the first place to compute a simple factorial? You can check the dictionary in a loop. Or better, start at the highest valid memoized position and go higher from there, creating new entries as you go.
To save space, maybe only record n! every 32 iterations or something, so future calls need at most 31 multiplies. Still O(1) but trading some computation for huge space savings.
Also, does it work to apply the modulus before you get the final huge product? Like every few multiply steps to keep the numbers small? Or every single step if that keeps the numbers small enough for CPython's single-limb fast path. I think (x * y) % n = ((x%n) * y) % n. (But I didn't double-check that.)
If so, you could combine early modulo with sparse memoization to memoize the final modulo-reduced result.
(For numbers above 2^30, Python BigInteger multiply cost should scale with number of 2^30 chunks required to represent the number. Fortunately one of the multiplicands is always small, being the counter. Keeping the product small buys speed, but division is expensive so it's a tradeoff. And doing any more operations costs Python interpreter overhead which may simply dominate anyway until numbers get really huge.)
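To make the idea concrete, here is a sketch of the loop-based approach with the modulus applied at every step, memoizing the already-reduced values (written in C here; in Python the same loop over a list or dict works and also sidesteps the recursion limit). The fixed bound MAX_X and the lazy-fill table are just illustrative; because the stored values are already reduced mod 1000000007 they are small, so a full table is used instead of the every-32 trick.

#include <stdio.h>

#define MOD   1000000007ULL
#define MAX_X 100000

/* fact[i] holds i! already reduced mod MOD; filled lazily up to the
   largest x seen so far, so earlier work is never repeated. */
static unsigned long long fact[MAX_X + 1];
static int filled = 0;                      /* highest index currently memoized */

static unsigned long long fact_mod(int x) {
    for (int i = filled + 1; i <= x; ++i)
        fact[i] = fact[i - 1] * i % MOD;    /* (a*b) % m == ((a%m)*b) % m */
    if (x > filled) filled = x;
    return fact[x];
}

int main(void) {
    int n;
    fact[0] = 1;
    if (scanf("%d", &n) != 1) return 1;
    while (n-- > 0) {
        int x;
        if (scanf("%d", &x) != 1) return 1;
        printf("%llu\n", fact_mod(x));      /* assumes 0 <= x <= MAX_X */
    }
    return 0;
}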
I am trying to figure out the difference between All-to-All Reduction and All-Reduce in Open MPI. From my understanding, All-to-One Reduction takes a piece m (integer, array, etc.) from all processes, combines all the pieces together with an operator (min, max, sum, etc.), and stores the result in the selected process. From this I assume that All-to-All Reduction is the same but the product is stored in all the processes instead of just one. From this document it seems like All-Reduce is basically doing the same as All-to-All Reduction; is this right, or am I getting it wrong?
The all-reduce (MPI_Allreduce) is a combined reduction and broadcast (MPI_Reduce, MPI_Bcast). They might have called it MPI_Reduce_Bcast. It is important to note that an MPI reduction does not reduce within a process's buffer; it combines the buffers element-wise across processes. So if you have 10 numbers on each of 5 processes, after an MPI_Reduce one process has 10 numbers. After MPI_Allreduce, all 5 processes have the same 10 numbers.
In contrast, the all-to-all reduction performs a reduction followed by a scatter, hence it is called MPI_Reduce_scatter[_block]. So if you have 10 numbers on each of 5 processes, after an MPI_Reduce_scatter_block the 5 processes have 2 numbers each. Note that MPI itself doesn't use the term all-to-all reduction, probably because the name is ambiguous.
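A small C illustration of the difference, using the 10-numbers-on-5-processes example (the values are arbitrary, and the example assumes exactly 5 processes):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size != 5) {                         /* the buffers below assume 5 processes */
        if (rank == 0) fprintf(stderr, "run with mpirun -np 5\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    int mine[10], summed[10], part[2];
    for (int i = 0; i < 10; ++i)
        mine[i] = rank + i;                  /* 10 numbers per process */

    /* Element-wise sum across processes, result replicated everywhere:
       every process ends up with the same 10 sums. */
    MPI_Allreduce(mine, summed, 10, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    /* Same element-wise sum, but the 10 sums are then scattered:
       with 5 processes, each one receives 2 of them. */
    MPI_Reduce_scatter_block(mine, part, 2, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d: allreduce[0]=%d, reduce_scatter part = %d %d\n",
           rank, summed[0], part[0], part[1]);
    MPI_Finalize();
    return 0;
}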
From what I have learned in my supercomputing class I know that MPI is a communicating (and data passing) interface.
I'm confused about what happens when you run a function in a C++ program and want each process to perform a specific task.
For example, a prime number search (very popular for supercomputers). Say I have a range of values (531-564, some arbitrary range) and say I have 50 processes on which I could run a series of evaluations for each number. If root (process 0) wants to examine 531, then knowing prime numbers I could use 8 processes (1-8) to evaluate its prime status: if the number is divisible by any number from 2 to 9 with a remainder of 0, it is not prime.
Is it possible in MPI, which passes data to each process, to have those processes perform these actions?
The hardest part for me is understanding that if I perform an action in the original C++ program, the work could be spread across several different processes; how can I structure this in MPI? Or is my understanding completely wrong? If so, how am I supposed to think about this correctly?
The big idea is passing data to a process versus sending a function to a process. I'm fairly certain I'm wrong, but I'm trying to backtrack to fix my thinking.
Each MPI process is running the same program, but that doesn't mean that they are doing the same thing. Different processes can be running different branches of the code, depending on the id (or "rank") of the process, and in effect be completely independent. Like any distributed computation, the actors do need to agree on how they will communicate.
The most basic strategy in MPI is scatter-gather, where the "master" process (usually the one with rank 0) will split an array of work equally amongst the peers (including the master process itself) by having them all call scatter, the peers will do the work, then all peers will call gather to send the results back to master.
In your prime algorithm example, build an array of integers, "scatter" it to all the peers, have each peer run through its portion saving 1 if the number is prime and 0 if it is not, then "gather" the results to master. [In this particular example, since the input data is completely predictable from the process rank, the scatter step is unnecessary, but we will do it anyway.]
As pseudo-code:
main():
    n = 100, k = number of processes in world
    int x[n], local[n/k], result[n/k]
    MPI_init()
    // prepare data on master
    if rank == 0:
        for i in 1 ... n, x[i] = i
    // send n/k numbers from x on root to local on each process in world
    // (scatter/gather counts are per process, not totals)
    MPI_scatter(x, n/k, int, local, n/k, int, root, world)
    for i in 1 ... n/k
        result[i] = 1 // assume prime
        if 2 divides local[i], result[i] = 0
        if 3 divides local[i], result[i] = 0
        if 5 divides local[i], result[i] = 0
        if 7 divides local[i], result[i] = 0
    // gather n/k results from local on each process in world to x on root
    MPI_gather(result, n/k, int, x, n/k, int, root, world)
    // print results
    if rank == 0:
        for i in 1 ... n, print i if x[i] == 1
    MPI_finalize()
There are lots of details to fill in, such as proper declarations, dealing with the fact that some ranks will have fewer elements than others, using proper C syntax, etc., but getting them right doesn't help explain the overall picture.
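Filled in with actual C calls (keeping the same deliberately crude divisibility test, and assuming the number of processes divides n evenly so every rank gets the same amount of work), the sketch looks roughly like this:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank, k;
    const int n = 100;                        /* total amount of work, as in the pseudo-code */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &k);

    int chunk = n / k;                        /* assumes k divides n evenly */
    int *x = NULL;
    int *local  = malloc(chunk * sizeof(int));
    int *result = malloc(chunk * sizeof(int));

    /* prepare data on master */
    if (rank == 0) {
        x = malloc(n * sizeof(int));
        for (int i = 0; i < n; ++i) x[i] = i + 1;
    }

    /* send chunk numbers from x on root to local on each process */
    MPI_Scatter(x, chunk, MPI_INT, local, chunk, MPI_INT, 0, MPI_COMM_WORLD);

    for (int i = 0; i < chunk; ++i) {
        result[i] = 1;                        /* assume prime */
        if (local[i] % 2 == 0) result[i] = 0;
        if (local[i] % 3 == 0) result[i] = 0;
        if (local[i] % 5 == 0) result[i] = 0;
        if (local[i] % 7 == 0) result[i] = 0;
    }

    /* gather chunk results from each process back into x on root */
    MPI_Gather(result, chunk, MPI_INT, x, chunk, MPI_INT, 0, MPI_COMM_WORLD);

    /* print results */
    if (rank == 0) {
        for (int i = 0; i < n; ++i)
            if (x[i] == 1) printf("%d\n", i + 1);
        free(x);
    }
    free(local);
    free(result);
    MPI_Finalize();
    return 0;
}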
More fine-grained synchronization and communication is possible using direct send/recv between processes. Such programs are harder to write since the different processes may be in different states. In particular, it is important that if process a is calling MPI_send to process b, then process b had better be calling MPI_recv from a.
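For instance, a minimal matched pair in C (rank 0 sends one int to rank 1, which posts the corresponding receive; the tag and count are arbitrary, and at least two processes are needed):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* Root sends one int to rank 1; rank 1 must post a matching receive,
           or the program can deadlock. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}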